This document describes the SZDD and KWAJ file formats, which are implemented by the MS-DOS commands COMPRESS.EXE and EXPAND.EXE.
Both formats compress a single file to another single file, replacing the last character in the filename with an underscore or dollar character, e.g. README.TXT becomes README.TX_ or README.TX$.
An SZDD file begins with this fixed header:
Offset | Length | Description |
---|---|---|
0x00 | 8 | "SZDD" signature: 0x53,0x5A,0x44,0x44,0x88,0xF0,0x27,0x33 |
0x08 | 1 | Compression mode: only "A" (0x41) is valid here |
0x09 | 1 | The character missing from the end of the filename (0=unknown) |
0x0A | 4 | The integer length of the file when unpacked |
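As an illustration of the header table above, here is a minimal C sketch that validates the 14-byte SZDD header and restores the original filename. The struct and function names are hypothetical, and the 4-byte length is assumed to be little-endian (Intel byte order), which the table does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct; the field names are illustrative only. */
struct szdd_header {
    char     mode;          /* compression mode, expected to be 'A' */
    char     missing_char;  /* 0 if unknown */
    uint32_t unpacked_len;  /* length of the file when unpacked */
};

/* Returns 0 on success, -1 if buf does not start with a valid SZDD header. */
int szdd_parse_header(const unsigned char *buf, size_t len,
                      struct szdd_header *out)
{
    static const unsigned char sig[8] =
        {0x53, 0x5A, 0x44, 0x44, 0x88, 0xF0, 0x27, 0x33};

    assert(buf != NULL && out != NULL);
    if (len < 14 || memcmp(buf, sig, 8) != 0) return -1;
    if (buf[8] != 0x41) return -1;          /* only mode "A" is valid */

    out->mode = (char)buf[8];
    out->missing_char = (char)buf[9];
    /* ASSUMPTION: the length is stored least significant byte first. */
    out->unpacked_len = (uint32_t)buf[10]
                      | ((uint32_t)buf[11] << 8)
                      | ((uint32_t)buf[12] << 16)
                      | ((uint32_t)buf[13] << 24);
    return 0;
}

/* Replaces the trailing '_' or '$' of name with the stored character. */
void szdd_restore_name(char *name, char missing_char)
{
    size_t n = strlen(name);
    if (n > 0 && missing_char != 0) name[n - 1] = missing_char;
}
```

If the stored character is 0 (unknown), the sketch leaves the filename unchanged and the caller has to pick a name some other way.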
The header is immediately followed by the compressed data. The following pseudocode explains how to unpack this data; it's a form of the LZSS algorithm.
    unsigned char window[4096];
    int pos = 4096 - 16;
    memset(window, 0x20, 4096); /* window initially full of spaces */
    for (;;) {
        int control = GETBYTE();
        if (control == EOF) break; /* exit if no more to read */
        for (int cbit = 0x01; cbit & 0xFF; cbit <<= 1) {
            if (control & cbit) {
                /* literal */
                int c = GETBYTE();
                if (c == EOF) break; /* input ended inside this group */
                PUTBYTE(window[pos++] = c);
                pos &= 4095; /* wrap around, as in the match case */
            }
            else {
                /* match */
                int matchpos = GETBYTE();
                int matchlen = GETBYTE();
                if (matchlen == EOF) break; /* input ended inside this group */
                matchpos |= (matchlen & 0xF0) << 4;
                matchlen = (matchlen & 0x0F) + 3;
                while (matchlen--) {
                    PUTBYTE(window[pos++] = window[matchpos++]);
                    pos &= 4095; matchpos &= 4095;
                }
            }
        }
    }
There is also a variant SZDD format seen in the installation package for QBasic 4.5, so I call it the QBasic variant. It has a different header and the pos variable in the pseudocode above is set to 4096-18 instead of 4096-16.
Offset | Length | Description |
---|---|---|
0x00 | 8 | "SZ" signature: 0x53,0x5A,0x20,0x88,0xF0,0x27,0x33,0xD1 |
0x08 | 4 | The integer length of the file when unpacked |
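Since the two SZDD variants differ only in the header and the initial window position, one decoder can serve both. The following is a sketch of the LZSS pseudocode above, rewritten to work on memory buffers instead of GETBYTE and PUTBYTE; the function name and the `start_pos` parameter are hypothetical. Pass 4096-16 for regular SZDD data and 4096-18 for the QBasic variant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_SIZE 4096

/* Unpacks LZSS data from in[] to out[] and returns the number of bytes
 * written. Stops when the input is exhausted or out[] is full. */
size_t szdd_unpack(const unsigned char *in, size_t in_len,
                   unsigned char *out, size_t out_cap, int start_pos)
{
    unsigned char window[WINDOW_SIZE];
    size_t ip = 0, op = 0;
    int pos = start_pos;

    assert(start_pos >= 0 && start_pos < WINDOW_SIZE);
    memset(window, 0x20, sizeof window);   /* window starts full of spaces */

    while (ip < in_len) {
        int control = in[ip++];
        for (int cbit = 0x01; cbit & 0xFF; cbit <<= 1) {
            if (control & cbit) {
                /* literal: one byte copied to the output and the window */
                if (ip >= in_len || op >= out_cap) return op;
                out[op++] = window[pos] = in[ip++];
                pos = (pos + 1) & (WINDOW_SIZE - 1);
            } else {
                /* match: 12-bit absolute window position, 4-bit length */
                if (in_len - ip < 2) return op;
                int matchpos = in[ip++];
                int matchlen = in[ip++];
                matchpos |= (matchlen & 0xF0) << 4;
                matchlen = (matchlen & 0x0F) + 3;
                while (matchlen--) {
                    if (op >= out_cap) return op;
                    out[op++] = window[pos] = window[matchpos];
                    pos = (pos + 1) & (WINDOW_SIZE - 1);
                    matchpos = (matchpos + 1) & (WINDOW_SIZE - 1);
                }
            }
        }
    }
    return op;
}
```

A caller would typically set `out_cap` to the unpacked length from the header, so that unused bits in the final control byte cannot produce extra output.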
A KWAJ file begins with this fixed header:
Offset | Length | Description |
---|---|---|
0x00 | 8 | "KWAJ" signature: 0x4B,0x57,0x41,0x4A,0x88,0xF0,0x27,0xD1 |
0x08 | 2 | compression method (0-4) |
0x0A | 2 | file offset of compressed data |
0x0C | 2 | header flags to mark header extensions |
The "compression method" field indicates the type of data compression used:
Header extensions immediately follow the header.
If you don't care about the header extensions, use the file offset to skip to the compressed data.
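The skip described above can be sketched in C as follows. The struct and function names are hypothetical, and the three 16-bit fields are assumed to be little-endian. The sketch reads the fixed 14-byte header and uses the stored offset to locate the compressed data, ignoring any header extensions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct; the field names are illustrative only. */
struct kwaj_header {
    uint16_t method;       /* compression method, 0-4 */
    uint16_t data_offset;  /* file offset of the compressed data */
    uint16_t flags;        /* marks which header extensions are present */
};

/* ASSUMPTION: 16-bit fields are stored least significant byte first. */
static uint16_t rd16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns a pointer to the compressed data inside buf, or NULL if the
 * header is not a plausible KWAJ header. */
const unsigned char *kwaj_find_data(const unsigned char *buf, size_t len,
                                    struct kwaj_header *out)
{
    static const unsigned char sig[8] =
        {0x4B, 0x57, 0x41, 0x4A, 0x88, 0xF0, 0x27, 0xD1};

    assert(buf != NULL && out != NULL);
    if (len < 14 || memcmp(buf, sig, 8) != 0) return NULL;

    out->method = rd16le(buf + 8);
    out->data_offset = rd16le(buf + 10);
    out->flags = rd16le(buf + 12);

    if (out->method > 4) return NULL;
    /* The data cannot start inside the fixed header or past the buffer. */
    if (out->data_offset < 14 || out->data_offset > len) return NULL;
    return buf + out->data_offset;
}
```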
The header extensions appear in this order:
Compression method 3 is unique to the KWAJ format. It's an LZ+Huffman algorithm created by Jeff Johnson.
Bits are always read from MSB to LSB, one byte at a time.
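To make the bit order concrete, here is a small MSB-first bit reader sketched in C. The struct and function names are hypothetical. It also assumes that, in a multi-bit value, the first bit read is the most significant one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader state; the names are illustrative only. */
struct bitreader {
    const unsigned char *data;
    size_t len;
    size_t byte;   /* index of the current byte */
    int bit;       /* bits already consumed from that byte, 0..7 */
};

/* Returns the next bit (0 or 1), or -1 when the input is exhausted. */
int br_read_bit(struct bitreader *br)
{
    assert(br != NULL && br->bit >= 0 && br->bit < 8);
    if (br->byte >= br->len) return -1;
    /* Bit 7 of each byte is consumed first, bit 0 last. */
    int b = (br->data[br->byte] >> (7 - br->bit)) & 1;
    if (++br->bit == 8) { br->bit = 0; br->byte++; }
    return b;
}

/* Reads n bits (0 <= n <= 16) and returns them as one value, or -1 at
 * end of input. ASSUMPTION: the first bit read is the most significant. */
long br_read_bits(struct bitreader *br, int n)
{
    long v = 0;
    assert(n >= 0 && n <= 16);
    while (n--) {
        int b = br_read_bit(br);
        if (b < 0) return -1;
        v = (v << 1) | b;
    }
    return v;
}
```

For example, reading two 4-bit values from the single byte 0xA5 with this reader yields 0xA and then 0x5.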
There are three parts:
KWAJ uses 5 Huffman trees. They always have the same number of symbols in them. They are, in order:
Canonical Huffman codes are used, which means you only need to know how many symbols are in each Huffman tree (given above) and how long each symbol's code is.
How the symbol lengths are encoded depends on the encoding type, as given by the 6 nybbles at the start of the compressed data.
Symbol lengths are read in ascending order, and the number of symbols to read is implied by which tree you're defining.
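Once the symbol lengths are known, the codes themselves follow mechanically. The sketch below uses the conventional canonical assignment, the same one DEFLATE uses: shorter codes come first, and symbols of equal length get consecutive codes in ascending symbol order. That KWAJ follows exactly this convention is an assumption here, and `MAX_LEN` and the function name are hypothetical.

```c
#include <assert.h>

#define MAX_LEN 16   /* hypothetical upper bound on a code length */

/* Fills codes[i] with the canonical code for symbol i, given its length
 * in bits. Symbols with length 0 are unused and get code 0. */
void canonical_codes(const unsigned char *lengths, int nsyms,
                     unsigned *codes)
{
    unsigned count[MAX_LEN + 1] = {0};  /* symbols per code length */
    unsigned next[MAX_LEN + 1] = {0};   /* next free code per length */
    unsigned code = 0;

    for (int i = 0; i < nsyms; i++) {
        assert(lengths[i] <= MAX_LEN);
        count[lengths[i]]++;
    }
    count[0] = 0;                       /* unused symbols take no codes */

    /* The first code of each length follows on from the shorter ones. */
    for (int len = 1; len <= MAX_LEN; len++) {
        code = (code + count[len - 1]) << 1;
        next[len] = code;
    }

    /* Hand out codes in ascending symbol order within each length. */
    for (int i = 0; i < nsyms; i++)
        codes[i] = lengths[i] ? next[lengths[i]]++ : 0;
}
```

With the lengths 3, 3, 3, 3, 3, 2, 4, 4, this assigns the codes 010, 011, 100, 101, 110, 00, 1110 and 1111.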
At this point, the compressed data begins.
We have a 4096 byte ring buffer, initially filled with byte 0x20 (ASCII space). Unlike the SZDD format, the starting position in the buffer is irrelevant, as match positions are stored relative to the current position in the window, not as absolute positions in the window.
Pseudo-code:
    ring buffer position = 4096-17
    selected table = MATCHLEN
    LOOP:
        code = read huffman code using selected table (MATCHLEN or MATCHLEN2)
        if EOF reached, exit loop
        if code > 0, this is a match:
            match length = code + 2
            x = read huffman code using OFFSET table
            y = read 6 bits
            match offset = current ring buffer position - (x<<6 | y)
            copy match as output and into the ring buffer
            selected table = MATCHLEN
        if code == 0, this is a run of literals:
            x = read huffman code using LITLEN table
            if x != 31, selected table = MATCHLEN2
            read {x+1} literals using LITERAL huffman table, copy as output and into the ring buffer
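The match step of the pseudo-code above can be sketched in C as follows. This covers only the ring buffer handling and takes the already-decoded values `code`, `x` and `y` as parameters; Huffman decoding is left out. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4096

/* Hypothetical ring buffer state; the names are illustrative only. */
struct ring {
    unsigned char buf[RING_SIZE];
    unsigned pos;                /* current write position */
};

void ring_init(struct ring *r)
{
    memset(r->buf, 0x20, RING_SIZE);   /* initially filled with spaces */
    r->pos = RING_SIZE - 17;
}

/* Stores one byte in the ring and in the output; returns the new
 * output pointer. Used for literals and for each byte of a match. */
unsigned char *ring_put(struct ring *r, unsigned char c, unsigned char *out)
{
    r->buf[r->pos] = c;
    r->pos = (r->pos + 1) & (RING_SIZE - 1);
    *out = c;
    return out + 1;
}

/* Copies a match: length is code + 2, and the source lies (x<<6 | y)
 * bytes behind the current position. The caller must make sure out
 * has room for code + 2 bytes. */
unsigned char *ring_copy_match(struct ring *r, unsigned code,
                               unsigned x, unsigned y, unsigned char *out)
{
    assert(code > 0 && x < 64 && y < 64);
    unsigned length = code + 2;
    /* Unsigned wrap-around plus the mask keeps src inside the ring. */
    unsigned src = (r->pos - ((x << 6) | y)) & (RING_SIZE - 1);

    while (length--) {
        out = ring_put(r, r->buf[src], out);
        src = (src + 1) & (RING_SIZE - 1);
    }
    return out;
}
```

Because the source is computed relative to `pos`, changing the initial position in `ring_init` would not change the output, which matches the remark that the starting position is irrelevant.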
Offset | Length | Description |
---|---|---|
0 | 2 | Compressed length of this block (n). Stored in Intel byte order. Doesn't include these two bytes. |
2 | 2 | "CK" in ASCII (0x43, 0x4B) |
4 | n-2 | Data compressed in DEFLATE format |
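The block header in the table above can be checked with a few lines of C. This is only a sketch with a hypothetical function name: it validates the length and the "CK" marker and hands back the DEFLATE payload, leaving the actual inflation to a DEFLATE library.

```c
#include <assert.h>
#include <stddef.h>

/* Examines one block at buf. On success, stores a pointer to the DEFLATE
 * data in *data and returns its length (n - 2). Returns -1 if the block
 * is truncated or the "CK" marker is missing. The next block, if any,
 * starts at buf + n + 2. */
long ck_block_payload(const unsigned char *buf, size_t len,
                      const unsigned char **data)
{
    assert(buf != NULL && data != NULL);
    if (len < 4) return -1;

    /* Compressed length n, Intel byte order, excluding these two bytes. */
    unsigned n = (unsigned)buf[0] | ((unsigned)buf[1] << 8);

    if (n < 2 || (size_t)n + 2 > len) return -1;     /* truncated */
    if (buf[2] != 0x43 || buf[3] != 0x4B) return -1; /* not "CK" */

    *data = buf + 4;
    return (long)n - 2;
}
```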