1205194Sdelphij 2205194Sdelphij 3205194Sdelphij 4205194Sdelphij 5205194Sdelphij 6205194Sdelphij 7205194SdelphijNetwork Working Group P. Deutsch 8205194SdelphijRequest for Comments: 1951 Aladdin Enterprises 9205194SdelphijCategory: Informational May 1996 10205194Sdelphij 11205194Sdelphij 12205194Sdelphij DEFLATE Compressed Data Format Specification version 1.3 13205194Sdelphij 14205194SdelphijStatus of This Memo 15205194Sdelphij 16205194Sdelphij This memo provides information for the Internet community. This memo 17205194Sdelphij does not specify an Internet standard of any kind. Distribution of 18205194Sdelphij this memo is unlimited. 19205194Sdelphij 20205194SdelphijIESG Note: 21205194Sdelphij 22205194Sdelphij The IESG takes no position on the validity of any Intellectual 23205194Sdelphij Property Rights statements contained in this document. 24205194Sdelphij 25205194SdelphijNotices 26205194Sdelphij 27205194Sdelphij Copyright (c) 1996 L. Peter Deutsch 28205194Sdelphij 29205194Sdelphij Permission is granted to copy and distribute this document for any 30205194Sdelphij purpose and without charge, including translations into other 31205194Sdelphij languages and incorporation into compilations, provided that the 32205194Sdelphij copyright notice and this notice are preserved, and that any 33205194Sdelphij substantive changes or deletions from the original are clearly 34205194Sdelphij marked. 35205194Sdelphij 36205194Sdelphij A pointer to the latest version of this and related documentation in 37205194Sdelphij HTML format can be found at the URL 38205194Sdelphij <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. 39205194Sdelphij 40205194SdelphijAbstract 41205194Sdelphij 42205194Sdelphij This specification defines a lossless compressed data format that 43205194Sdelphij compresses data using a combination of the LZ77 algorithm and Huffman 44205194Sdelphij coding, with efficiency comparable to the best currently available 45205194Sdelphij general-purpose compression methods. The data can be produced or 46205194Sdelphij consumed, even for an arbitrarily long sequentially presented input 47205194Sdelphij data stream, using only an a priori bounded amount of intermediate 48205194Sdelphij storage. The format can be implemented readily in a manner not 49205194Sdelphij covered by patents. 50205194Sdelphij 51205194Sdelphij 52205194Sdelphij 53205194Sdelphij 54205194Sdelphij 55205194Sdelphij 56205194Sdelphij 57205194Sdelphij 58205194SdelphijDeutsch Informational [Page 1] 59205194Sdelphij 60205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 61205194Sdelphij 62205194Sdelphij 63205194SdelphijTable of Contents 64205194Sdelphij 65205194Sdelphij 1. Introduction ................................................... 2 66205194Sdelphij 1.1. Purpose ................................................... 2 67205194Sdelphij 1.2. Intended audience ......................................... 3 68205194Sdelphij 1.3. Scope ..................................................... 3 69205194Sdelphij 1.4. Compliance ................................................ 3 70205194Sdelphij 1.5. Definitions of terms and conventions used ................ 3 71205194Sdelphij 1.6. Changes from previous versions ............................ 4 72205194Sdelphij 2. Compressed representation overview ............................. 4 73205194Sdelphij 3. Detailed specification ......................................... 5 74205194Sdelphij 3.1. Overall conventions ....................................... 5 75205194Sdelphij 3.1.1. Packing into bytes .................................. 5 76205194Sdelphij 3.2. Compressed block format ................................... 6 77205194Sdelphij 3.2.1. Synopsis of prefix and Huffman coding ............... 6 78205194Sdelphij 3.2.2. Use of Huffman coding in the "deflate" format ....... 7 79205194Sdelphij 3.2.3. Details of block format ............................. 9 80205194Sdelphij 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11 81205194Sdelphij 3.2.5. Compressed blocks (length and distance codes) ...... 11 82205194Sdelphij 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12 83205194Sdelphij 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13 84205194Sdelphij 3.3. Compliance ............................................... 14 85205194Sdelphij 4. Compression algorithm details ................................. 14 86205194Sdelphij 5. References .................................................... 16 87205194Sdelphij 6. Security Considerations ....................................... 16 88205194Sdelphij 7. Source code ................................................... 16 89205194Sdelphij 8. Acknowledgements .............................................. 16 90205194Sdelphij 9. Author's Address .............................................. 17 91205194Sdelphij 92205194Sdelphij1. Introduction 93205194Sdelphij 94205194Sdelphij 1.1. Purpose 95205194Sdelphij 96205194Sdelphij The purpose of this specification is to define a lossless 97205194Sdelphij compressed data format that: 98205194Sdelphij * Is independent of CPU type, operating system, file system, 99205194Sdelphij and character set, and hence can be used for interchange; 100205194Sdelphij * Can be produced or consumed, even for an arbitrarily long 101205194Sdelphij sequentially presented input data stream, using only an a 102205194Sdelphij priori bounded amount of intermediate storage, and hence 103205194Sdelphij can be used in data communications or similar structures 104205194Sdelphij such as Unix filters; 105205194Sdelphij * Compresses data with efficiency comparable to the best 106205194Sdelphij currently available general-purpose compression methods, 107205194Sdelphij and in particular considerably better than the "compress" 108205194Sdelphij program; 109205194Sdelphij * Can be implemented readily in a manner not covered by 110205194Sdelphij patents, and hence can be practiced freely; 111205194Sdelphij 112205194Sdelphij 113205194Sdelphij 114205194SdelphijDeutsch Informational [Page 2] 115205194Sdelphij 116205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 117205194Sdelphij 118205194Sdelphij 119205194Sdelphij * Is compatible with the file format produced by the current 120205194Sdelphij widely used gzip utility, in that conforming decompressors 121205194Sdelphij will be able to read data produced by the existing gzip 122205194Sdelphij compressor. 123205194Sdelphij 124205194Sdelphij The data format defined by this specification does not attempt to: 125205194Sdelphij 126205194Sdelphij * Allow random access to compressed data; 127205194Sdelphij * Compress specialized data (e.g., raster graphics) as well 128205194Sdelphij as the best currently available specialized algorithms. 129205194Sdelphij 130205194Sdelphij A simple counting argument shows that no lossless compression 131205194Sdelphij algorithm can compress every possible input data set. For the 132205194Sdelphij format defined here, the worst case expansion is 5 bytes per 32K- 133205194Sdelphij byte block, i.e., a size increase of 0.015% for large data sets. 134205194Sdelphij English text usually compresses by a factor of 2.5 to 3; 135205194Sdelphij executable files usually compress somewhat less; graphical data 136205194Sdelphij such as raster images may compress much more. 137205194Sdelphij 138205194Sdelphij 1.2. Intended audience 139205194Sdelphij 140205194Sdelphij This specification is intended for use by implementors of software 141205194Sdelphij to compress data into "deflate" format and/or decompress data from 142205194Sdelphij "deflate" format. 143205194Sdelphij 144205194Sdelphij The text of the specification assumes a basic background in 145205194Sdelphij programming at the level of bits and other primitive data 146205194Sdelphij representations. Familiarity with the technique of Huffman coding 147205194Sdelphij is helpful but not required. 148205194Sdelphij 149205194Sdelphij 1.3. Scope 150205194Sdelphij 151205194Sdelphij The specification specifies a method for representing a sequence 152205194Sdelphij of bytes as a (usually shorter) sequence of bits, and a method for 153205194Sdelphij packing the latter bit sequence into bytes. 154205194Sdelphij 155205194Sdelphij 1.4. Compliance 156205194Sdelphij 157205194Sdelphij Unless otherwise indicated below, a compliant decompressor must be 158205194Sdelphij able to accept and decompress any data set that conforms to all 159205194Sdelphij the specifications presented here; a compliant compressor must 160205194Sdelphij produce data sets that conform to all the specifications presented 161205194Sdelphij here. 162205194Sdelphij 163205194Sdelphij 1.5. Definitions of terms and conventions used 164205194Sdelphij 165205194Sdelphij Byte: 8 bits stored or transmitted as a unit (same as an octet). 166205194Sdelphij For this specification, a byte is exactly 8 bits, even on machines 167205194Sdelphij 168205194Sdelphij 169205194Sdelphij 170205194SdelphijDeutsch Informational [Page 3] 171205194Sdelphij 172205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 173205194Sdelphij 174205194Sdelphij 175205194Sdelphij which store a character on a number of bits different from eight. 176205194Sdelphij See below, for the numbering of bits within a byte. 177205194Sdelphij 178205194Sdelphij String: a sequence of arbitrary bytes. 179205194Sdelphij 180205194Sdelphij 1.6. Changes from previous versions 181205194Sdelphij 182205194Sdelphij There have been no technical changes to the deflate format since 183205194Sdelphij version 1.1 of this specification. In version 1.2, some 184205194Sdelphij terminology was changed. Version 1.3 is a conversion of the 185205194Sdelphij specification to RFC style. 186205194Sdelphij 187205194Sdelphij2. Compressed representation overview 188205194Sdelphij 189205194Sdelphij A compressed data set consists of a series of blocks, corresponding 190205194Sdelphij to successive blocks of input data. The block sizes are arbitrary, 191205194Sdelphij except that non-compressible blocks are limited to 65,535 bytes. 192205194Sdelphij 193205194Sdelphij Each block is compressed using a combination of the LZ77 algorithm 194205194Sdelphij and Huffman coding. The Huffman trees for each block are independent 195205194Sdelphij of those for previous or subsequent blocks; the LZ77 algorithm may 196205194Sdelphij use a reference to a duplicated string occurring in a previous block, 197205194Sdelphij up to 32K input bytes before. 198205194Sdelphij 199205194Sdelphij Each block consists of two parts: a pair of Huffman code trees that 200205194Sdelphij describe the representation of the compressed data part, and a 201205194Sdelphij compressed data part. (The Huffman trees themselves are compressed 202205194Sdelphij using Huffman encoding.) The compressed data consists of a series of 203205194Sdelphij elements of two types: literal bytes (of strings that have not been 204205194Sdelphij detected as duplicated within the previous 32K input bytes), and 205205194Sdelphij pointers to duplicated strings, where a pointer is represented as a 206205194Sdelphij pair <length, backward distance>. The representation used in the 207205194Sdelphij "deflate" format limits distances to 32K bytes and lengths to 258 208205194Sdelphij bytes, but does not limit the size of a block, except for 209205194Sdelphij uncompressible blocks, which are limited as noted above. 210205194Sdelphij 211205194Sdelphij Each type of value (literals, distances, and lengths) in the 212205194Sdelphij compressed data is represented using a Huffman code, using one code 213205194Sdelphij tree for literals and lengths and a separate code tree for distances. 214205194Sdelphij The code trees for each block appear in a compact form just before 215205194Sdelphij the compressed data for that block. 216205194Sdelphij 217205194Sdelphij 218205194Sdelphij 219205194Sdelphij 220205194Sdelphij 221205194Sdelphij 222205194Sdelphij 223205194Sdelphij 224205194Sdelphij 225205194Sdelphij 226205194SdelphijDeutsch Informational [Page 4] 227205194Sdelphij 228205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 229205194Sdelphij 230205194Sdelphij 231205194Sdelphij3. Detailed specification 232205194Sdelphij 233205194Sdelphij 3.1. Overall conventions In the diagrams below, a box like this: 234205194Sdelphij 235205194Sdelphij +---+ 236205194Sdelphij | | <-- the vertical bars might be missing 237205194Sdelphij +---+ 238205194Sdelphij 239205194Sdelphij represents one byte; a box like this: 240205194Sdelphij 241205194Sdelphij +==============+ 242205194Sdelphij | | 243205194Sdelphij +==============+ 244205194Sdelphij 245205194Sdelphij represents a variable number of bytes. 246205194Sdelphij 247205194Sdelphij Bytes stored within a computer do not have a "bit order", since 248205194Sdelphij they are always treated as a unit. However, a byte considered as 249205194Sdelphij an integer between 0 and 255 does have a most- and least- 250205194Sdelphij significant bit, and since we write numbers with the most- 251205194Sdelphij significant digit on the left, we also write bytes with the most- 252205194Sdelphij significant bit on the left. In the diagrams below, we number the 253205194Sdelphij bits of a byte so that bit 0 is the least-significant bit, i.e., 254205194Sdelphij the bits are numbered: 255205194Sdelphij 256205194Sdelphij +--------+ 257205194Sdelphij |76543210| 258205194Sdelphij +--------+ 259205194Sdelphij 260205194Sdelphij Within a computer, a number may occupy multiple bytes. All 261205194Sdelphij multi-byte numbers in the format described here are stored with 262205194Sdelphij the least-significant byte first (at the lower memory address). 263205194Sdelphij For example, the decimal number 520 is stored as: 264205194Sdelphij 265205194Sdelphij 0 1 266205194Sdelphij +--------+--------+ 267205194Sdelphij |00001000|00000010| 268205194Sdelphij +--------+--------+ 269205194Sdelphij ^ ^ 270205194Sdelphij | | 271205194Sdelphij | + more significant byte = 2 x 256 272205194Sdelphij + less significant byte = 8 273205194Sdelphij 274205194Sdelphij 3.1.1. Packing into bytes 275205194Sdelphij 276205194Sdelphij This document does not address the issue of the order in which 277205194Sdelphij bits of a byte are transmitted on a bit-sequential medium, 278205194Sdelphij since the final data format described here is byte- rather than 279205194Sdelphij 280205194Sdelphij 281205194Sdelphij 282205194SdelphijDeutsch Informational [Page 5] 283205194Sdelphij 284205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 285205194Sdelphij 286205194Sdelphij 287205194Sdelphij bit-oriented. However, we describe the compressed block format 288205194Sdelphij in below, as a sequence of data elements of various bit 289205194Sdelphij lengths, not a sequence of bytes. We must therefore specify 290205194Sdelphij how to pack these data elements into bytes to form the final 291205194Sdelphij compressed byte sequence: 292205194Sdelphij 293205194Sdelphij * Data elements are packed into bytes in order of 294205194Sdelphij increasing bit number within the byte, i.e., starting 295205194Sdelphij with the least-significant bit of the byte. 296205194Sdelphij * Data elements other than Huffman codes are packed 297205194Sdelphij starting with the least-significant bit of the data 298205194Sdelphij element. 299205194Sdelphij * Huffman codes are packed starting with the most- 300205194Sdelphij significant bit of the code. 301205194Sdelphij 302205194Sdelphij In other words, if one were to print out the compressed data as 303205194Sdelphij a sequence of bytes, starting with the first byte at the 304205194Sdelphij *right* margin and proceeding to the *left*, with the most- 305205194Sdelphij significant bit of each byte on the left as usual, one would be 306205194Sdelphij able to parse the result from right to left, with fixed-width 307205194Sdelphij elements in the correct MSB-to-LSB order and Huffman codes in 308205194Sdelphij bit-reversed order (i.e., with the first bit of the code in the 309205194Sdelphij relative LSB position). 310205194Sdelphij 311205194Sdelphij 3.2. Compressed block format 312205194Sdelphij 313205194Sdelphij 3.2.1. Synopsis of prefix and Huffman coding 314205194Sdelphij 315205194Sdelphij Prefix coding represents symbols from an a priori known 316205194Sdelphij alphabet by bit sequences (codes), one code for each symbol, in 317205194Sdelphij a manner such that different symbols may be represented by bit 318205194Sdelphij sequences of different lengths, but a parser can always parse 319205194Sdelphij an encoded string unambiguously symbol-by-symbol. 320205194Sdelphij 321205194Sdelphij We define a prefix code in terms of a binary tree in which the 322205194Sdelphij two edges descending from each non-leaf node are labeled 0 and 323205194Sdelphij 1 and in which the leaf nodes correspond one-for-one with (are 324205194Sdelphij labeled with) the symbols of the alphabet; then the code for a 325205194Sdelphij symbol is the sequence of 0's and 1's on the edges leading from 326205194Sdelphij the root to the leaf labeled with that symbol. For example: 327205194Sdelphij 328205194Sdelphij 329205194Sdelphij 330205194Sdelphij 331205194Sdelphij 332205194Sdelphij 333205194Sdelphij 334205194Sdelphij 335205194Sdelphij 336205194Sdelphij 337205194Sdelphij 338205194SdelphijDeutsch Informational [Page 6] 339205194Sdelphij 340205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 341205194Sdelphij 342205194Sdelphij 343205194Sdelphij /\ Symbol Code 344205194Sdelphij 0 1 ------ ---- 345205194Sdelphij / \ A 00 346205194Sdelphij /\ B B 1 347205194Sdelphij 0 1 C 011 348205194Sdelphij / \ D 010 349205194Sdelphij A /\ 350205194Sdelphij 0 1 351205194Sdelphij / \ 352205194Sdelphij D C 353205194Sdelphij 354205194Sdelphij A parser can decode the next symbol from an encoded input 355205194Sdelphij stream by walking down the tree from the root, at each step 356205194Sdelphij choosing the edge corresponding to the next input bit. 357205194Sdelphij 358205194Sdelphij Given an alphabet with known symbol frequencies, the Huffman 359205194Sdelphij algorithm allows the construction of an optimal prefix code 360205194Sdelphij (one which represents strings with those symbol frequencies 361205194Sdelphij using the fewest bits of any possible prefix codes for that 362205194Sdelphij alphabet). Such a code is called a Huffman code. (See 363205194Sdelphij reference [1] in Chapter 5, references for additional 364205194Sdelphij information on Huffman codes.) 365205194Sdelphij 366205194Sdelphij Note that in the "deflate" format, the Huffman codes for the 367205194Sdelphij various alphabets must not exceed certain maximum code lengths. 368205194Sdelphij This constraint complicates the algorithm for computing code 369205194Sdelphij lengths from symbol frequencies. Again, see Chapter 5, 370205194Sdelphij references for details. 371205194Sdelphij 372205194Sdelphij 3.2.2. Use of Huffman coding in the "deflate" format 373205194Sdelphij 374205194Sdelphij The Huffman codes used for each alphabet in the "deflate" 375205194Sdelphij format have two additional rules: 376205194Sdelphij 377205194Sdelphij * All codes of a given bit length have lexicographically 378205194Sdelphij consecutive values, in the same order as the symbols 379205194Sdelphij they represent; 380205194Sdelphij 381205194Sdelphij * Shorter codes lexicographically precede longer codes. 382205194Sdelphij 383205194Sdelphij 384205194Sdelphij 385205194Sdelphij 386205194Sdelphij 387205194Sdelphij 388205194Sdelphij 389205194Sdelphij 390205194Sdelphij 391205194Sdelphij 392205194Sdelphij 393205194Sdelphij 394205194SdelphijDeutsch Informational [Page 7] 395205194Sdelphij 396205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 397205194Sdelphij 398205194Sdelphij 399205194Sdelphij We could recode the example above to follow this rule as 400205194Sdelphij follows, assuming that the order of the alphabet is ABCD: 401205194Sdelphij 402205194Sdelphij Symbol Code 403205194Sdelphij ------ ---- 404205194Sdelphij A 10 405205194Sdelphij B 0 406205194Sdelphij C 110 407205194Sdelphij D 111 408205194Sdelphij 409205194Sdelphij I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are 410205194Sdelphij lexicographically consecutive. 411205194Sdelphij 412205194Sdelphij Given this rule, we can define the Huffman code for an alphabet 413205194Sdelphij just by giving the bit lengths of the codes for each symbol of 414205194Sdelphij the alphabet in order; this is sufficient to determine the 415205194Sdelphij actual codes. In our example, the code is completely defined 416205194Sdelphij by the sequence of bit lengths (2, 1, 3, 3). The following 417205194Sdelphij algorithm generates the codes as integers, intended to be read 418205194Sdelphij from most- to least-significant bit. The code lengths are 419205194Sdelphij initially in tree[I].Len; the codes are produced in 420205194Sdelphij tree[I].Code. 421205194Sdelphij 422205194Sdelphij 1) Count the number of codes for each code length. Let 423205194Sdelphij bl_count[N] be the number of codes of length N, N >= 1. 424205194Sdelphij 425205194Sdelphij 2) Find the numerical value of the smallest code for each 426205194Sdelphij code length: 427205194Sdelphij 428205194Sdelphij code = 0; 429205194Sdelphij bl_count[0] = 0; 430205194Sdelphij for (bits = 1; bits <= MAX_BITS; bits++) { 431205194Sdelphij code = (code + bl_count[bits-1]) << 1; 432205194Sdelphij next_code[bits] = code; 433205194Sdelphij } 434205194Sdelphij 435205194Sdelphij 3) Assign numerical values to all codes, using consecutive 436205194Sdelphij values for all codes of the same length with the base 437205194Sdelphij values determined at step 2. Codes that are never used 438205194Sdelphij (which have a bit length of zero) must not be assigned a 439205194Sdelphij value. 440205194Sdelphij 441205194Sdelphij for (n = 0; n <= max_code; n++) { 442205194Sdelphij len = tree[n].Len; 443205194Sdelphij if (len != 0) { 444205194Sdelphij tree[n].Code = next_code[len]; 445205194Sdelphij next_code[len]++; 446205194Sdelphij } 447205194Sdelphij 448205194Sdelphij 449205194Sdelphij 450205194SdelphijDeutsch Informational [Page 8] 451205194Sdelphij 452205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 453205194Sdelphij 454205194Sdelphij 455205194Sdelphij } 456205194Sdelphij 457205194Sdelphij Example: 458205194Sdelphij 459205194Sdelphij Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, 460205194Sdelphij 3, 2, 4, 4). After step 1, we have: 461205194Sdelphij 462205194Sdelphij N bl_count[N] 463205194Sdelphij - ----------- 464205194Sdelphij 2 1 465205194Sdelphij 3 5 466205194Sdelphij 4 2 467205194Sdelphij 468205194Sdelphij Step 2 computes the following next_code values: 469205194Sdelphij 470205194Sdelphij N next_code[N] 471205194Sdelphij - ------------ 472205194Sdelphij 1 0 473205194Sdelphij 2 0 474205194Sdelphij 3 2 475205194Sdelphij 4 14 476205194Sdelphij 477205194Sdelphij Step 3 produces the following code values: 478205194Sdelphij 479205194Sdelphij Symbol Length Code 480205194Sdelphij ------ ------ ---- 481205194Sdelphij A 3 010 482205194Sdelphij B 3 011 483205194Sdelphij C 3 100 484205194Sdelphij D 3 101 485205194Sdelphij E 3 110 486205194Sdelphij F 2 00 487205194Sdelphij G 4 1110 488205194Sdelphij H 4 1111 489205194Sdelphij 490205194Sdelphij 3.2.3. Details of block format 491205194Sdelphij 492205194Sdelphij Each block of compressed data begins with 3 header bits 493205194Sdelphij containing the following data: 494205194Sdelphij 495205194Sdelphij first bit BFINAL 496205194Sdelphij next 2 bits BTYPE 497205194Sdelphij 498205194Sdelphij Note that the header bits do not necessarily begin on a byte 499205194Sdelphij boundary, since a block does not necessarily occupy an integral 500205194Sdelphij number of bytes. 501205194Sdelphij 502205194Sdelphij 503205194Sdelphij 504205194Sdelphij 505205194Sdelphij 506205194SdelphijDeutsch Informational [Page 9] 507205194Sdelphij 508205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 509205194Sdelphij 510205194Sdelphij 511205194Sdelphij BFINAL is set if and only if this is the last block of the data 512205194Sdelphij set. 513205194Sdelphij 514205194Sdelphij BTYPE specifies how the data are compressed, as follows: 515205194Sdelphij 516205194Sdelphij 00 - no compression 517205194Sdelphij 01 - compressed with fixed Huffman codes 518205194Sdelphij 10 - compressed with dynamic Huffman codes 519205194Sdelphij 11 - reserved (error) 520205194Sdelphij 521205194Sdelphij The only difference between the two compressed cases is how the 522205194Sdelphij Huffman codes for the literal/length and distance alphabets are 523205194Sdelphij defined. 524205194Sdelphij 525205194Sdelphij In all cases, the decoding algorithm for the actual data is as 526205194Sdelphij follows: 527205194Sdelphij 528205194Sdelphij do 529205194Sdelphij read block header from input stream. 530205194Sdelphij if stored with no compression 531205194Sdelphij skip any remaining bits in current partially 532205194Sdelphij processed byte 533205194Sdelphij read LEN and NLEN (see next section) 534205194Sdelphij copy LEN bytes of data to output 535205194Sdelphij otherwise 536205194Sdelphij if compressed with dynamic Huffman codes 537205194Sdelphij read representation of code trees (see 538205194Sdelphij subsection below) 539205194Sdelphij loop (until end of block code recognized) 540205194Sdelphij decode literal/length value from input stream 541205194Sdelphij if value < 256 542205194Sdelphij copy value (literal byte) to output stream 543205194Sdelphij otherwise 544205194Sdelphij if value = end of block (256) 545205194Sdelphij break from loop 546205194Sdelphij otherwise (value = 257..285) 547205194Sdelphij decode distance from input stream 548205194Sdelphij 549205194Sdelphij move backwards distance bytes in the output 550205194Sdelphij stream, and copy length bytes from this 551205194Sdelphij position to the output stream. 552205194Sdelphij end loop 553205194Sdelphij while not last block 554205194Sdelphij 555205194Sdelphij Note that a duplicated string reference may refer to a string 556205194Sdelphij in a previous block; i.e., the backward distance may cross one 557205194Sdelphij or more block boundaries. However a distance cannot refer past 558205194Sdelphij the beginning of the output stream. (An application using a 559205194Sdelphij 560205194Sdelphij 561205194Sdelphij 562205194SdelphijDeutsch Informational [Page 10] 563205194Sdelphij 564205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 565205194Sdelphij 566205194Sdelphij 567205194Sdelphij preset dictionary might discard part of the output stream; a 568205194Sdelphij distance can refer to that part of the output stream anyway) 569205194Sdelphij Note also that the referenced string may overlap the current 570205194Sdelphij position; for example, if the last 2 bytes decoded have values 571205194Sdelphij X and Y, a string reference with <length = 5, distance = 2> 572205194Sdelphij adds X,Y,X,Y,X to the output stream. 573205194Sdelphij 574205194Sdelphij We now specify each compression method in turn. 575205194Sdelphij 576205194Sdelphij 3.2.4. Non-compressed blocks (BTYPE=00) 577205194Sdelphij 578205194Sdelphij Any bits of input up to the next byte boundary are ignored. 579205194Sdelphij The rest of the block consists of the following information: 580205194Sdelphij 581205194Sdelphij 0 1 2 3 4... 582205194Sdelphij +---+---+---+---+================================+ 583205194Sdelphij | LEN | NLEN |... LEN bytes of literal data...| 584205194Sdelphij +---+---+---+---+================================+ 585205194Sdelphij 586205194Sdelphij LEN is the number of data bytes in the block. NLEN is the 587205194Sdelphij one's complement of LEN. 588205194Sdelphij 589205194Sdelphij 3.2.5. Compressed blocks (length and distance codes) 590205194Sdelphij 591205194Sdelphij As noted above, encoded data blocks in the "deflate" format 592205194Sdelphij consist of sequences of symbols drawn from three conceptually 593205194Sdelphij distinct alphabets: either literal bytes, from the alphabet of 594205194Sdelphij byte values (0..255), or <length, backward distance> pairs, 595205194Sdelphij where the length is drawn from (3..258) and the distance is 596205194Sdelphij drawn from (1..32,768). In fact, the literal and length 597205194Sdelphij alphabets are merged into a single alphabet (0..285), where 598205194Sdelphij values 0..255 represent literal bytes, the value 256 indicates 599205194Sdelphij end-of-block, and values 257..285 represent length codes 600205194Sdelphij (possibly in conjunction with extra bits following the symbol 601205194Sdelphij code) as follows: 602205194Sdelphij 603205194Sdelphij 604205194Sdelphij 605205194Sdelphij 606205194Sdelphij 607205194Sdelphij 608205194Sdelphij 609205194Sdelphij 610205194Sdelphij 611205194Sdelphij 612205194Sdelphij 613205194Sdelphij 614205194Sdelphij 615205194Sdelphij 616205194Sdelphij 617205194Sdelphij 618205194SdelphijDeutsch Informational [Page 11] 619205194Sdelphij 620205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 621205194Sdelphij 622205194Sdelphij 623205194Sdelphij Extra Extra Extra 624205194Sdelphij Code Bits Length(s) Code Bits Lengths Code Bits Length(s) 625205194Sdelphij ---- ---- ------ ---- ---- ------- ---- ---- ------- 626205194Sdelphij 257 0 3 267 1 15,16 277 4 67-82 627205194Sdelphij 258 0 4 268 1 17,18 278 4 83-98 628205194Sdelphij 259 0 5 269 2 19-22 279 4 99-114 629205194Sdelphij 260 0 6 270 2 23-26 280 4 115-130 630205194Sdelphij 261 0 7 271 2 27-30 281 5 131-162 631205194Sdelphij 262 0 8 272 2 31-34 282 5 163-194 632205194Sdelphij 263 0 9 273 3 35-42 283 5 195-226 633205194Sdelphij 264 0 10 274 3 43-50 284 5 227-257 634205194Sdelphij 265 1 11,12 275 3 51-58 285 0 258 635205194Sdelphij 266 1 13,14 276 3 59-66 636205194Sdelphij 637205194Sdelphij The extra bits should be interpreted as a machine integer 638205194Sdelphij stored with the most-significant bit first, e.g., bits 1110 639205194Sdelphij represent the value 14. 640205194Sdelphij 641205194Sdelphij Extra Extra Extra 642205194Sdelphij Code Bits Dist Code Bits Dist Code Bits Distance 643205194Sdelphij ---- ---- ---- ---- ---- ------ ---- ---- -------- 644205194Sdelphij 0 0 1 10 4 33-48 20 9 1025-1536 645205194Sdelphij 1 0 2 11 4 49-64 21 9 1537-2048 646205194Sdelphij 2 0 3 12 5 65-96 22 10 2049-3072 647205194Sdelphij 3 0 4 13 5 97-128 23 10 3073-4096 648205194Sdelphij 4 1 5,6 14 6 129-192 24 11 4097-6144 649205194Sdelphij 5 1 7,8 15 6 193-256 25 11 6145-8192 650205194Sdelphij 6 2 9-12 16 7 257-384 26 12 8193-12288 651205194Sdelphij 7 2 13-16 17 7 385-512 27 12 12289-16384 652205194Sdelphij 8 3 17-24 18 8 513-768 28 13 16385-24576 653205194Sdelphij 9 3 25-32 19 8 769-1024 29 13 24577-32768 654205194Sdelphij 655205194Sdelphij 3.2.6. Compression with fixed Huffman codes (BTYPE=01) 656205194Sdelphij 657205194Sdelphij The Huffman codes for the two alphabets are fixed, and are not 658205194Sdelphij represented explicitly in the data. The Huffman code lengths 659205194Sdelphij for the literal/length alphabet are: 660205194Sdelphij 661205194Sdelphij Lit Value Bits Codes 662205194Sdelphij --------- ---- ----- 663205194Sdelphij 0 - 143 8 00110000 through 664205194Sdelphij 10111111 665205194Sdelphij 144 - 255 9 110010000 through 666205194Sdelphij 111111111 667205194Sdelphij 256 - 279 7 0000000 through 668205194Sdelphij 0010111 669205194Sdelphij 280 - 287 8 11000000 through 670205194Sdelphij 11000111 671205194Sdelphij 672205194Sdelphij 673205194Sdelphij 674205194SdelphijDeutsch Informational [Page 12] 675205194Sdelphij 676205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 677205194Sdelphij 678205194Sdelphij 679205194Sdelphij The code lengths are sufficient to generate the actual codes, 680205194Sdelphij as described above; we show the codes in the table for added 681205194Sdelphij clarity. Literal/length values 286-287 will never actually 682205194Sdelphij occur in the compressed data, but participate in the code 683205194Sdelphij construction. 684205194Sdelphij 685205194Sdelphij Distance codes 0-31 are represented by (fixed-length) 5-bit 686205194Sdelphij codes, with possible additional bits as shown in the table 687205194Sdelphij shown in Paragraph 3.2.5, above. Note that distance codes 30- 688205194Sdelphij 31 will never actually occur in the compressed data. 689205194Sdelphij 690205194Sdelphij 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) 691205194Sdelphij 692205194Sdelphij The Huffman codes for the two alphabets appear in the block 693205194Sdelphij immediately after the header bits and before the actual 694205194Sdelphij compressed data, first the literal/length code and then the 695205194Sdelphij distance code. Each code is defined by a sequence of code 696205194Sdelphij lengths, as discussed in Paragraph 3.2.2, above. For even 697205194Sdelphij greater compactness, the code length sequences themselves are 698205194Sdelphij compressed using a Huffman code. The alphabet for code lengths 699205194Sdelphij is as follows: 700205194Sdelphij 701205194Sdelphij 0 - 15: Represent code lengths of 0 - 15 702205194Sdelphij 16: Copy the previous code length 3 - 6 times. 703205194Sdelphij The next 2 bits indicate repeat length 704205194Sdelphij (0 = 3, ... , 3 = 6) 705205194Sdelphij Example: Codes 8, 16 (+2 bits 11), 706205194Sdelphij 16 (+2 bits 10) will expand to 707205194Sdelphij 12 code lengths of 8 (1 + 6 + 5) 708205194Sdelphij 17: Repeat a code length of 0 for 3 - 10 times. 709205194Sdelphij (3 bits of length) 710205194Sdelphij 18: Repeat a code length of 0 for 11 - 138 times 711205194Sdelphij (7 bits of length) 712205194Sdelphij 713205194Sdelphij A code length of 0 indicates that the corresponding symbol in 714205194Sdelphij the literal/length or distance alphabet will not occur in the 715205194Sdelphij block, and should not participate in the Huffman code 716205194Sdelphij construction algorithm given earlier. If only one distance 717205194Sdelphij code is used, it is encoded using one bit, not zero bits; in 718205194Sdelphij this case there is a single code length of one, with one unused 719205194Sdelphij code. One distance code of zero bits means that there are no 720205194Sdelphij distance codes used at all (the data is all literals). 721205194Sdelphij 722205194Sdelphij We can now define the format of the block: 723205194Sdelphij 724205194Sdelphij 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286) 725205194Sdelphij 5 Bits: HDIST, # of Distance codes - 1 (1 - 32) 726205194Sdelphij 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19) 727205194Sdelphij 728205194Sdelphij 729205194Sdelphij 730205194SdelphijDeutsch Informational [Page 13] 731205194Sdelphij 732205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 733205194Sdelphij 734205194Sdelphij 735205194Sdelphij (HCLEN + 4) x 3 bits: code lengths for the code length 736205194Sdelphij alphabet given just above, in the order: 16, 17, 18, 737205194Sdelphij 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 738205194Sdelphij 739205194Sdelphij These code lengths are interpreted as 3-bit integers 740205194Sdelphij (0-7); as above, a code length of 0 means the 741205194Sdelphij corresponding symbol (literal/length or distance code 742205194Sdelphij length) is not used. 743205194Sdelphij 744205194Sdelphij HLIT + 257 code lengths for the literal/length alphabet, 745205194Sdelphij encoded using the code length Huffman code 746205194Sdelphij 747205194Sdelphij HDIST + 1 code lengths for the distance alphabet, 748205194Sdelphij encoded using the code length Huffman code 749205194Sdelphij 750205194Sdelphij The actual compressed data of the block, 751205194Sdelphij encoded using the literal/length and distance Huffman 752205194Sdelphij codes 753205194Sdelphij 754205194Sdelphij The literal/length symbol 256 (end of data), 755205194Sdelphij encoded using the literal/length Huffman code 756205194Sdelphij 757205194Sdelphij The code length repeat codes can cross from HLIT + 257 to the 758205194Sdelphij HDIST + 1 code lengths. In other words, all code lengths form 759205194Sdelphij a single sequence of HLIT + HDIST + 258 values. 760205194Sdelphij 761205194Sdelphij 3.3. Compliance 762205194Sdelphij 763205194Sdelphij A compressor may limit further the ranges of values specified in 764205194Sdelphij the previous section and still be compliant; for example, it may 765205194Sdelphij limit the range of backward pointers to some value smaller than 766205194Sdelphij 32K. Similarly, a compressor may limit the size of blocks so that 767205194Sdelphij a compressible block fits in memory. 768205194Sdelphij 769205194Sdelphij A compliant decompressor must accept the full range of possible 770205194Sdelphij values defined in the previous section, and must accept blocks of 771205194Sdelphij arbitrary size. 772205194Sdelphij 773205194Sdelphij4. Compression algorithm details 774205194Sdelphij 775205194Sdelphij While it is the intent of this document to define the "deflate" 776205194Sdelphij compressed data format without reference to any particular 777205194Sdelphij compression algorithm, the format is related to the compressed 778205194Sdelphij formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below); 779205194Sdelphij since many variations of LZ77 are patented, it is strongly 780205194Sdelphij recommended that the implementor of a compressor follow the general 781205194Sdelphij algorithm presented here, which is known not to be patented per se. 782205194Sdelphij The material in this section is not part of the definition of the 783205194Sdelphij 784205194Sdelphij 785205194Sdelphij 786205194SdelphijDeutsch Informational [Page 14] 787205194Sdelphij 788205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 789205194Sdelphij 790205194Sdelphij 791205194Sdelphij specification per se, and a compressor need not follow it in order to 792205194Sdelphij be compliant. 793205194Sdelphij 794205194Sdelphij The compressor terminates a block when it determines that starting a 795205194Sdelphij new block with fresh trees would be useful, or when the block size 796205194Sdelphij fills up the compressor's block buffer. 797205194Sdelphij 798205194Sdelphij The compressor uses a chained hash table to find duplicated strings, 799205194Sdelphij using a hash function that operates on 3-byte sequences. At any 800205194Sdelphij given point during compression, let XYZ be the next 3 input bytes to 801205194Sdelphij be examined (not necessarily all different, of course). First, the 802205194Sdelphij compressor examines the hash chain for XYZ. If the chain is empty, 803205194Sdelphij the compressor simply writes out X as a literal byte and advances one 804205194Sdelphij byte in the input. If the hash chain is not empty, indicating that 805205194Sdelphij the sequence XYZ (or, if we are unlucky, some other 3 bytes with the 806205194Sdelphij same hash function value) has occurred recently, the compressor 807205194Sdelphij compares all strings on the XYZ hash chain with the actual input data 808205194Sdelphij sequence starting at the current point, and selects the longest 809205194Sdelphij match. 810205194Sdelphij 811205194Sdelphij The compressor searches the hash chains starting with the most recent 812205194Sdelphij strings, to favor small distances and thus take advantage of the 813205194Sdelphij Huffman encoding. The hash chains are singly linked. There are no 814205194Sdelphij deletions from the hash chains; the algorithm simply discards matches 815205194Sdelphij that are too old. To avoid a worst-case situation, very long hash 816205194Sdelphij chains are arbitrarily truncated at a certain length, determined by a 817205194Sdelphij run-time parameter. 818205194Sdelphij 819205194Sdelphij To improve overall compression, the compressor optionally defers the 820205194Sdelphij selection of matches ("lazy matching"): after a match of length N has 821205194Sdelphij been found, the compressor searches for a longer match starting at 822205194Sdelphij the next input byte. If it finds a longer match, it truncates the 823205194Sdelphij previous match to a length of one (thus producing a single literal 824205194Sdelphij byte) and then emits the longer match. Otherwise, it emits the 825205194Sdelphij original match, and, as described above, advances N bytes before 826205194Sdelphij continuing. 827205194Sdelphij 828205194Sdelphij Run-time parameters also control this "lazy match" procedure. If 829205194Sdelphij compression ratio is most important, the compressor attempts a 830205194Sdelphij complete second search regardless of the length of the first match. 831205194Sdelphij In the normal case, if the current match is "long enough", the 832205194Sdelphij compressor reduces the search for a longer match, thus speeding up 833205194Sdelphij the process. If speed is most important, the compressor inserts new 834205194Sdelphij strings in the hash table only when no match was found, or when the 835205194Sdelphij match is not "too long". This degrades the compression ratio but 836205194Sdelphij saves time since there are both fewer insertions and fewer searches. 837205194Sdelphij 838205194Sdelphij 839205194Sdelphij 840205194Sdelphij 841205194Sdelphij 842205194SdelphijDeutsch Informational [Page 15] 843205194Sdelphij 844205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 845205194Sdelphij 846205194Sdelphij 847205194Sdelphij5. References 848205194Sdelphij 849205194Sdelphij [1] Huffman, D. A., "A Method for the Construction of Minimum 850205194Sdelphij Redundancy Codes", Proceedings of the Institute of Radio 851205194Sdelphij Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101. 852205194Sdelphij 853205194Sdelphij [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data 854205194Sdelphij Compression", IEEE Transactions on Information Theory, Vol. 23, 855205194Sdelphij No. 3, pp. 337-343. 856205194Sdelphij 857205194Sdelphij [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources, 858205194Sdelphij available in ftp://ftp.uu.net/pub/archiving/zip/doc/ 859205194Sdelphij 860205194Sdelphij [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources, 861205194Sdelphij available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/ 862205194Sdelphij 863205194Sdelphij [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix 864205194Sdelphij encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169. 865205194Sdelphij 866205194Sdelphij [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes," 867205194Sdelphij Comm. ACM, 33,4, April 1990, pp. 449-459. 868205194Sdelphij 869205194Sdelphij6. Security Considerations 870205194Sdelphij 871205194Sdelphij Any data compression method involves the reduction of redundancy in 872205194Sdelphij the data. Consequently, any corruption of the data is likely to have 873205194Sdelphij severe effects and be difficult to correct. Uncompressed text, on 874205194Sdelphij the other hand, will probably still be readable despite the presence 875205194Sdelphij of some corrupted bytes. 876205194Sdelphij 877205194Sdelphij It is recommended that systems using this data format provide some 878205194Sdelphij means of validating the integrity of the compressed data. See 879205194Sdelphij reference [3], for example. 880205194Sdelphij 881205194Sdelphij7. Source code 882205194Sdelphij 883205194Sdelphij Source code for a C language implementation of a "deflate" compliant 884205194Sdelphij compressor and decompressor is available within the zlib package at 885205194Sdelphij ftp://ftp.uu.net/pub/archiving/zip/zlib/. 886205194Sdelphij 887205194Sdelphij8. Acknowledgements 888205194Sdelphij 889205194Sdelphij Trademarks cited in this document are the property of their 890205194Sdelphij respective owners. 891205194Sdelphij 892205194Sdelphij Phil Katz designed the deflate format. Jean-Loup Gailly and Mark 893205194Sdelphij Adler wrote the related software described in this specification. 894205194Sdelphij Glenn Randers-Pehrson converted this document to RFC and HTML format. 895205194Sdelphij 896205194Sdelphij 897205194Sdelphij 898205194SdelphijDeutsch Informational [Page 16] 899205194Sdelphij 900205194SdelphijRFC 1951 DEFLATE Compressed Data Format Specification May 1996 901205194Sdelphij 902205194Sdelphij 903205194Sdelphij9. Author's Address 904205194Sdelphij 905205194Sdelphij L. Peter Deutsch 906205194Sdelphij Aladdin Enterprises 907205194Sdelphij 203 Santa Margarita Ave. 908205194Sdelphij Menlo Park, CA 94025 909205194Sdelphij 910205194Sdelphij Phone: (415) 322-0103 (AM only) 911205194Sdelphij FAX: (415) 322-1734 912205194Sdelphij EMail: <ghost@aladdin.com> 913205194Sdelphij 914205194Sdelphij Questions about the technical content of this specification can be 915205194Sdelphij sent by email to: 916205194Sdelphij 917205194Sdelphij Jean-Loup Gailly <gzip@prep.ai.mit.edu> and 918205194Sdelphij Mark Adler <madler@alumni.caltech.edu> 919205194Sdelphij 920205194Sdelphij Editorial comments on this specification can be sent by email to: 921205194Sdelphij 922205194Sdelphij L. Peter Deutsch <ghost@aladdin.com> and 923205194Sdelphij Glenn Randers-Pehrson <randeg@alumni.rpi.edu> 924205194Sdelphij 925205194Sdelphij 926205194Sdelphij 927205194Sdelphij 928205194Sdelphij 929205194Sdelphij 930205194Sdelphij 931205194Sdelphij 932205194Sdelphij 933205194Sdelphij 934205194Sdelphij 935205194Sdelphij 936205194Sdelphij 937205194Sdelphij 938205194Sdelphij 939205194Sdelphij 940205194Sdelphij 941205194Sdelphij 942205194Sdelphij 943205194Sdelphij 944205194Sdelphij 945205194Sdelphij 946205194Sdelphij 947205194Sdelphij 948205194Sdelphij 949205194Sdelphij 950205194Sdelphij 951205194Sdelphij 952205194Sdelphij 953205194Sdelphij 954205194SdelphijDeutsch Informational [Page 17] 955205194Sdelphij 956