1205194Sdelphij
2205194Sdelphij
3205194Sdelphij
4205194Sdelphij
5205194Sdelphij
6205194Sdelphij
7205194SdelphijNetwork Working Group                                         P. Deutsch
8205194SdelphijRequest for Comments: 1951                           Aladdin Enterprises
9205194SdelphijCategory: Informational                                         May 1996
10205194Sdelphij
11205194Sdelphij
12205194Sdelphij        DEFLATE Compressed Data Format Specification version 1.3
13205194Sdelphij
14205194SdelphijStatus of This Memo
15205194Sdelphij
16205194Sdelphij   This memo provides information for the Internet community.  This memo
17205194Sdelphij   does not specify an Internet standard of any kind.  Distribution of
18205194Sdelphij   this memo is unlimited.
19205194Sdelphij
20205194SdelphijIESG Note:
21205194Sdelphij
22205194Sdelphij   The IESG takes no position on the validity of any Intellectual
23205194Sdelphij   Property Rights statements contained in this document.
24205194Sdelphij
25205194SdelphijNotices
26205194Sdelphij
27205194Sdelphij   Copyright (c) 1996 L. Peter Deutsch
28205194Sdelphij
29205194Sdelphij   Permission is granted to copy and distribute this document for any
30205194Sdelphij   purpose and without charge, including translations into other
31205194Sdelphij   languages and incorporation into compilations, provided that the
32205194Sdelphij   copyright notice and this notice are preserved, and that any
33205194Sdelphij   substantive changes or deletions from the original are clearly
34205194Sdelphij   marked.
35205194Sdelphij
36205194Sdelphij   A pointer to the latest version of this and related documentation in
37205194Sdelphij   HTML format can be found at the URL
38205194Sdelphij   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
39205194Sdelphij
40205194SdelphijAbstract
41205194Sdelphij
42205194Sdelphij   This specification defines a lossless compressed data format that
43205194Sdelphij   compresses data using a combination of the LZ77 algorithm and Huffman
44205194Sdelphij   coding, with efficiency comparable to the best currently available
45205194Sdelphij   general-purpose compression methods.  The data can be produced or
46205194Sdelphij   consumed, even for an arbitrarily long sequentially presented input
47205194Sdelphij   data stream, using only an a priori bounded amount of intermediate
48205194Sdelphij   storage.  The format can be implemented readily in a manner not
49205194Sdelphij   covered by patents.
50205194Sdelphij
51205194Sdelphij
52205194Sdelphij
53205194Sdelphij
54205194Sdelphij
55205194Sdelphij
56205194Sdelphij
57205194Sdelphij
58205194SdelphijDeutsch                      Informational                      [Page 1]
59205194Sdelphij
60205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
61205194Sdelphij
62205194Sdelphij
63205194SdelphijTable of Contents
64205194Sdelphij
65205194Sdelphij   1. Introduction ................................................... 2
66205194Sdelphij      1.1. Purpose ................................................... 2
67205194Sdelphij      1.2. Intended audience ......................................... 3
68205194Sdelphij      1.3. Scope ..................................................... 3
69205194Sdelphij      1.4. Compliance ................................................ 3
70205194Sdelphij      1.5.  Definitions of terms and conventions used ................ 3
71205194Sdelphij      1.6. Changes from previous versions ............................ 4
72205194Sdelphij   2. Compressed representation overview ............................. 4
73205194Sdelphij   3. Detailed specification ......................................... 5
74205194Sdelphij      3.1. Overall conventions ....................................... 5
75205194Sdelphij          3.1.1. Packing into bytes .................................. 5
76205194Sdelphij      3.2. Compressed block format ................................... 6
77205194Sdelphij          3.2.1. Synopsis of prefix and Huffman coding ............... 6
78205194Sdelphij          3.2.2. Use of Huffman coding in the "deflate" format ....... 7
79205194Sdelphij          3.2.3. Details of block format ............................. 9
80205194Sdelphij          3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
81205194Sdelphij          3.2.5. Compressed blocks (length and distance codes) ...... 11
82205194Sdelphij          3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
83205194Sdelphij          3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
84205194Sdelphij      3.3. Compliance ............................................... 14
85205194Sdelphij   4. Compression algorithm details ................................. 14
86205194Sdelphij   5. References .................................................... 16
87205194Sdelphij   6. Security Considerations ....................................... 16
88205194Sdelphij   7. Source code ................................................... 16
89205194Sdelphij   8. Acknowledgements .............................................. 16
90205194Sdelphij   9. Author's Address .............................................. 17
91205194Sdelphij
92205194Sdelphij1. Introduction
93205194Sdelphij
94205194Sdelphij   1.1. Purpose
95205194Sdelphij
96205194Sdelphij      The purpose of this specification is to define a lossless
97205194Sdelphij      compressed data format that:
98205194Sdelphij          * Is independent of CPU type, operating system, file system,
99205194Sdelphij            and character set, and hence can be used for interchange;
100205194Sdelphij          * Can be produced or consumed, even for an arbitrarily long
101205194Sdelphij            sequentially presented input data stream, using only an a
102205194Sdelphij            priori bounded amount of intermediate storage, and hence
103205194Sdelphij            can be used in data communications or similar structures
104205194Sdelphij            such as Unix filters;
105205194Sdelphij          * Compresses data with efficiency comparable to the best
106205194Sdelphij            currently available general-purpose compression methods,
107205194Sdelphij            and in particular considerably better than the "compress"
108205194Sdelphij            program;
109205194Sdelphij          * Can be implemented readily in a manner not covered by
110205194Sdelphij            patents, and hence can be practiced freely;
111205194Sdelphij
112205194Sdelphij
113205194Sdelphij
114205194SdelphijDeutsch                      Informational                      [Page 2]
115205194Sdelphij
116205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
117205194Sdelphij
118205194Sdelphij
119205194Sdelphij          * Is compatible with the file format produced by the current
120205194Sdelphij            widely used gzip utility, in that conforming decompressors
121205194Sdelphij            will be able to read data produced by the existing gzip
122205194Sdelphij            compressor.
123205194Sdelphij
124205194Sdelphij      The data format defined by this specification does not attempt to:
125205194Sdelphij
126205194Sdelphij          * Allow random access to compressed data;
127205194Sdelphij          * Compress specialized data (e.g., raster graphics) as well
128205194Sdelphij            as the best currently available specialized algorithms.
129205194Sdelphij
130205194Sdelphij      A simple counting argument shows that no lossless compression
131205194Sdelphij      algorithm can compress every possible input data set.  For the
132205194Sdelphij      format defined here, the worst case expansion is 5 bytes per 32K-
133205194Sdelphij      byte block, i.e., a size increase of 0.015% for large data sets.
134205194Sdelphij      English text usually compresses by a factor of 2.5 to 3;
135205194Sdelphij      executable files usually compress somewhat less; graphical data
136205194Sdelphij      such as raster images may compress much more.
137205194Sdelphij
138205194Sdelphij   1.2. Intended audience
139205194Sdelphij
140205194Sdelphij      This specification is intended for use by implementors of software
141205194Sdelphij      to compress data into "deflate" format and/or decompress data from
142205194Sdelphij      "deflate" format.
143205194Sdelphij
144205194Sdelphij      The text of the specification assumes a basic background in
145205194Sdelphij      programming at the level of bits and other primitive data
146205194Sdelphij      representations.  Familiarity with the technique of Huffman coding
147205194Sdelphij      is helpful but not required.
148205194Sdelphij
149205194Sdelphij   1.3. Scope
150205194Sdelphij
151205194Sdelphij      The specification specifies a method for representing a sequence
152205194Sdelphij      of bytes as a (usually shorter) sequence of bits, and a method for
153205194Sdelphij      packing the latter bit sequence into bytes.
154205194Sdelphij
155205194Sdelphij   1.4. Compliance
156205194Sdelphij
157205194Sdelphij      Unless otherwise indicated below, a compliant decompressor must be
158205194Sdelphij      able to accept and decompress any data set that conforms to all
159205194Sdelphij      the specifications presented here; a compliant compressor must
160205194Sdelphij      produce data sets that conform to all the specifications presented
161205194Sdelphij      here.
162205194Sdelphij
163205194Sdelphij   1.5.  Definitions of terms and conventions used
164205194Sdelphij
165205194Sdelphij      Byte: 8 bits stored or transmitted as a unit (same as an octet).
166205194Sdelphij      For this specification, a byte is exactly 8 bits, even on machines
167205194Sdelphij
168205194Sdelphij
169205194Sdelphij
170205194SdelphijDeutsch                      Informational                      [Page 3]
171205194Sdelphij
172205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
173205194Sdelphij
174205194Sdelphij
175205194Sdelphij      which store a character on a number of bits different from eight.
176205194Sdelphij      See below, for the numbering of bits within a byte.
177205194Sdelphij
178205194Sdelphij      String: a sequence of arbitrary bytes.
179205194Sdelphij
180205194Sdelphij   1.6. Changes from previous versions
181205194Sdelphij
182205194Sdelphij      There have been no technical changes to the deflate format since
183205194Sdelphij      version 1.1 of this specification.  In version 1.2, some
184205194Sdelphij      terminology was changed.  Version 1.3 is a conversion of the
185205194Sdelphij      specification to RFC style.
186205194Sdelphij
187205194Sdelphij2. Compressed representation overview
188205194Sdelphij
189205194Sdelphij   A compressed data set consists of a series of blocks, corresponding
190205194Sdelphij   to successive blocks of input data.  The block sizes are arbitrary,
191205194Sdelphij   except that non-compressible blocks are limited to 65,535 bytes.
192205194Sdelphij
193205194Sdelphij   Each block is compressed using a combination of the LZ77 algorithm
194205194Sdelphij   and Huffman coding. The Huffman trees for each block are independent
195205194Sdelphij   of those for previous or subsequent blocks; the LZ77 algorithm may
196205194Sdelphij   use a reference to a duplicated string occurring in a previous block,
197205194Sdelphij   up to 32K input bytes before.
198205194Sdelphij
199205194Sdelphij   Each block consists of two parts: a pair of Huffman code trees that
200205194Sdelphij   describe the representation of the compressed data part, and a
201205194Sdelphij   compressed data part.  (The Huffman trees themselves are compressed
202205194Sdelphij   using Huffman encoding.)  The compressed data consists of a series of
203205194Sdelphij   elements of two types: literal bytes (of strings that have not been
204205194Sdelphij   detected as duplicated within the previous 32K input bytes), and
205205194Sdelphij   pointers to duplicated strings, where a pointer is represented as a
206205194Sdelphij   pair <length, backward distance>.  The representation used in the
207205194Sdelphij   "deflate" format limits distances to 32K bytes and lengths to 258
208205194Sdelphij   bytes, but does not limit the size of a block, except for
209205194Sdelphij   uncompressible blocks, which are limited as noted above.
210205194Sdelphij
211205194Sdelphij   Each type of value (literals, distances, and lengths) in the
212205194Sdelphij   compressed data is represented using a Huffman code, using one code
213205194Sdelphij   tree for literals and lengths and a separate code tree for distances.
214205194Sdelphij   The code trees for each block appear in a compact form just before
215205194Sdelphij   the compressed data for that block.
216205194Sdelphij
217205194Sdelphij
218205194Sdelphij
219205194Sdelphij
220205194Sdelphij
221205194Sdelphij
222205194Sdelphij
223205194Sdelphij
224205194Sdelphij
225205194Sdelphij
226205194SdelphijDeutsch                      Informational                      [Page 4]
227205194Sdelphij
228205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
229205194Sdelphij
230205194Sdelphij
231205194Sdelphij3. Detailed specification
232205194Sdelphij
233205194Sdelphij   3.1. Overall conventions In the diagrams below, a box like this:
234205194Sdelphij
235205194Sdelphij         +---+
236205194Sdelphij         |   | <-- the vertical bars might be missing
237205194Sdelphij         +---+
238205194Sdelphij
239205194Sdelphij      represents one byte; a box like this:
240205194Sdelphij
241205194Sdelphij         +==============+
242205194Sdelphij         |              |
243205194Sdelphij         +==============+
244205194Sdelphij
245205194Sdelphij      represents a variable number of bytes.
246205194Sdelphij
247205194Sdelphij      Bytes stored within a computer do not have a "bit order", since
248205194Sdelphij      they are always treated as a unit.  However, a byte considered as
249205194Sdelphij      an integer between 0 and 255 does have a most- and least-
250205194Sdelphij      significant bit, and since we write numbers with the most-
251205194Sdelphij      significant digit on the left, we also write bytes with the most-
252205194Sdelphij      significant bit on the left.  In the diagrams below, we number the
253205194Sdelphij      bits of a byte so that bit 0 is the least-significant bit, i.e.,
254205194Sdelphij      the bits are numbered:
255205194Sdelphij
256205194Sdelphij         +--------+
257205194Sdelphij         |76543210|
258205194Sdelphij         +--------+
259205194Sdelphij
260205194Sdelphij      Within a computer, a number may occupy multiple bytes.  All
261205194Sdelphij      multi-byte numbers in the format described here are stored with
262205194Sdelphij      the least-significant byte first (at the lower memory address).
263205194Sdelphij      For example, the decimal number 520 is stored as:
264205194Sdelphij
265205194Sdelphij             0        1
266205194Sdelphij         +--------+--------+
267205194Sdelphij         |00001000|00000010|
268205194Sdelphij         +--------+--------+
269205194Sdelphij          ^        ^
270205194Sdelphij          |        |
271205194Sdelphij          |        + more significant byte = 2 x 256
272205194Sdelphij          + less significant byte = 8
273205194Sdelphij
274205194Sdelphij      3.1.1. Packing into bytes
275205194Sdelphij
276205194Sdelphij         This document does not address the issue of the order in which
277205194Sdelphij         bits of a byte are transmitted on a bit-sequential medium,
278205194Sdelphij         since the final data format described here is byte- rather than
279205194Sdelphij
280205194Sdelphij
281205194Sdelphij
282205194SdelphijDeutsch                      Informational                      [Page 5]
283205194Sdelphij
284205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
285205194Sdelphij
286205194Sdelphij
287205194Sdelphij         bit-oriented.  However, we describe the compressed block format
288205194Sdelphij         in below, as a sequence of data elements of various bit
289205194Sdelphij         lengths, not a sequence of bytes.  We must therefore specify
290205194Sdelphij         how to pack these data elements into bytes to form the final
291205194Sdelphij         compressed byte sequence:
292205194Sdelphij
293205194Sdelphij             * Data elements are packed into bytes in order of
294205194Sdelphij               increasing bit number within the byte, i.e., starting
295205194Sdelphij               with the least-significant bit of the byte.
296205194Sdelphij             * Data elements other than Huffman codes are packed
297205194Sdelphij               starting with the least-significant bit of the data
298205194Sdelphij               element.
299205194Sdelphij             * Huffman codes are packed starting with the most-
300205194Sdelphij               significant bit of the code.
301205194Sdelphij
302205194Sdelphij         In other words, if one were to print out the compressed data as
303205194Sdelphij         a sequence of bytes, starting with the first byte at the
304205194Sdelphij         *right* margin and proceeding to the *left*, with the most-
305205194Sdelphij         significant bit of each byte on the left as usual, one would be
306205194Sdelphij         able to parse the result from right to left, with fixed-width
307205194Sdelphij         elements in the correct MSB-to-LSB order and Huffman codes in
308205194Sdelphij         bit-reversed order (i.e., with the first bit of the code in the
309205194Sdelphij         relative LSB position).
310205194Sdelphij
311205194Sdelphij   3.2. Compressed block format
312205194Sdelphij
313205194Sdelphij      3.2.1. Synopsis of prefix and Huffman coding
314205194Sdelphij
315205194Sdelphij         Prefix coding represents symbols from an a priori known
316205194Sdelphij         alphabet by bit sequences (codes), one code for each symbol, in
317205194Sdelphij         a manner such that different symbols may be represented by bit
318205194Sdelphij         sequences of different lengths, but a parser can always parse
319205194Sdelphij         an encoded string unambiguously symbol-by-symbol.
320205194Sdelphij
321205194Sdelphij         We define a prefix code in terms of a binary tree in which the
322205194Sdelphij         two edges descending from each non-leaf node are labeled 0 and
323205194Sdelphij         1 and in which the leaf nodes correspond one-for-one with (are
324205194Sdelphij         labeled with) the symbols of the alphabet; then the code for a
325205194Sdelphij         symbol is the sequence of 0's and 1's on the edges leading from
326205194Sdelphij         the root to the leaf labeled with that symbol.  For example:
327205194Sdelphij
328205194Sdelphij
329205194Sdelphij
330205194Sdelphij
331205194Sdelphij
332205194Sdelphij
333205194Sdelphij
334205194Sdelphij
335205194Sdelphij
336205194Sdelphij
337205194Sdelphij
338205194SdelphijDeutsch                      Informational                      [Page 6]
339205194Sdelphij
340205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
341205194Sdelphij
342205194Sdelphij
343205194Sdelphij                          /\              Symbol    Code
344205194Sdelphij                         0  1             ------    ----
345205194Sdelphij                        /    \                A      00
346205194Sdelphij                       /\     B               B       1
347205194Sdelphij                      0  1                    C     011
348205194Sdelphij                     /    \                   D     010
349205194Sdelphij                    A     /\
350205194Sdelphij                         0  1
351205194Sdelphij                        /    \
352205194Sdelphij                       D      C
353205194Sdelphij
354205194Sdelphij         A parser can decode the next symbol from an encoded input
355205194Sdelphij         stream by walking down the tree from the root, at each step
356205194Sdelphij         choosing the edge corresponding to the next input bit.
357205194Sdelphij
358205194Sdelphij         Given an alphabet with known symbol frequencies, the Huffman
359205194Sdelphij         algorithm allows the construction of an optimal prefix code
360205194Sdelphij         (one which represents strings with those symbol frequencies
361205194Sdelphij         using the fewest bits of any possible prefix codes for that
362205194Sdelphij         alphabet).  Such a code is called a Huffman code.  (See
363205194Sdelphij         reference [1] in Chapter 5, references for additional
364205194Sdelphij         information on Huffman codes.)
365205194Sdelphij
366205194Sdelphij         Note that in the "deflate" format, the Huffman codes for the
367205194Sdelphij         various alphabets must not exceed certain maximum code lengths.
368205194Sdelphij         This constraint complicates the algorithm for computing code
369205194Sdelphij         lengths from symbol frequencies.  Again, see Chapter 5,
370205194Sdelphij         references for details.
371205194Sdelphij
372205194Sdelphij      3.2.2. Use of Huffman coding in the "deflate" format
373205194Sdelphij
374205194Sdelphij         The Huffman codes used for each alphabet in the "deflate"
375205194Sdelphij         format have two additional rules:
376205194Sdelphij
377205194Sdelphij             * All codes of a given bit length have lexicographically
378205194Sdelphij               consecutive values, in the same order as the symbols
379205194Sdelphij               they represent;
380205194Sdelphij
381205194Sdelphij             * Shorter codes lexicographically precede longer codes.
382205194Sdelphij
383205194Sdelphij
384205194Sdelphij
385205194Sdelphij
386205194Sdelphij
387205194Sdelphij
388205194Sdelphij
389205194Sdelphij
390205194Sdelphij
391205194Sdelphij
392205194Sdelphij
393205194Sdelphij
394205194SdelphijDeutsch                      Informational                      [Page 7]
395205194Sdelphij
396205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
397205194Sdelphij
398205194Sdelphij
399205194Sdelphij         We could recode the example above to follow this rule as
400205194Sdelphij         follows, assuming that the order of the alphabet is ABCD:
401205194Sdelphij
402205194Sdelphij            Symbol  Code
403205194Sdelphij            ------  ----
404205194Sdelphij            A       10
405205194Sdelphij            B       0
406205194Sdelphij            C       110
407205194Sdelphij            D       111
408205194Sdelphij
409205194Sdelphij         I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
410205194Sdelphij         lexicographically consecutive.
411205194Sdelphij
412205194Sdelphij         Given this rule, we can define the Huffman code for an alphabet
413205194Sdelphij         just by giving the bit lengths of the codes for each symbol of
414205194Sdelphij         the alphabet in order; this is sufficient to determine the
415205194Sdelphij         actual codes.  In our example, the code is completely defined
416205194Sdelphij         by the sequence of bit lengths (2, 1, 3, 3).  The following
417205194Sdelphij         algorithm generates the codes as integers, intended to be read
418205194Sdelphij         from most- to least-significant bit.  The code lengths are
419205194Sdelphij         initially in tree[I].Len; the codes are produced in
420205194Sdelphij         tree[I].Code.
421205194Sdelphij
422205194Sdelphij         1)  Count the number of codes for each code length.  Let
423205194Sdelphij             bl_count[N] be the number of codes of length N, N >= 1.
424205194Sdelphij
425205194Sdelphij         2)  Find the numerical value of the smallest code for each
426205194Sdelphij             code length:
427205194Sdelphij
428205194Sdelphij                code = 0;
429205194Sdelphij                bl_count[0] = 0;
430205194Sdelphij                for (bits = 1; bits <= MAX_BITS; bits++) {
431205194Sdelphij                    code = (code + bl_count[bits-1]) << 1;
432205194Sdelphij                    next_code[bits] = code;
433205194Sdelphij                }
434205194Sdelphij
435205194Sdelphij         3)  Assign numerical values to all codes, using consecutive
436205194Sdelphij             values for all codes of the same length with the base
437205194Sdelphij             values determined at step 2. Codes that are never used
438205194Sdelphij             (which have a bit length of zero) must not be assigned a
439205194Sdelphij             value.
440205194Sdelphij
441205194Sdelphij                for (n = 0;  n <= max_code; n++) {
442205194Sdelphij                    len = tree[n].Len;
443205194Sdelphij                    if (len != 0) {
444205194Sdelphij                        tree[n].Code = next_code[len];
445205194Sdelphij                        next_code[len]++;
446205194Sdelphij                    }
447205194Sdelphij
448205194Sdelphij
449205194Sdelphij
450205194SdelphijDeutsch                      Informational                      [Page 8]
451205194Sdelphij
452205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
453205194Sdelphij
454205194Sdelphij
455205194Sdelphij                }
456205194Sdelphij
457205194Sdelphij         Example:
458205194Sdelphij
459205194Sdelphij         Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
460205194Sdelphij         3, 2, 4, 4).  After step 1, we have:
461205194Sdelphij
462205194Sdelphij            N      bl_count[N]
463205194Sdelphij            -      -----------
464205194Sdelphij            2      1
465205194Sdelphij            3      5
466205194Sdelphij            4      2
467205194Sdelphij
468205194Sdelphij         Step 2 computes the following next_code values:
469205194Sdelphij
470205194Sdelphij            N      next_code[N]
471205194Sdelphij            -      ------------
472205194Sdelphij            1      0
473205194Sdelphij            2      0
474205194Sdelphij            3      2
475205194Sdelphij            4      14
476205194Sdelphij
477205194Sdelphij         Step 3 produces the following code values:
478205194Sdelphij
479205194Sdelphij            Symbol Length   Code
480205194Sdelphij            ------ ------   ----
481205194Sdelphij            A       3        010
482205194Sdelphij            B       3        011
483205194Sdelphij            C       3        100
484205194Sdelphij            D       3        101
485205194Sdelphij            E       3        110
486205194Sdelphij            F       2         00
487205194Sdelphij            G       4       1110
488205194Sdelphij            H       4       1111
489205194Sdelphij
490205194Sdelphij      3.2.3. Details of block format
491205194Sdelphij
492205194Sdelphij         Each block of compressed data begins with 3 header bits
493205194Sdelphij         containing the following data:
494205194Sdelphij
495205194Sdelphij            first bit       BFINAL
496205194Sdelphij            next 2 bits     BTYPE
497205194Sdelphij
498205194Sdelphij         Note that the header bits do not necessarily begin on a byte
499205194Sdelphij         boundary, since a block does not necessarily occupy an integral
500205194Sdelphij         number of bytes.
501205194Sdelphij
502205194Sdelphij
503205194Sdelphij
504205194Sdelphij
505205194Sdelphij
506205194SdelphijDeutsch                      Informational                      [Page 9]
507205194Sdelphij
508205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
509205194Sdelphij
510205194Sdelphij
511205194Sdelphij         BFINAL is set if and only if this is the last block of the data
512205194Sdelphij         set.
513205194Sdelphij
514205194Sdelphij         BTYPE specifies how the data are compressed, as follows:
515205194Sdelphij
516205194Sdelphij            00 - no compression
517205194Sdelphij            01 - compressed with fixed Huffman codes
518205194Sdelphij            10 - compressed with dynamic Huffman codes
519205194Sdelphij            11 - reserved (error)
520205194Sdelphij
521205194Sdelphij         The only difference between the two compressed cases is how the
522205194Sdelphij         Huffman codes for the literal/length and distance alphabets are
523205194Sdelphij         defined.
524205194Sdelphij
525205194Sdelphij         In all cases, the decoding algorithm for the actual data is as
526205194Sdelphij         follows:
527205194Sdelphij
528205194Sdelphij            do
529205194Sdelphij               read block header from input stream.
530205194Sdelphij               if stored with no compression
531205194Sdelphij                  skip any remaining bits in current partially
532205194Sdelphij                     processed byte
533205194Sdelphij                  read LEN and NLEN (see next section)
534205194Sdelphij                  copy LEN bytes of data to output
535205194Sdelphij               otherwise
536205194Sdelphij                  if compressed with dynamic Huffman codes
537205194Sdelphij                     read representation of code trees (see
538205194Sdelphij                        subsection below)
539205194Sdelphij                  loop (until end of block code recognized)
540205194Sdelphij                     decode literal/length value from input stream
541205194Sdelphij                     if value < 256
542205194Sdelphij                        copy value (literal byte) to output stream
543205194Sdelphij                     otherwise
544205194Sdelphij                        if value = end of block (256)
545205194Sdelphij                           break from loop
546205194Sdelphij                        otherwise (value = 257..285)
547205194Sdelphij                           decode distance from input stream
548205194Sdelphij
549205194Sdelphij                           move backwards distance bytes in the output
550205194Sdelphij                           stream, and copy length bytes from this
551205194Sdelphij                           position to the output stream.
552205194Sdelphij                  end loop
553205194Sdelphij            while not last block
554205194Sdelphij
555205194Sdelphij         Note that a duplicated string reference may refer to a string
556205194Sdelphij         in a previous block; i.e., the backward distance may cross one
557205194Sdelphij         or more block boundaries.  However a distance cannot refer past
558205194Sdelphij         the beginning of the output stream.  (An application using a
559205194Sdelphij
560205194Sdelphij
561205194Sdelphij
562205194SdelphijDeutsch                      Informational                     [Page 10]
563205194Sdelphij
564205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
565205194Sdelphij
566205194Sdelphij
567205194Sdelphij         preset dictionary might discard part of the output stream; a
568205194Sdelphij         distance can refer to that part of the output stream anyway)
569205194Sdelphij         Note also that the referenced string may overlap the current
570205194Sdelphij         position; for example, if the last 2 bytes decoded have values
571205194Sdelphij         X and Y, a string reference with <length = 5, distance = 2>
572205194Sdelphij         adds X,Y,X,Y,X to the output stream.
573205194Sdelphij
574205194Sdelphij         We now specify each compression method in turn.
575205194Sdelphij
576205194Sdelphij      3.2.4. Non-compressed blocks (BTYPE=00)
577205194Sdelphij
578205194Sdelphij         Any bits of input up to the next byte boundary are ignored.
579205194Sdelphij         The rest of the block consists of the following information:
580205194Sdelphij
581205194Sdelphij              0   1   2   3   4...
582205194Sdelphij            +---+---+---+---+================================+
583205194Sdelphij            |  LEN  | NLEN  |... LEN bytes of literal data...|
584205194Sdelphij            +---+---+---+---+================================+
585205194Sdelphij
586205194Sdelphij         LEN is the number of data bytes in the block.  NLEN is the
587205194Sdelphij         one's complement of LEN.
588205194Sdelphij
589205194Sdelphij      3.2.5. Compressed blocks (length and distance codes)
590205194Sdelphij
591205194Sdelphij         As noted above, encoded data blocks in the "deflate" format
592205194Sdelphij         consist of sequences of symbols drawn from three conceptually
593205194Sdelphij         distinct alphabets: either literal bytes, from the alphabet of
594205194Sdelphij         byte values (0..255), or <length, backward distance> pairs,
595205194Sdelphij         where the length is drawn from (3..258) and the distance is
596205194Sdelphij         drawn from (1..32,768).  In fact, the literal and length
597205194Sdelphij         alphabets are merged into a single alphabet (0..285), where
598205194Sdelphij         values 0..255 represent literal bytes, the value 256 indicates
599205194Sdelphij         end-of-block, and values 257..285 represent length codes
600205194Sdelphij         (possibly in conjunction with extra bits following the symbol
601205194Sdelphij         code) as follows:
602205194Sdelphij
603205194Sdelphij
604205194Sdelphij
605205194Sdelphij
606205194Sdelphij
607205194Sdelphij
608205194Sdelphij
609205194Sdelphij
610205194Sdelphij
611205194Sdelphij
612205194Sdelphij
613205194Sdelphij
614205194Sdelphij
615205194Sdelphij
616205194Sdelphij
617205194Sdelphij
618205194SdelphijDeutsch                      Informational                     [Page 11]
619205194Sdelphij
620205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
621205194Sdelphij
622205194Sdelphij
623205194Sdelphij                 Extra               Extra               Extra
624205194Sdelphij            Code Bits Length(s) Code Bits Lengths   Code Bits Length(s)
625205194Sdelphij            ---- ---- ------     ---- ---- -------   ---- ---- -------
626205194Sdelphij             257   0     3       267   1   15,16     277   4   67-82
627205194Sdelphij             258   0     4       268   1   17,18     278   4   83-98
628205194Sdelphij             259   0     5       269   2   19-22     279   4   99-114
629205194Sdelphij             260   0     6       270   2   23-26     280   4  115-130
630205194Sdelphij             261   0     7       271   2   27-30     281   5  131-162
631205194Sdelphij             262   0     8       272   2   31-34     282   5  163-194
632205194Sdelphij             263   0     9       273   3   35-42     283   5  195-226
633205194Sdelphij             264   0    10       274   3   43-50     284   5  227-257
634205194Sdelphij             265   1  11,12      275   3   51-58     285   0    258
635205194Sdelphij             266   1  13,14      276   3   59-66
636205194Sdelphij
637205194Sdelphij         The extra bits should be interpreted as a machine integer
638205194Sdelphij         stored with the most-significant bit first, e.g., bits 1110
639205194Sdelphij         represent the value 14.
640205194Sdelphij
641205194Sdelphij                  Extra           Extra               Extra
642205194Sdelphij             Code Bits Dist  Code Bits   Dist     Code Bits Distance
643205194Sdelphij             ---- ---- ----  ---- ----  ------    ---- ---- --------
644205194Sdelphij               0   0    1     10   4     33-48    20    9   1025-1536
645205194Sdelphij               1   0    2     11   4     49-64    21    9   1537-2048
646205194Sdelphij               2   0    3     12   5     65-96    22   10   2049-3072
647205194Sdelphij               3   0    4     13   5     97-128   23   10   3073-4096
648205194Sdelphij               4   1   5,6    14   6    129-192   24   11   4097-6144
649205194Sdelphij               5   1   7,8    15   6    193-256   25   11   6145-8192
650205194Sdelphij               6   2   9-12   16   7    257-384   26   12  8193-12288
651205194Sdelphij               7   2  13-16   17   7    385-512   27   12 12289-16384
652205194Sdelphij               8   3  17-24   18   8    513-768   28   13 16385-24576
653205194Sdelphij               9   3  25-32   19   8   769-1024   29   13 24577-32768
654205194Sdelphij
655205194Sdelphij      3.2.6. Compression with fixed Huffman codes (BTYPE=01)
656205194Sdelphij
657205194Sdelphij         The Huffman codes for the two alphabets are fixed, and are not
658205194Sdelphij         represented explicitly in the data.  The Huffman code lengths
659205194Sdelphij         for the literal/length alphabet are:
660205194Sdelphij
661205194Sdelphij                   Lit Value    Bits        Codes
662205194Sdelphij                   ---------    ----        -----
663205194Sdelphij                     0 - 143     8          00110000 through
664205194Sdelphij                                            10111111
665205194Sdelphij                   144 - 255     9          110010000 through
666205194Sdelphij                                            111111111
667205194Sdelphij                   256 - 279     7          0000000 through
668205194Sdelphij                                            0010111
669205194Sdelphij                   280 - 287     8          11000000 through
670205194Sdelphij                                            11000111
671205194Sdelphij
672205194Sdelphij
673205194Sdelphij
674205194SdelphijDeutsch                      Informational                     [Page 12]
675205194Sdelphij
676205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
677205194Sdelphij
678205194Sdelphij
679205194Sdelphij         The code lengths are sufficient to generate the actual codes,
680205194Sdelphij         as described above; we show the codes in the table for added
681205194Sdelphij         clarity.  Literal/length values 286-287 will never actually
682205194Sdelphij         occur in the compressed data, but participate in the code
683205194Sdelphij         construction.
684205194Sdelphij
685205194Sdelphij         Distance codes 0-31 are represented by (fixed-length) 5-bit
686205194Sdelphij         codes, with possible additional bits as shown in the table
687205194Sdelphij         shown in Paragraph 3.2.5, above.  Note that distance codes 30-
688205194Sdelphij         31 will never actually occur in the compressed data.
689205194Sdelphij
690205194Sdelphij      3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
691205194Sdelphij
692205194Sdelphij         The Huffman codes for the two alphabets appear in the block
693205194Sdelphij         immediately after the header bits and before the actual
694205194Sdelphij         compressed data, first the literal/length code and then the
695205194Sdelphij         distance code.  Each code is defined by a sequence of code
696205194Sdelphij         lengths, as discussed in Paragraph 3.2.2, above.  For even
697205194Sdelphij         greater compactness, the code length sequences themselves are
698205194Sdelphij         compressed using a Huffman code.  The alphabet for code lengths
699205194Sdelphij         is as follows:
700205194Sdelphij
701205194Sdelphij               0 - 15: Represent code lengths of 0 - 15
702205194Sdelphij                   16: Copy the previous code length 3 - 6 times.
703205194Sdelphij                       The next 2 bits indicate repeat length
704205194Sdelphij                             (0 = 3, ... , 3 = 6)
705205194Sdelphij                          Example:  Codes 8, 16 (+2 bits 11),
706205194Sdelphij                                    16 (+2 bits 10) will expand to
707205194Sdelphij                                    12 code lengths of 8 (1 + 6 + 5)
708205194Sdelphij                   17: Repeat a code length of 0 for 3 - 10 times.
709205194Sdelphij                       (3 bits of length)
710205194Sdelphij                   18: Repeat a code length of 0 for 11 - 138 times
711205194Sdelphij                       (7 bits of length)
712205194Sdelphij
713205194Sdelphij         A code length of 0 indicates that the corresponding symbol in
714205194Sdelphij         the literal/length or distance alphabet will not occur in the
715205194Sdelphij         block, and should not participate in the Huffman code
716205194Sdelphij         construction algorithm given earlier.  If only one distance
717205194Sdelphij         code is used, it is encoded using one bit, not zero bits; in
718205194Sdelphij         this case there is a single code length of one, with one unused
719205194Sdelphij         code.  One distance code of zero bits means that there are no
720205194Sdelphij         distance codes used at all (the data is all literals).
721205194Sdelphij
722205194Sdelphij         We can now define the format of the block:
723205194Sdelphij
724205194Sdelphij               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
725205194Sdelphij               5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
726205194Sdelphij               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)
727205194Sdelphij
728205194Sdelphij
729205194Sdelphij
730205194SdelphijDeutsch                      Informational                     [Page 13]
731205194Sdelphij
732205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
733205194Sdelphij
734205194Sdelphij
735205194Sdelphij               (HCLEN + 4) x 3 bits: code lengths for the code length
736205194Sdelphij                  alphabet given just above, in the order: 16, 17, 18,
737205194Sdelphij                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
738205194Sdelphij
739205194Sdelphij                  These code lengths are interpreted as 3-bit integers
740205194Sdelphij                  (0-7); as above, a code length of 0 means the
741205194Sdelphij                  corresponding symbol (literal/length or distance code
742205194Sdelphij                  length) is not used.
743205194Sdelphij
744205194Sdelphij               HLIT + 257 code lengths for the literal/length alphabet,
745205194Sdelphij                  encoded using the code length Huffman code
746205194Sdelphij
747205194Sdelphij               HDIST + 1 code lengths for the distance alphabet,
748205194Sdelphij                  encoded using the code length Huffman code
749205194Sdelphij
750205194Sdelphij               The actual compressed data of the block,
751205194Sdelphij                  encoded using the literal/length and distance Huffman
752205194Sdelphij                  codes
753205194Sdelphij
754205194Sdelphij               The literal/length symbol 256 (end of data),
755205194Sdelphij                  encoded using the literal/length Huffman code
756205194Sdelphij
757205194Sdelphij         The code length repeat codes can cross from HLIT + 257 to the
758205194Sdelphij         HDIST + 1 code lengths.  In other words, all code lengths form
759205194Sdelphij         a single sequence of HLIT + HDIST + 258 values.
760205194Sdelphij
761205194Sdelphij   3.3. Compliance
762205194Sdelphij
763205194Sdelphij      A compressor may limit further the ranges of values specified in
764205194Sdelphij      the previous section and still be compliant; for example, it may
765205194Sdelphij      limit the range of backward pointers to some value smaller than
766205194Sdelphij      32K.  Similarly, a compressor may limit the size of blocks so that
767205194Sdelphij      a compressible block fits in memory.
768205194Sdelphij
769205194Sdelphij      A compliant decompressor must accept the full range of possible
770205194Sdelphij      values defined in the previous section, and must accept blocks of
771205194Sdelphij      arbitrary size.
772205194Sdelphij
773205194Sdelphij4. Compression algorithm details
774205194Sdelphij
775205194Sdelphij   While it is the intent of this document to define the "deflate"
776205194Sdelphij   compressed data format without reference to any particular
777205194Sdelphij   compression algorithm, the format is related to the compressed
778205194Sdelphij   formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
779205194Sdelphij   since many variations of LZ77 are patented, it is strongly
780205194Sdelphij   recommended that the implementor of a compressor follow the general
781205194Sdelphij   algorithm presented here, which is known not to be patented per se.
782205194Sdelphij   The material in this section is not part of the definition of the
783205194Sdelphij
784205194Sdelphij
785205194Sdelphij
786205194SdelphijDeutsch                      Informational                     [Page 14]
787205194Sdelphij
788205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
789205194Sdelphij
790205194Sdelphij
791205194Sdelphij   specification per se, and a compressor need not follow it in order to
792205194Sdelphij   be compliant.
793205194Sdelphij
794205194Sdelphij   The compressor terminates a block when it determines that starting a
795205194Sdelphij   new block with fresh trees would be useful, or when the block size
796205194Sdelphij   fills up the compressor's block buffer.
797205194Sdelphij
798205194Sdelphij   The compressor uses a chained hash table to find duplicated strings,
799205194Sdelphij   using a hash function that operates on 3-byte sequences.  At any
800205194Sdelphij   given point during compression, let XYZ be the next 3 input bytes to
801205194Sdelphij   be examined (not necessarily all different, of course).  First, the
802205194Sdelphij   compressor examines the hash chain for XYZ.  If the chain is empty,
803205194Sdelphij   the compressor simply writes out X as a literal byte and advances one
804205194Sdelphij   byte in the input.  If the hash chain is not empty, indicating that
805205194Sdelphij   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
806205194Sdelphij   same hash function value) has occurred recently, the compressor
807205194Sdelphij   compares all strings on the XYZ hash chain with the actual input data
808205194Sdelphij   sequence starting at the current point, and selects the longest
809205194Sdelphij   match.
810205194Sdelphij
811205194Sdelphij   The compressor searches the hash chains starting with the most recent
812205194Sdelphij   strings, to favor small distances and thus take advantage of the
813205194Sdelphij   Huffman encoding.  The hash chains are singly linked. There are no
814205194Sdelphij   deletions from the hash chains; the algorithm simply discards matches
815205194Sdelphij   that are too old.  To avoid a worst-case situation, very long hash
816205194Sdelphij   chains are arbitrarily truncated at a certain length, determined by a
817205194Sdelphij   run-time parameter.
818205194Sdelphij
819205194Sdelphij   To improve overall compression, the compressor optionally defers the
820205194Sdelphij   selection of matches ("lazy matching"): after a match of length N has
821205194Sdelphij   been found, the compressor searches for a longer match starting at
822205194Sdelphij   the next input byte.  If it finds a longer match, it truncates the
823205194Sdelphij   previous match to a length of one (thus producing a single literal
824205194Sdelphij   byte) and then emits the longer match.  Otherwise, it emits the
825205194Sdelphij   original match, and, as described above, advances N bytes before
826205194Sdelphij   continuing.
827205194Sdelphij
828205194Sdelphij   Run-time parameters also control this "lazy match" procedure.  If
829205194Sdelphij   compression ratio is most important, the compressor attempts a
830205194Sdelphij   complete second search regardless of the length of the first match.
831205194Sdelphij   In the normal case, if the current match is "long enough", the
832205194Sdelphij   compressor reduces the search for a longer match, thus speeding up
833205194Sdelphij   the process.  If speed is most important, the compressor inserts new
834205194Sdelphij   strings in the hash table only when no match was found, or when the
835205194Sdelphij   match is not "too long".  This degrades the compression ratio but
836205194Sdelphij   saves time since there are both fewer insertions and fewer searches.
837205194Sdelphij
838205194Sdelphij
839205194Sdelphij
840205194Sdelphij
841205194Sdelphij
842205194SdelphijDeutsch                      Informational                     [Page 15]
843205194Sdelphij
844205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
845205194Sdelphij
846205194Sdelphij
847205194Sdelphij5. References
848205194Sdelphij
849205194Sdelphij   [1] Huffman, D. A., "A Method for the Construction of Minimum
850205194Sdelphij       Redundancy Codes", Proceedings of the Institute of Radio
851205194Sdelphij       Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
852205194Sdelphij
853205194Sdelphij   [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
854205194Sdelphij       Compression", IEEE Transactions on Information Theory, Vol. 23,
855205194Sdelphij       No. 3, pp. 337-343.
856205194Sdelphij
857205194Sdelphij   [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
858205194Sdelphij       available in ftp://ftp.uu.net/pub/archiving/zip/doc/
859205194Sdelphij
860205194Sdelphij   [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
861205194Sdelphij       available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
862205194Sdelphij
863205194Sdelphij   [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
864205194Sdelphij       encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
865205194Sdelphij
866205194Sdelphij   [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
867205194Sdelphij       Comm. ACM, 33,4, April 1990, pp. 449-459.
868205194Sdelphij
869205194Sdelphij6. Security Considerations
870205194Sdelphij
871205194Sdelphij   Any data compression method involves the reduction of redundancy in
872205194Sdelphij   the data.  Consequently, any corruption of the data is likely to have
873205194Sdelphij   severe effects and be difficult to correct.  Uncompressed text, on
874205194Sdelphij   the other hand, will probably still be readable despite the presence
875205194Sdelphij   of some corrupted bytes.
876205194Sdelphij
877205194Sdelphij   It is recommended that systems using this data format provide some
878205194Sdelphij   means of validating the integrity of the compressed data.  See
879205194Sdelphij   reference [3], for example.
880205194Sdelphij
881205194Sdelphij7. Source code
882205194Sdelphij
883205194Sdelphij   Source code for a C language implementation of a "deflate" compliant
884205194Sdelphij   compressor and decompressor is available within the zlib package at
885205194Sdelphij   ftp://ftp.uu.net/pub/archiving/zip/zlib/.
886205194Sdelphij
887205194Sdelphij8. Acknowledgements
888205194Sdelphij
889205194Sdelphij   Trademarks cited in this document are the property of their
890205194Sdelphij   respective owners.
891205194Sdelphij
892205194Sdelphij   Phil Katz designed the deflate format.  Jean-Loup Gailly and Mark
893205194Sdelphij   Adler wrote the related software described in this specification.
894205194Sdelphij   Glenn Randers-Pehrson converted this document to RFC and HTML format.
895205194Sdelphij
896205194Sdelphij
897205194Sdelphij
898205194SdelphijDeutsch                      Informational                     [Page 16]
899205194Sdelphij
900205194SdelphijRFC 1951      DEFLATE Compressed Data Format Specification      May 1996
901205194Sdelphij
902205194Sdelphij
903205194Sdelphij9. Author's Address
904205194Sdelphij
905205194Sdelphij   L. Peter Deutsch
906205194Sdelphij   Aladdin Enterprises
907205194Sdelphij   203 Santa Margarita Ave.
908205194Sdelphij   Menlo Park, CA 94025
909205194Sdelphij
910205194Sdelphij   Phone: (415) 322-0103 (AM only)
911205194Sdelphij   FAX:   (415) 322-1734
912205194Sdelphij   EMail: <ghost@aladdin.com>
913205194Sdelphij
914205194Sdelphij   Questions about the technical content of this specification can be
915205194Sdelphij   sent by email to:
916205194Sdelphij
917205194Sdelphij   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
918205194Sdelphij   Mark Adler <madler@alumni.caltech.edu>
919205194Sdelphij
920205194Sdelphij   Editorial comments on this specification can be sent by email to:
921205194Sdelphij
922205194Sdelphij   L. Peter Deutsch <ghost@aladdin.com> and
923205194Sdelphij   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
924205194Sdelphij
925205194Sdelphij
926205194Sdelphij
927205194Sdelphij
928205194Sdelphij
929205194Sdelphij
930205194Sdelphij
931205194Sdelphij
932205194Sdelphij
933205194Sdelphij
934205194Sdelphij
935205194Sdelphij
936205194Sdelphij
937205194Sdelphij
938205194Sdelphij
939205194Sdelphij
940205194Sdelphij
941205194Sdelphij
942205194Sdelphij
943205194Sdelphij
944205194Sdelphij
945205194Sdelphij
946205194Sdelphij
947205194Sdelphij
948205194Sdelphij
949205194Sdelphij
950205194Sdelphij
951205194Sdelphij
952205194Sdelphij
953205194Sdelphij
954205194SdelphijDeutsch                      Informational                     [Page 17]
955205194Sdelphij
956