.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.  Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags.  Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time.  File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files.  If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output.  In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files.  Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
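.PP
The suffix rules above can be sketched in Python.  This is an
illustration only, not the code that
.I bzip2
itself uses; the names SUFFIX_RULES and guess_output_name are
hypothetical.
.PP
.nf
```python
# Hypothetical helper that mirrors the suffix table in this page.
# It is a sketch, not the logic from the bzip2 sources.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name guessed from a compressed name."""
    for suffix, replacement in SUFFIX_RULES:
        if name.endswith(suffix):
            return name[: len(name) - len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append .out
    return name + ".out"
```
.fi
.PP
Under these rules, guess_output_name("backup.tbz2") gives "backup.tar"
and guess_output_name("notes.txt") gives "notes.txt.out".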
.PP
As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files.  The result is the
concatenation of the corresponding uncompressed files.  Integrity
testing (\-t)
of concatenated
compressed files is also supported.
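.PP
The concatenation behaviour can be tried out with a short sketch.
It assumes the bz2 module from the Python standard library (Python 3.3
or later), which stands in here for the
.I bunzip2
program; the variable names are illustrative.
.PP
.nf
```python
import bz2

# Compress two inputs separately, then join the compressed streams.
first = bz2.compress(b"first file contents;")
second = bz2.compress(b"second file contents;")
combined = first + second

# Decompressing the joined streams yields the joined plain data.
restored = bz2.decompress(combined)
```
.fi
.PP
Here restored holds the two original inputs back to back, in the order
in which the compressed streams were joined.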
.PP
You can also compress or decompress files to the standard output by
giving the \-c flag.  Multiple files may be compressed and
decompressed like this.  The resulting outputs are fed sequentially to
stdout.  Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations.  Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later.  Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 \-dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line.  This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original.  Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes.  Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
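.PP
The growth of tiny and incompressible inputs can be observed with a
small experiment.  This sketch assumes the bz2 module from the Python
standard library rather than the
.I bzip2
program, so the exact byte counts may differ slightly from the figures
quoted above.
.PP
.nf
```python
import bz2
import os

# A very short input: the fixed overhead dominates.
tiny = b"hello"
tiny_compressed = bz2.compress(tiny)

# Random bytes are incompressible, so the output grows a little.
random_data = os.urandom(100_000)
random_compressed = bz2.compress(random_data)
growth = len(random_compressed) - len(random_data)
expansion_percent = 100.0 * growth / len(random_data)
```
.fi
.PP
Both compressed results are longer than their inputs, and
expansion_percent gives the relative growth for the random sample.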
.PP
As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original.  This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely).  The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed.  Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong.  It can't help you
recover the original uncompressed
data.  You can use
.I bzip2recover
to try to recover data from
damaged files.
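.PP
The following sketch shows the CRC check rejecting a damaged stream.
It assumes the bz2 module from the Python standard library.  It also
assumes, from the bzip2 file format rather than from this page, that a
4-byte stream header and a 6-byte block marker precede the stored block
CRC, so that byte 10 belongs to that CRC.  The helper name is_intact is
hypothetical.
.PP
.nf
```python
import bz2

original = b"some data worth protecting;" * 100
good = bz2.compress(original)

# Assumption: byte 10 lies inside the stored 32-bit block CRC.
# Flipping one bit there makes the stored and computed CRCs differ.
damaged = bytearray(good)
damaged[10] ^= 0x01


def is_intact(blob):
    """Trial decompression: report failure, never repair anything."""
    try:
        bz2.decompress(blob)
    except (OSError, ValueError):
        return False
    return True
```
.fi
.PP
As the text says, the check only reports that something is wrong:
is_intact(good) succeeds and is_intact(bytes(damaged)) fails, but the
original data is not reconstructed.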
.PP
Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
done on the basis of which name is used.  This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files.  Normally,
.I bzip2
will not overwrite
existing output files.  Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing.  Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte.  This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio.  In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything.  See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages.  Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 to \-9
Set the block size to 100 k, 200 k ... 900 k when compressing.  Has no
effect when decompressing.  See MEMORY MANAGEMENT below.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash.  This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above.  They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful.  0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks.  The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression.  The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.  At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file.  Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size ) with \-s
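.PP
The estimates above can be turned into a small calculator.  This is a
sketch of the arithmetic only; the function name estimate_memory_kb and
its parameters are hypothetical, and it assumes that the 2.5 x figure is
the one that applies when decompressing with the small-memory option.
It does not model the fact that the small-memory option also changes
the block size during compression.
.PP
.nf
```python
def estimate_memory_kb(level, small=False):
    """Estimate memory use in kbytes for block-size flags 1 to 9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_k = 100 * level  # flag 1 means 100k blocks, flag 9 means 900k
    # Assumption: 2.5 bytes per block byte in small-memory decompression.
    factor = 2.5 if small else 4
    return {
        "compress": 400 + 8 * block_k,
        "decompress": 100 + factor * block_k,
    }
```
.fi
.PP
For the default 900k block size, estimate_memory_kb(9) gives 7600 for
compression and 3700 for decompression, and
estimate_memory_kb(9, small=True) gives 2350 for decompression.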
.PP
Larger block sizes give rapidly diminishing marginal returns.  Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress.  To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes.  Decompression speed is also halved, so you should use this
option only where necessary.  The relevant flag is \-s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved.  Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.  The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block.  For example, compressing a file
20,000 bytes long with the flag \-9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it.  Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different
block sizes.  Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6000k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900 kbytes long.  Each
block is handled independently.  If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.  Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.
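.PP
Searching for block boundaries can be sketched as a bit-by-bit scan.
This is an illustration of the idea, not the code of
.I bzip2recover.
The marker value 0x314159265359 and the most-significant-bit-first
packing are assumptions taken from the bzip2 file format, since this
page does not state them; the function name find_block_offsets is
hypothetical, and the sample data comes from the bz2 module of the
Python standard library.
.PP
.nf
```python
import bz2
import os

# Assumption: 48-bit block marker used by the bzip2 file format.
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1


def find_block_offsets(blob):
    """Return the bit offset at which each block marker starts."""
    offsets = []
    window = 0
    for bit_index in range(len(blob) * 8):
        byte = blob[bit_index >> 3]
        bit = (byte >> (7 - (bit_index & 7))) & 1
        # Keep only the most recent 48 bits in the sliding window.
        window = ((window << 1) | bit) & MASK_48
        if bit_index >= 47 and window == BLOCK_MAGIC:
            offsets.append(bit_index - 47)
    return offsets


# 120,000 random bytes with 100k blocks span more than one block.
sample = bz2.compress(os.urandom(120_000), compresslevel=1)
offsets = find_block_offsets(sample)
```
.fi
.PP
The first marker is found at bit 32, directly after the 4-byte stream
header.  Later markers need not fall on byte boundaries, which is why
the scan advances one bit at a time.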
.PP
.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file.  You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec0001file.bz2",
"rec0002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 \-dc rec*file.bz2 > recovered_data" -- lists the files in
the correct order.

.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks.  It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered.  If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file.  Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
compress more slowly than normal.  Versions 0.9.5 and above fare much
better than previous versions in this respect.  The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1.  You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion.  This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0 of
.I bzip2.
Compressed
data created by this version is entirely forwards and backwards
compatible with the previous public releases, versions 0.1pl2, 0.9.0
and 0.9.5,
but with the following exception: 0.9.0 and above can correctly
decompress multiple concatenated compressed files.  0.1pl2 cannot do
this; it will stop after decompressing just the first file in the
stream.

.I bzip2recover
uses 32-bit integers to represent bit positions in
compressed files, so it cannot handle compressed files more than 512
megabytes long.  This could easily be fixed.

.SH AUTHOR
Julian Seward, jseward@acm.org.

http://sourceware.cygnus.com/bzip2
.br
http://www.muraroa.demon.co.uk

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice.  See the manual in the
source distribution for pointers to sources of documentation.  Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression.  Bela Lubkin encouraged me to improve the
worst-case compression performance.  Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.