@c This is part of the paxutils manual.
@c Copyright (C) 2006 Free Software Foundation, Inc.
@c This file is distributed under GFDL 1.1 or any later version
@c published by the Free Software Foundation.

@cindex sparse formats
@cindex sparse versions
The notion of a sparse file, and the ways of handling such files from
the point of view of a @GNUTAR{} user, are described in detail in
@ref{sparse}.  This chapter describes the internal formats @GNUTAR{}
uses to store such files.

The support for sparse files in @GNUTAR{} has a long history.  The
earliest version featuring this support that I was able to find was 1.09,
released in November, 1990.  The format introduced back then is called
the @dfn{old GNU} sparse format.  In spite of the many flaws in its
design, it was the only format @GNUTAR{} supported until version 1.14
(May, 2004), which introduced initial support for sparse files in
@acronym{PAX} archives (@pxref{posix}).  This format was not free from
design flaws either, and it was subsequently improved in versions
1.15.2 (November, 2005) and 1.15.92 (June, 2006).

In addition to the GNU sparse formats, @GNUTAR{} is able to read and
extract sparse files archived by @command{star}.

The following subsections describe each format in detail.

@menu
* Old GNU Format::
* PAX 0::                PAX Format, Versions 0.0 and 0.1
* PAX 1::                PAX Format, Version 1.0
@end menu

@node Old GNU Format
@appendixsubsec Old GNU Format

@cindex sparse formats, Old GNU
@cindex Old GNU sparse format
This format was introduced some time around 1990 (v. 1.09).  It was
designed on top of standard @code{ustar} headers in such an
unfortunate way that some of its fields overwrote fields required by
POSIX.

An old GNU sparse header is designated by type @samp{S}
(@code{GNUTYPE_SPARSE}) and has the following layout:

@multitable @columnfractions 0.10 0.10 0.20 0.20 0.40
@headitem Offset @tab Size @tab Name   @tab Data type   @tab Contents
@item          0 @tab 345  @tab        @tab N/A         @tab Not used.
@item        345 @tab  12  @tab atime  @tab Number      @tab @code{atime} of the file.
@item        357 @tab  12  @tab ctime  @tab Number      @tab @code{ctime} of the file.
@item        369 @tab  12  @tab offset @tab Number      @tab For
multivolume archives: the offset of the start of this volume.
@item        381 @tab   4  @tab        @tab N/A         @tab Not used.
@item        385 @tab   1  @tab        @tab N/A         @tab Not used.
@item        386 @tab  96  @tab sp     @tab @code{sparse_header} @tab (4 entries) File map.
@item        482 @tab   1  @tab isextended @tab Bool        @tab @code{1} if an
extension sparse header follows, @code{0} otherwise.
@item        483 @tab  12  @tab realsize @tab Number      @tab Real size of the file.
@end multitable

Each @code{sparse_header} object at offset 386 describes a single
data chunk.  It has the following structure:

@multitable @columnfractions 0.10 0.10 0.20 0.60
@headitem Offset @tab Size @tab Data type   @tab Contents
@item          0 @tab   12 @tab Number      @tab Offset of the
beginning of the chunk.
@item         12 @tab   12 @tab Number      @tab Size of the chunk.
@end multitable

If the member contains more than four chunks, the @code{isextended}
field of the header has the value @code{1} and the main header is
followed by one or more @dfn{extension headers}.  Each such header has
the following structure:

@multitable @columnfractions 0.10 0.10 0.20 0.20 0.40
@headitem Offset @tab Size @tab Name   @tab Data type   @tab Contents
@item          0 @tab  504 @tab sp     @tab @code{sparse_header} @tab
(21 entries) File map.
@item        504 @tab    1 @tab isextended @tab Bool    @tab @code{1} if an
extension sparse header follows, or @code{0} otherwise.
@end multitable

A header with @code{isextended=0} ends the map.
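The layout above can be decoded mechanically.  The following Python
sketch is an illustration only, not the code @GNUTAR{} uses.  All
function names are made up for this example.  It assumes that every
@samp{Number} field holds a NUL- or space-terminated octal string, that
a slot whose size field starts with a NUL byte is unused, and that the
@code{isextended} flag is set unless the byte is 0 or the character
@samp{0}.

```python
def parse_number(field):
    """Decode a NUL- or space-terminated octal field; empty means 0."""
    text = bytes(field).split(b"\0", 1)[0].strip()
    return int(text, 8) if text else 0


def parse_entries(raw, count):
    """Decode up to `count` consecutive 24-byte sparse_header entries."""
    entries = []
    for i in range(count):
        entry = bytes(raw[i * 24:(i + 1) * 24])
        if entry[12:13] == b"\0":
            break  # assumption: a NUL here marks an unused slot
        entries.append((parse_number(entry[:12]), parse_number(entry[12:24])))
    return entries


def is_set(flag):
    """Assumption: the flag is set unless it is byte 0 or the character '0'."""
    return flag not in (0, ord("0"))


def parse_old_gnu_map(blocks):
    """Return (realsize, chunks) from a list of 512-byte blocks.

    blocks[0] is the type 'S' header; the extension headers announced
    by the isextended flags follow it in order.
    """
    main = blocks[0]
    chunks = parse_entries(main[386:482], 4)      # sp: 4 entries
    realsize = parse_number(main[483:495])        # realsize
    extended = is_set(main[482])                  # isextended
    index = 1
    while extended:
        ext = blocks[index]
        chunks.extend(parse_entries(ext[0:504], 21))  # sp: 21 entries
        extended = is_set(ext[504])
        index += 1
    return realsize, chunks
```

Given the type @samp{S} header block and the extension blocks that
follow it, @code{parse_old_gnu_map} returns the real file size and the
list of (offset, size) pairs in map order.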

@node PAX 0
@appendixsubsec PAX Format, Versions 0.0 and 0.1

@cindex sparse formats, v.0.0
There are two formats available in this branch.  Version @code{0.0}
is the initial version of the sparse format, used by @command{tar}
versions 1.14--1.15.1.  The sparse file map is kept in extended
(@code{x}) PAX header variables:

@table @code
@vrindex GNU.sparse.size, extended header variable
@item GNU.sparse.size
Real size of the stored file

@item GNU.sparse.numblocks
@vrindex GNU.sparse.numblocks, extended header variable
Number of blocks in the sparse map

@item GNU.sparse.offset
@vrindex GNU.sparse.offset, extended header variable
Offset of the data block

@item GNU.sparse.numbytes
@vrindex GNU.sparse.numbytes, extended header variable
Size of the data block
@end table

The latter two variables repeat for each data block, so the overall
structure is like this:

@smallexample
@group
GNU.sparse.size=@var{size}
GNU.sparse.numblocks=@var{numblocks}
repeat @var{numblocks} times
  GNU.sparse.offset=@var{offset}
  GNU.sparse.numbytes=@var{numbytes}
end repeat
@end group
@end smallexample

This format presented the following two problems:

@enumerate 1
@item
Whereas the POSIX specification allows a variable to appear multiple
times in a header, it requires that only the last occurrence be
meaningful.  Thus, multiple occurrences of @code{GNU.sparse.offset} and
@code{GNU.sparse.numbytes} conflict with the POSIX specification.

@item
Attempting to extract such archives using third-party @command{tar}
implementations results in extraction of sparse files in
@emph{condensed form}.  If the @command{tar} implementation in
question does not support POSIX format, it will also extract a file
containing the extended header attributes.  This file can be used to
expand the sparse file to its original state.  However, POSIX-aware
@command{tar} implementations will usually ignore the unknown
variables, which makes restoring the file more
difficult.  @xref{extracting sparse v.0.x, Extraction of sparse
members in v.0.0 format}, for a detailed description of how to
restore such members using non-GNU @command{tar}s.
@end enumerate
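The first problem is easy to see in code.  The Python sketch below is
only an illustration with made-up function names.  It assumes the
extended header has already been split into an ordered list of
(keyword, value) records and that the values are decimal strings.  The
first function rebuilds the map by pairing each
@code{GNU.sparse.offset} with the @code{GNU.sparse.numbytes} that
follows it.  The second shows what a reader keeps when it honours only
the last occurrence of each keyword.

```python
def sparse_map_v00(records):
    """Build (realsize, chunks) from ordered (keyword, value) records."""
    realsize = None
    numblocks = None
    pending_offset = None
    chunks = []
    for keyword, value in records:
        if keyword == "GNU.sparse.size":
            realsize = int(value)
        elif keyword == "GNU.sparse.numblocks":
            numblocks = int(value)
        elif keyword == "GNU.sparse.offset":
            pending_offset = int(value)
        elif keyword == "GNU.sparse.numbytes":
            if pending_offset is None:
                raise ValueError("numbytes without a preceding offset")
            chunks.append((pending_offset, int(value)))
            pending_offset = None
    if numblocks is not None and numblocks != len(chunks):
        raise ValueError("expected %d chunks, found %d"
                         % (numblocks, len(chunks)))
    return realsize, chunks


def last_occurrence_only(records):
    """Keep one value per keyword, the last one, as POSIX prescribes."""
    return dict(records)
```

With two data blocks in the header, @code{sparse_map_v00} returns both
pairs, whereas @code{last_occurrence_only} retains only the final
offset and size, so the earlier blocks are lost.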

@cindex sparse formats, v.0.1
@GNUTAR{} 1.15.2 introduced sparse format version @code{0.1}, which
attempted to solve these problems.  Like its predecessor, this format
stores the sparse map in the extended POSIX header.  It retains the
@code{GNU.sparse.size} and @code{GNU.sparse.numblocks} variables, but
instead of @code{GNU.sparse.offset}/@code{GNU.sparse.numbytes} pairs
it uses a single variable:

@table @code
@item GNU.sparse.map
@vrindex GNU.sparse.map, extended header variable
Map of non-null data chunks.  It is a string consisting of
comma-separated values "@var{offset},@var{size}[,@var{offset-1},@var{size-1}...]".
@end table

To address the second problem, the @code{name} field in the
@code{ustar} header is replaced with a special name, constructed using
the following pattern:

@smallexample
%d/GNUSparseFile.%p/%f
@end smallexample
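This section does not spell out the placeholders.  Elsewhere in the
@GNUTAR{} documentation @samp{%d} stands for the directory part of the
member name, @samp{%f} for its last component and @samp{%p} for the
process ID of the archiving @command{tar}.  Taking that reading as an
assumption, the substitution can be sketched in Python as follows.
The function name is hypothetical.

```python
import posixpath


def sparse_member_name(path, pid):
    """Expand the pattern %d/GNUSparseFile.%p/%f for a member name.

    Assumed meanings: %d is the directory part of `path` ("." when
    there is none), %p is the process ID, %f is the last component.
    """
    directory = posixpath.dirname(path) or "."
    base = posixpath.basename(path)
    return "%s/GNUSparseFile.%d/%s" % (directory, pid, base)
```

For example, a member named @file{dir/sub/file.img} archived by
process 1234 would be stored as
@file{dir/sub/GNUSparseFile.1234/file.img}.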

@vrindex GNU.sparse.name, extended header variable
The real name of the sparse file is stored in the variable
@code{GNU.sparse.name}.  Thus, those @command{tar} implementations
that are not aware of GNU extensions will at least extract the files
into separate directories, giving the user a chance to expand them
afterwards.  @xref{extracting sparse v.0.x, Extraction of sparse
members in v.0.1 format}, for a detailed description of how to
restore such members using non-GNU @command{tar}s.

The resulting @code{GNU.sparse.map} string can be @emph{very} long.
Although POSIX does not impose any limit on the length of an @code{x}
header variable, such a long value may confuse some @command{tar}
implementations.
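Decoding a @code{GNU.sparse.map} value is straightforward.  The Python
sketch below uses a made-up function name and assumes the offsets and
sizes are written as decimal numbers.  It splits the string on commas,
pairs the fields up and, if the value of @code{GNU.sparse.numblocks}
is supplied, checks it against the number of pairs found.

```python
def parse_sparse_map(value, numblocks=None):
    """Turn "offset,size,offset,size,..." into a list of (offset, size)."""
    fields = value.split(",") if value else []
    if len(fields) % 2 != 0:
        raise ValueError("map must contain an even number of fields")
    numbers = [int(field) for field in fields]  # assumption: decimal
    chunks = list(zip(numbers[0::2], numbers[1::2]))
    if numblocks is not None and numblocks != len(chunks):
        raise ValueError("expected %d chunks, found %d"
                         % (numblocks, len(chunks)))
    return chunks
```

A call such as @code{parse_sparse_map("0,512,4096,1024", 2)} yields the
pairs (0, 512) and (4096, 1024).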

@node PAX 1
@appendixsubsec PAX Format, Version 1.0

@cindex sparse formats, v.1.0
Version @code{1.0} of the sparse format was introduced with @GNUTAR{}
1.15.92.  Its main objective was to make the resulting file
extractable with little effort even by non-POSIX-aware @command{tar}
implementations.  Starting from this version, the extended header
preceding a sparse member always contains the following variables that
identify the format being used:

@table @code
@item GNU.sparse.major
@vrindex GNU.sparse.major, extended header variable
Major version

@item GNU.sparse.minor
@vrindex GNU.sparse.minor, extended header variable
Minor version
@end table

The @code{name} field in the @code{ustar} header contains a special
name, constructed using the following pattern:

@smallexample
%d/GNUSparseFile.%p/%f
@end smallexample

@vrindex GNU.sparse.name, extended header variable, in v.1.0
@vrindex GNU.sparse.realsize, extended header variable
The real name of the sparse file is stored in the variable
@code{GNU.sparse.name}.  The real size of the file is stored in the
variable @code{GNU.sparse.realsize}.

The sparse map itself is stored in the file data block, preceding the actual
file data.  It consists of a series of decimal numbers of arbitrary length,
delimited by newlines.  The map is padded with nulls to the nearest block
boundary.

The first number gives the number of entries in the map.  It is followed
by the map entries, each consisting of two numbers giving the offset and
size of the data block it describes.

The format is designed in such a way that non-POSIX-aware
@command{tar} implementations, and those not supporting
@code{GNU.sparse.*} keywords, will extract each sparse file in its
condensed form with the file map prepended and will place it into a
separate directory.  Then, using a simple program, it is possible to
expand the file to its original form even without @GNUTAR{}.
@xref{Sparse Recovery}, for detailed information on how to extract
sparse members without @GNUTAR{}.
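To give an idea of how small such a program can be, here is a Python
sketch of the expansion step.  It is an illustration under stated
assumptions, not a replacement for the procedure referenced above.
The function names are invented, blocks are taken to be 512 bytes
long, and the data chunks are assumed to follow the padded map back to
back in map order.  Numbers are read in base 10, which is my
understanding of what @GNUTAR{} writes; pass another @code{base} if
your archives differ.  The real size must be supplied by the caller,
for instance from @code{GNU.sparse.realsize}.

```python
BLOCK = 512  # assumed tar block size in bytes


def read_map_v10(data, base=10):
    """Return (chunks, map_length) for a condensed v.1.0 member."""
    position = 0

    def next_number():
        # Each number is terminated by a newline.
        nonlocal position
        end = data.index(b"\n", position)
        number = int(data[position:end], base)
        position = end + 1
        return number

    count = next_number()
    chunks = []
    for _ in range(count):
        offset = next_number()
        size = next_number()
        chunks.append((offset, size))
    # The map is padded with nulls up to the next block boundary.
    map_length = -(-position // BLOCK) * BLOCK
    return chunks, map_length


def expand_v10(data, realsize, base=10):
    """Rebuild the original contents from a condensed v.1.0 member."""
    chunks, position = read_map_v10(data, base)
    output = bytearray(realsize)  # holes read back as zero bytes
    for offset, size in chunks:
        output[offset:offset + size] = data[position:position + size]
        position += size
    return bytes(output)
```

The whole file is held in memory here for brevity.  A practical tool
would rather seek in the output file and copy each chunk in pieces.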