structure-indexes revision 362181
1This file describes the design, data model, and storage formats of FSFS
2index data.
3
4
5Design
6======
7
8Each pack and each rev file using logical addressing contains exactly
9two index sections.  One, the log-to-phys index, maps the (rev, item_index)
10pairs to absolute file offsets.  The other, phys-to-log, is a reverse
11index that gives basic information on any file location.  This is enough
12to read and cache any data without traversing DAGs.
13
14Rev and pack files are immutable, so the same is true for index data.
15During a transaction or while packing a file, a proto index file gets
16written (actually, one log-to-phys and one phys-to-log).  They use a
17simpler, less compact format with fixed record lengths.  A proto index
18basically aggregates all the information that must later be transformed
19into the final index.
20
21
22General design concerns
23-----------------------
24
25In Subversion, there is no limit to the size of a revision; even practical
26limits are in the order of millions of changes at least.  Index data for
27these would be multiple megabytes in size with pack file indexes possibly
28approaching 1 GB.  To ensure we still get roughly O(1) access time, we
29need a hierarchical data structure.
30
31Therefore, the indexes will start with a header containing an array of
32references to sub-sections or pages.  The length of these pages varies
33but is limited to a size configurable in fsfs.conf.
34
35Finally, it is assumed that whole pages can be cached efficiently and
36with a high cache hit rate.  So, although a page may have a thousand or
37more entries, the access time is still amortized O(1) in many scenarios.
38
39
40Items and item types
41--------------------
42
43The index implementation treats item_index and item type as simple ints,
44except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
45Since they have been defined as 0, the code may test for "used" etc.
46by simply comparing with 0.
47
48See section "addressing modes" in structure to a list of item types
49and pre-defined item_index values.
50
51
52Encoding
53--------
54
55The final index data format is tuned for space and decoding efficiency.
56Indexes are stored as a sequence of variable integers.  The encoding is
57as follows:
58
59* Unsigned integers are stored in little endian order with a variable
60  length 7b/8b encoding.  If most significant bit a byte has been set,
61  the next byte has also belongs to the same value.
62
63  0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits)
64  0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits)
65  0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits)
66  0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits)
67
68  Technically, we can represent integers of arbitrary lengths.  Currently,
69  we only generate and parse up to 64 bits.
70
71* Signed integers are mapped onto the unsigned value space as follows:
72
73  x >= 0 ->  2 * x
74  x < 0  -> -2 * x - 1
75
76  Again, we can represent arbitrary length numbers that way but the code
77  is currently restricted to 64 bits.
78
79Most data is unsigned by nature but will be stored differentially using
80signed integers.
81
82
83Encoding in proto-index files
84-----------------------------
85
86These have a much simpler encoding.  Throughout the files, all records have
87the same length (but different between L2P and P2L).  All records contain
88unsigned 64 bit integers only, stored in little endian byte order.
89
90
91Log-to-phys index
92=================
93
94This index has to map (rev, item_index) -> offset.  It assumes that the
95item_index values per revision are dense and start at 0.  There may be
96unused item_index values, though; the data structure simply gets less
97space-efficient when the more sparse the value space gets.
98
99
100Index data model
101----------------
102
103hierarchy:
104
105  header -> per-revision info -> page -> offset
106
107  There is one entry per revision in the header.  Per revision there are
108  one or more pages (exclusive to that revision) containing up to a known,
109  fixed limit of entries (= page size).  The total access path is:
110
111  pages = header->pages[revision];
112  offsets = page = pages[item_index / page_size];
113  offset = offsets[item_index % page_size];
114
115  Different log-to-phys indexes in the same repository may have different
116  page sizes but within any given index, the page size is the same and
117  immutable.
118
119header:
120
121  <first revision>   ... first revision covered by this index
122  <revision count>   ... number of revision covered by this index
123  <page size>        ... maximum number of entries per page
124  <page table index> ... array, for each revision containing the index in
125                         <page table> of the first page that belongs to
126                         this revision.  This has <revision count>+1
127                         entries to terminate the last revision.
128  <page table>       ... array of page headers.  It has
129                         <page table index>[<revision count>] entries.
130
131page table:
132
133  <offset>           ... absolute position of the page contents within the
134                         index
135  <entry count>      ... number of offset entries in the page.
136                         Must match <header>.<page size> unless this is
137                         the last page for the respective revision.
138  <size>             ... length in bytes of the on-disk page description.
139                         Note that this is redundant with the <offset>.
140
141page:
142
143  <entry count>      ... number of offset entries in the page.
144                         Must match <header>.<page size> unless this is
145                         the last page for the respective revision.
146                         Redundant with <page table>.<entry count>
147  <offsets>          ... array of absolute file positions within the rev /
148                         pack file.  This has <entry count> entries.
149
150                         
151Index on-disk format
152--------------------
153
154  index := "L2P-INDEX\n" header revisions pages offsets
155
156  header := u(<header>.<first revision>) \
157            u(<header>.<page size>) \
158            u(<header>.<revision count>) \
159            u(s(<header>.<page table>))
160
161  revisions := u(  <header>.<page table index>[k+1]
162                 - <header>.<page table index>[k]),
163               for k in 0 .. <header>.<revision count>-1
164
165  pages := u(<header>.<page table>[k].<size>) \
166           u(<header>.<page table>[k].<entry count>),
167           for k in 0 .. s(<header>.<page table>)-1
168
169  offsets := page(k),
170             for k in 0 .. s(<header>.<page table>)-1
171
172  page(k) := i(<header>.<page table>[k].<offsets>[0]) \
173             i(  <header>.<page table>[k].<offsets>[l] \
174               - <header>.<page table>[k].<offsets>[l - 1]),
175             for l in 1 .. s(<header>.<page table>[k].<entry count>)-1
176
177  u(x) ... unsigned int x in 7b/8b encoding
178  i(x) ... signed int x in 7b/8b encoding
179  s(x) ... number of entries in array x
180
181
182Proto index file format
183-----------------------
184
185The index will be created from a "revision-less" proto index file
186containing (<offset><item_index>) pairs only.
187
188All mappings belonging to the same revision must be written in one go but
189there is no restriction on the order of those entries.  To signify that
190a new revision begins, a (0, 0) mapping must be written.  A (0, 0) entry
191at the beginning of the file is optional and will be ignored.
192
193  <bof>         /* begin of proto index file for revision r and following */
194  (0, 0)        /* mark start of revision r, optional for first rev */
195  (off, item)*  /* zero or more mappings in random order */
196  (0, 0)        /* mark start of revision r + 1 */
197  (off, item)*  /* zero or more mappings in random order */
198  (0, 0)        /* mark start of revision r + 2 */
199  (off, item)*  /* zero or more mappings in random order */
200  ...
201  <eof>         /* end of file. */
202
203All entries are pairs of 64 bit unsigned integers in little endian order.
204
205
206Phys-to-log index
207=================
208
209This index has to map offset -> (rev, item_index, type, len, checksum).
210
211
212Index data model
213----------------
214
215hierarchy:
216
217  header -> page -> item info
218
219  Logically, the index splits up index rev / pack file into pages of a
220  fixed size.  That page size may differ from the FS's block size.  The
221  index will describe each rev / pack file page with one index page.
222
223  page = header->pages[offset % page_size];
224  item info = binary_search(page.data, offset)
225
226  Note that the current implementation will not return any data if the
227  offset is does not match any actual item start.  To simplify the lookup,
228  the last index page will have an "unused item" entry for the section
229  behind EOF.  Holes aren't allowed as well, i.e. every byte of the rev /
230  pack is expected to be covered by the index.
231
232  Also, there may be items stretching across page borders or even over
233  multiple pages.  The data model solves this issue by storing the item
234  descriptions as a "primary array" and then representing the pages as
235  ranges within that array.  Thus multiple pages may reference the same
236  item description.
237
238header:
239
240  <first revision>   ... first revision covered by this index
241  <file size>        ... size of the rev / pack file in bytes
242  <page size>        ... number of bytes in the rev / pack file covered by
243                         each index page
244  <page count>       ... number of pages
245  <offsets>          ... array of page offsets, i.e. locations the page
246                         data within the index.
247                         This array has <page count> + 1 entries.
248
249page:
250
251  <entries>          ... array of item descriptions, ordered by offset.
252                         First and last entry may cross page boundaries.
253
254entry:
255
256  <offset>           ... absolute position in the pack / rev file
257  <size>             ... on-disk size of the item in the pack / rev file
258  <type>             ... item type
259  <FNV checksum>     ... modified 32 bit FNV-1a checksum of that section
260                         of the pack / rev file (see below). 0 for empty
261                         or zero-length items
262  <revision>         ... revision that this item belongs to
263  <item_index>       ... item_index within that revision
264
265
266Index on-disk format
267--------------------
268
269  index := "P2L-INDEX\n" header pages items
270
271  header := u(<header>.<first revision>) \
272            u(<header>.<file size>) \
273            u(<header>.<page size>) \
274            u(<header>.<page count>)
275
276  pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
277           for k in 0 .. <header>.<page count> -1
278
279  items := u(<items in page k>[0].<offset>) \
280           u(<items in page k>[l].<size>) \
281           i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
282           i(  <items in page k>[l].<revision>
283             - <items in page k>[l-1].<revision>), \
284           u(FNV checksum)
285           for l in 0 .. s(<items in page k>)-1,
286           for k in 0 .. <header>.<page count>-1
287
288  u(x) ... unsigned int x in 7b/8b encoding
289  i(x) ... signed int x in 7b/8b encoding
290  s(x) ... number of entries in collection x
291  c(x) ... compound function := x.<item_index> * 8 + x.<type>
292
293  Access to negative indexes gives a 0 value.
294
295  <Items in page k> are in strict ascending offset order.  Items that
296  started after the begin of a given page and overlap with the next page
297  will not be stored in the start page.  The runtime representation will
298  duplicate items overlapping page boundaries; the on-disk representation
299  will not.  Thus, pages inside large items will have zero entries on disk.
300
301
302Proto index file format
303-----------------------
304
305The index will be created from a proto index file containing simple
306instances of svn_fs_fs__p2l_entry_t with the following element order:
307
308  item offset               as uint64
309  item size                 as uint64
310  item type                 as uint64
311  modified FNV1a checksum   as uint64
312  revision                  as uint64, with SVN_INVALID_REVNUM mapped to 0
313                                       and revisions >= 0 stored as rev+1
314  item index                as uint64
315
316All values are stored in little endian order.
317
318Page table and header information, except start revision and page size,
319can easily be derived from that information.
320
321All entries must be written in strict offset order.  Overlapping entries
322are not allowed; zero-length items are.
323
324In transactions, the final revision number may not be known when writing
325the proto index file (e.g. while still writing the proto rev file).  Items
326with revision set to SVN_INVALID_REVNUM will therefore be automatically
327updated when creating the final index.  This is possible in conjunction
328with rev files but not for pack files.
329
330
331FNV checksum
332------------
333
334FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
335For performance reasons we use a modified version:
336
337* split the input byte stream [b0 .. bN] into 4 sub-streams of equal
338  length and up to 3 remnants:
339
340  [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]
341
342* calculate 32 bit FNV-1a checksums for the 4 substreams:
343
344  h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])
345
346* concatenate the big endian representation of these checksums (4 bytes
347  each) plus the remnant of the original stream into a 16 to 19 byte long
348  intermediate:
349
350  [i0 .. iK] = [big-endian(h0) ... big-endian(h3) remnant ], 16 <= K+1 <= 19
351
352* fold the variable-length intermediate into a compact 32 bit checksum:
353
354  FNV checksum = fnv_1a([i0 .. iK])
355