structure-indexes revision 362181
1This file describes the design, data model, and storage formats of FSFS 2index data. 3 4 5Design 6====== 7 8Each pack and each rev file using logical addressing contains exactly 9two index sections. One, the log-to-phys index, maps the (rev, item_index) 10pairs to absolute file offsets. The other, phys-to-log, is a reverse 11index that gives basic information on any file location. This is enough 12to read and cache any data without traversing DAGs. 13 14Rev and pack files are immutable, so the same is true for index data. 15During a transaction or while packing a file, a proto index file gets 16written (actually, one log-to-phys and one phys-to-log). They use a 17simpler, less compact format with fixed record lengths. A proto index 18basically aggregates all the information that must later be transformed 19into the final index. 20 21 22General design concerns 23----------------------- 24 25In Subversion, there is no limit to the size of a revision; even practical 26limits are in the order of millions of changes at least. Index data for 27these would be multiple megabytes in size with pack file indexes possibly 28approaching 1 GB. To ensure we still get roughly O(1) access time, we 29need a hierarchical data structure. 30 31Therefore, the indexes will start with a header containing an array of 32references to sub-sections or pages. The length of these pages varies 33but is limited to a size configurable in fsfs.conf. 34 35Finally, it is assumed that whole pages can be cached efficiently and 36with a high cache hit rate. So, although a page may have a thousand or 37more entries, the access time is still amortized O(1) in many scenarios. 38 39 40Items and item types 41-------------------- 42 43The index implementation treats item_index and item type as simple ints, 44except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED. 45Since they have been defined as 0, the code may test for "used" etc. 46by simply comparing with 0. 47 48See section "addressing modes" in structure to a list of item types 49and pre-defined item_index values. 50 51 52Encoding 53-------- 54 55The final index data format is tuned for space and decoding efficiency. 56Indexes are stored as a sequence of variable integers. The encoding is 57as follows: 58 59* Unsigned integers are stored in little endian order with a variable 60 length 7b/8b encoding. If most significant bit a byte has been set, 61 the next byte has also belongs to the same value. 62 63 0x00 .. 0x7f -> 0x00 .. 0x7f ( 7 bits stored in 8 bits) 64 0x80 .. 0xff -> 0x80 0x01 .. 0xff 0x01 (14 bits stored in 16 bits) 65 0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f (14 bits stored in 16 bits) 66 0x100000000 -> 0x80 0x80 0x80 0x80 0x10 (35 bits stored in 40 bits) 67 68 Technically, we can represent integers of arbitrary lengths. Currently, 69 we only generate and parse up to 64 bits. 70 71* Signed integers are mapped onto the unsigned value space as follows: 72 73 x >= 0 -> 2 * x 74 x < 0 -> -2 * x - 1 75 76 Again, we can represent arbitrary length numbers that way but the code 77 is currently restricted to 64 bits. 78 79Most data is unsigned by nature but will be stored differentially using 80signed integers. 81 82 83Encoding in proto-index files 84----------------------------- 85 86These have a much simpler encoding. Throughout the files, all records have 87the same length (but different between L2P and P2L). All records contain 88unsigned 64 bit integers only, stored in little endian byte order. 89 90 91Log-to-phys index 92================= 93 94This index has to map (rev, item_index) -> offset. It assumes that the 95item_index values per revision are dense and start at 0. There may be 96unused item_index values, though; the data structure simply gets less 97space-efficient when the more sparse the value space gets. 98 99 100Index data model 101---------------- 102 103hierarchy: 104 105 header -> per-revision info -> page -> offset 106 107 There is one entry per revision in the header. Per revision there are 108 one or more pages (exclusive to that revision) containing up to a known, 109 fixed limit of entries (= page size). The total access path is: 110 111 pages = header->pages[revision]; 112 offsets = page = pages[item_index / page_size]; 113 offset = offsets[item_index % page_size]; 114 115 Different log-to-phys indexes in the same repository may have different 116 page sizes but within any given index, the page size is the same and 117 immutable. 118 119header: 120 121 <first revision> ... first revision covered by this index 122 <revision count> ... number of revision covered by this index 123 <page size> ... maximum number of entries per page 124 <page table index> ... array, for each revision containing the index in 125 <page table> of the first page that belongs to 126 this revision. This has <revision count>+1 127 entries to terminate the last revision. 128 <page table> ... array of page headers. It has 129 <page table index>[<revision count>] entries. 130 131page table: 132 133 <offset> ... absolute position of the page contents within the 134 index 135 <entry count> ... number of offset entries in the page. 136 Must match <header>.<page size> unless this is 137 the last page for the respective revision. 138 <size> ... length in bytes of the on-disk page description. 139 Note that this is redundant with the <offset>. 140 141page: 142 143 <entry count> ... number of offset entries in the page. 144 Must match <header>.<page size> unless this is 145 the last page for the respective revision. 146 Redundant with <page table>.<entry count> 147 <offsets> ... array of absolute file positions within the rev / 148 pack file. This has <entry count> entries. 149 150 151Index on-disk format 152-------------------- 153 154 index := "L2P-INDEX\n" header revisions pages offsets 155 156 header := u(<header>.<first revision>) \ 157 u(<header>.<page size>) \ 158 u(<header>.<revision count>) \ 159 u(s(<header>.<page table>)) 160 161 revisions := u( <header>.<page table index>[k+1] 162 - <header>.<page table index>[k]), 163 for k in 0 .. <header>.<revision count>-1 164 165 pages := u(<header>.<page table>[k].<size>) \ 166 u(<header>.<page table>[k].<entry count>), 167 for k in 0 .. s(<header>.<page table>)-1 168 169 offsets := page(k), 170 for k in 0 .. s(<header>.<page table>)-1 171 172 page(k) := i(<header>.<page table>[k].<offsets>[0]) \ 173 i( <header>.<page table>[k].<offsets>[l] \ 174 - <header>.<page table>[k].<offsets>[l - 1]), 175 for l in 1 .. s(<header>.<page table>[k].<entry count>)-1 176 177 u(x) ... unsigned int x in 7b/8b encoding 178 i(x) ... signed int x in 7b/8b encoding 179 s(x) ... number of entries in array x 180 181 182Proto index file format 183----------------------- 184 185The index will be created from a "revision-less" proto index file 186containing (<offset><item_index>) pairs only. 187 188All mappings belonging to the same revision must be written in one go but 189there is no restriction on the order of those entries. To signify that 190a new revision begins, a (0, 0) mapping must be written. A (0, 0) entry 191at the beginning of the file is optional and will be ignored. 192 193 <bof> /* begin of proto index file for revision r and following */ 194 (0, 0) /* mark start of revision r, optional for first rev */ 195 (off, item)* /* zero or more mappings in random order */ 196 (0, 0) /* mark start of revision r + 1 */ 197 (off, item)* /* zero or more mappings in random order */ 198 (0, 0) /* mark start of revision r + 2 */ 199 (off, item)* /* zero or more mappings in random order */ 200 ... 201 <eof> /* end of file. */ 202 203All entries are pairs of 64 bit unsigned integers in little endian order. 204 205 206Phys-to-log index 207================= 208 209This index has to map offset -> (rev, item_index, type, len, checksum). 210 211 212Index data model 213---------------- 214 215hierarchy: 216 217 header -> page -> item info 218 219 Logically, the index splits up index rev / pack file into pages of a 220 fixed size. That page size may differ from the FS's block size. The 221 index will describe each rev / pack file page with one index page. 222 223 page = header->pages[offset % page_size]; 224 item info = binary_search(page.data, offset) 225 226 Note that the current implementation will not return any data if the 227 offset is does not match any actual item start. To simplify the lookup, 228 the last index page will have an "unused item" entry for the section 229 behind EOF. Holes aren't allowed as well, i.e. every byte of the rev / 230 pack is expected to be covered by the index. 231 232 Also, there may be items stretching across page borders or even over 233 multiple pages. The data model solves this issue by storing the item 234 descriptions as a "primary array" and then representing the pages as 235 ranges within that array. Thus multiple pages may reference the same 236 item description. 237 238header: 239 240 <first revision> ... first revision covered by this index 241 <file size> ... size of the rev / pack file in bytes 242 <page size> ... number of bytes in the rev / pack file covered by 243 each index page 244 <page count> ... number of pages 245 <offsets> ... array of page offsets, i.e. locations the page 246 data within the index. 247 This array has <page count> + 1 entries. 248 249page: 250 251 <entries> ... array of item descriptions, ordered by offset. 252 First and last entry may cross page boundaries. 253 254entry: 255 256 <offset> ... absolute position in the pack / rev file 257 <size> ... on-disk size of the item in the pack / rev file 258 <type> ... item type 259 <FNV checksum> ... modified 32 bit FNV-1a checksum of that section 260 of the pack / rev file (see below). 0 for empty 261 or zero-length items 262 <revision> ... revision that this item belongs to 263 <item_index> ... item_index within that revision 264 265 266Index on-disk format 267-------------------- 268 269 index := "P2L-INDEX\n" header pages items 270 271 header := u(<header>.<first revision>) \ 272 u(<header>.<file size>) \ 273 u(<header>.<page size>) \ 274 u(<header>.<page count>) 275 276 pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]), 277 for k in 0 .. <header>.<page count> -1 278 279 items := u(<items in page k>[0].<offset>) \ 280 u(<items in page k>[l].<size>) \ 281 i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \ 282 i( <items in page k>[l].<revision> 283 - <items in page k>[l-1].<revision>), \ 284 u(FNV checksum) 285 for l in 0 .. s(<items in page k>)-1, 286 for k in 0 .. <header>.<page count>-1 287 288 u(x) ... unsigned int x in 7b/8b encoding 289 i(x) ... signed int x in 7b/8b encoding 290 s(x) ... number of entries in collection x 291 c(x) ... compound function := x.<item_index> * 8 + x.<type> 292 293 Access to negative indexes gives a 0 value. 294 295 <Items in page k> are in strict ascending offset order. Items that 296 started after the begin of a given page and overlap with the next page 297 will not be stored in the start page. The runtime representation will 298 duplicate items overlapping page boundaries; the on-disk representation 299 will not. Thus, pages inside large items will have zero entries on disk. 300 301 302Proto index file format 303----------------------- 304 305The index will be created from a proto index file containing simple 306instances of svn_fs_fs__p2l_entry_t with the following element order: 307 308 item offset as uint64 309 item size as uint64 310 item type as uint64 311 modified FNV1a checksum as uint64 312 revision as uint64, with SVN_INVALID_REVNUM mapped to 0 313 and revisions >= 0 stored as rev+1 314 item index as uint64 315 316All values are stored in little endian order. 317 318Page table and header information, except start revision and page size, 319can easily be derived from that information. 320 321All entries must be written in strict offset order. Overlapping entries 322are not allowed; zero-length items are. 323 324In transactions, the final revision number may not be known when writing 325the proto index file (e.g. while still writing the proto rev file). Items 326with revision set to SVN_INVALID_REVNUM will therefore be automatically 327updated when creating the final index. This is possible in conjunction 328with rev files but not for pack files. 329 330 331FNV checksum 332------------ 333 334FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/ 335For performance reasons we use a modified version: 336 337* split the input byte stream [b0 .. bN] into 4 sub-streams of equal 338 length and up to 3 remnants: 339 340 [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant] 341 342* calculate 32 bit FNV-1a checksums for the 4 substreams: 343 344 h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..]) 345 346* concatenate the big endian representation of these checksums (4 bytes 347 each) plus the remnant of the original stream into a 16 to 19 byte long 348 intermediate: 349 350 [i0 .. iK] = [big-endian(h0) ... big-endian(h3) remnant ], 16 <= K+1 <= 19 351 352* fold the variable-length intermediate into a compact 32 bit checksum: 353 354 FNV checksum = fnv_1a([i0 .. iK]) 355