3.t (45319) | 3.t (108533) |
---|---|
1.\" Copyright (c) 1986, 1993 2.\" The Regents of the University of California. All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. --- 17 unchanged lines hidden (view full) --- 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" @(#)3.t 8.1 (Berkeley) 6/8/93 33.\" | 1.\" Copyright (c) 1986, 1993 2.\" The Regents of the University of California. All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. --- 17 unchanged lines hidden (view full) --- 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" @(#)3.t 8.1 (Berkeley) 6/8/93 33.\" |
34.\" $FreeBSD: head/share/doc/smm/05.fastfs/3.t 108533 2003-01-01 18:49:04Z schweikh $ 35.\" |
|
34.ds RH New file system 35.NH 36New file system organization 37.PP 38In the new file system organization (as in the 39old file system organization), 40each disk drive contains one or more file systems. 41A file system is described by its super-block, --- 8 unchanged lines hidden (view full) --- 50.PP 51To insure that it is possible to create files as large as 52.if n 2 ** 32 53.if t $2 sup 32$ 54bytes with only two levels of indirection, 55the minimum size of a file system block is 4096 bytes. 56The size of file system blocks can be any power of two 57greater than or equal to 4096. | 36.ds RH New file system 37.NH 38New file system organization 39.PP 40In the new file system organization (as in the 41old file system organization), 42each disk drive contains one or more file systems. 43A file system is described by its super-block, --- 8 unchanged lines hidden (view full) --- 52.PP 53To insure that it is possible to create files as large as 54.if n 2 ** 32 55.if t $2 sup 32$ 56bytes with only two levels of indirection, 57the minimum size of a file system block is 4096 bytes. 58The size of file system blocks can be any power of two 59greater than or equal to 4096. |
58The block size of a file system is recorded in the | 60The block size of a file system is recorded in the |
59file system's super-block 60so it is possible for file systems with different block sizes 61to be simultaneously accessible on the same system. 62The block size must be decided at the time that 63the file system is created; 64it cannot be subsequently changed without rebuilding the file system. 65.PP 66The new file system organization divides a disk partition --- 8 unchanged lines hidden (view full) --- 75and summary information describing the usage of data blocks 76within the cylinder group. 77The bit map of available blocks in the cylinder group replaces 78the traditional file system's free list. 79For each cylinder group a static number of inodes 80is allocated at file system creation time. 81The default policy is to allocate one inode for each 2048 82bytes of space in the cylinder group, expecting this | 61file system's super-block 62so it is possible for file systems with different block sizes 63to be simultaneously accessible on the same system. 64The block size must be decided at the time that 65the file system is created; 66it cannot be subsequently changed without rebuilding the file system. 67.PP 68The new file system organization divides a disk partition --- 8 unchanged lines hidden (view full) --- 77and summary information describing the usage of data blocks 78within the cylinder group. 79The bit map of available blocks in the cylinder group replaces 80the traditional file system's free list. 81For each cylinder group a static number of inodes 82is allocated at file system creation time. 83The default policy is to allocate one inode for each 2048 84bytes of space in the cylinder group, expecting this |
83to be far more than will ever be needed. | 85to be far more than will ever be needed. |
84.PP 85All the cylinder group bookkeeping information could be 86placed at the beginning of each cylinder group. 87However if this approach were used, 88all the redundant information would be on the top platter. 89A single hardware failure that destroyed the top platter 90could cause the loss of all redundant copies of the super-block. 91Thus the cylinder group bookkeeping information --- 12 unchanged lines hidden (view full) --- 104\(dg While it appears that the first cylinder group could be laid 105out with its super-block at the ``known'' location, 106this would not work for file systems 107with blocks sizes of 16 kilobytes or greater. 108This is because of a requirement that the first 8 kilobytes of the disk 109be reserved for a bootstrap program and a separate requirement that 110the cylinder group information begin on a file system block boundary. 111To start the cylinder group on a file system block boundary, | 86.PP 87All the cylinder group bookkeeping information could be 88placed at the beginning of each cylinder group. 89However if this approach were used, 90all the redundant information would be on the top platter. 91A single hardware failure that destroyed the top platter 92could cause the loss of all redundant copies of the super-block. 93Thus the cylinder group bookkeeping information --- 12 unchanged lines hidden (view full) --- 106\(dg While it appears that the first cylinder group could be laid 107out with its super-block at the ``known'' location, 108this would not work for file systems 109with blocks sizes of 16 kilobytes or greater. 110This is because of a requirement that the first 8 kilobytes of the disk 111be reserved for a bootstrap program and a separate requirement that 112the cylinder group information begin on a file system block boundary. 113To start the cylinder group on a file system block boundary, |
112file systems with block sizes larger than 8 kilobytes | 114file systems with block sizes larger than 8 kilobytes |
113would have to leave an empty space between the end of 114the boot block and the beginning of the cylinder group. 115Without knowing the size of the file system blocks, 116the system would not know what roundup function to use 117to find the beginning of the first cylinder group. 118.FE 119.NH 2 120Optimizing storage utilization --- 5 unchanged lines hidden (view full) --- 126In the old file system this file would be composed of 1024 byte blocks. 127By increasing the block size, disk accesses in the new file 128system may transfer up to four times as much information per 129disk transaction. 130In large files, several 1314096 byte blocks may be allocated from the same cylinder so that 132even larger data transfers are possible before requiring a seek. 133.PP | 115would have to leave an empty space between the end of 116the boot block and the beginning of the cylinder group. 117Without knowing the size of the file system blocks, 118the system would not know what roundup function to use 119to find the beginning of the first cylinder group. 120.FE 121.NH 2 122Optimizing storage utilization --- 5 unchanged lines hidden (view full) --- 128In the old file system this file would be composed of 1024 byte blocks. 129By increasing the block size, disk accesses in the new file 130system may transfer up to four times as much information per 131disk transaction. 132In large files, several 1334096 byte blocks may be allocated from the same cylinder so that 134even larger data transfers are possible before requiring a seek. 135.PP |
134The main problem with | 136The main problem with |
135larger blocks is that most UNIX 136file systems are composed of many small files. 137A uniformly large block size wastes space. 138Table 1 shows the effect of file system 139block size on the amount of wasted space in the file system. 140The files measured to obtain these figures reside on 141one of our time sharing 142systems that has roughly 1.2 gigabytes of on-line storage. --- 49 unchanged lines hidden (view full) --- 192Fragment numbers 0-3 4-7 8-11 12-15 193Block numbers 0 1 2 3 194.TE 195Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system. 196.DE 197.KE 198Each bit in the map records the status of a fragment; 199an ``X'' shows that the fragment is in use, | 137larger blocks is that most UNIX 138file systems are composed of many small files. 139A uniformly large block size wastes space. 140Table 1 shows the effect of file system 141block size on the amount of wasted space in the file system. 142The files measured to obtain these figures reside on 143one of our time sharing 144systems that has roughly 1.2 gigabytes of on-line storage. --- 49 unchanged lines hidden (view full) --- 194Fragment numbers 0-3 4-7 8-11 12-15 195Block numbers 0 1 2 3 196.TE 197Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system. 198.DE 199.KE 200Each bit in the map records the status of a fragment; 201an ``X'' shows that the fragment is in use, |
200while a ``O'' shows that the fragment is available for allocation. | 202while an ``O'' shows that the fragment is available for allocation. |
201In this example, 202fragments 0\-5, 10, and 11 are in use, 203while fragments 6\-9, and 12\-15 are free. 204Fragments of adjoining blocks cannot be used as a full block, 205even if they are large enough. 206In this example, 207fragments 6\-9 cannot be allocated as a full block; 208only fragments 12\-15 can be coalesced into a full block. --- 42 unchanged lines hidden (view full) --- 251This process is repeated until less than a full block 252of new data remains. 253If the remaining new data to be written will 254fit in less than a full block, 255a block with the necessary fragments is located, 256otherwise a full block is located. 257The remaining new data is written into the located space. 258.IP 3) | 203In this example, 204fragments 0\-5, 10, and 11 are in use, 205while fragments 6\-9, and 12\-15 are free. 206Fragments of adjoining blocks cannot be used as a full block, 207even if they are large enough. 208In this example, 209fragments 6\-9 cannot be allocated as a full block; 210only fragments 12\-15 can be coalesced into a full block. --- 42 unchanged lines hidden (view full) --- 253This process is repeated until less than a full block 254of new data remains. 255If the remaining new data to be written will 256fit in less than a full block, 257a block with the necessary fragments is located, 258otherwise a full block is located. 259The remaining new data is written into the located space. 260.IP 3) |
259The file contains one or more fragments (and the | 261The file contains one or more fragments (and the |
260fragments contain insufficient space to hold the new data). 261If the size of the new data plus the size of the data 262already in the fragments exceeds the size of a full block, 263a new block is allocated. 264The contents of the fragments are copied 265to the beginning of the block 266and the remainder of the block is filled with new data. 267The process then continues as in (2) above. 268Otherwise, if the new data to be written will 269fit in less than a full block, 270a block with the necessary fragments is located, 271otherwise a full block is located. 272The contents of the existing fragments 273appended with the new data 274are written into the allocated space. 275.PP 276The problem with expanding a file one fragment at a | 262fragments contain insufficient space to hold the new data). 263If the size of the new data plus the size of the data 264already in the fragments exceeds the size of a full block, 265a new block is allocated. 266The contents of the fragments are copied 267to the beginning of the block 268and the remainder of the block is filled with new data. 269The process then continues as in (2) above. 270Otherwise, if the new data to be written will 271fit in less than a full block, 272a block with the necessary fragments is located, 273otherwise a full block is located. 274The contents of the existing fragments 275appended with the new data 276are written into the allocated space. 277.PP 278The problem with expanding a file one fragment at a |
277a time is that data may be copied many times as a | 279a time is that data may be copied many times as a |
278fragmented block expands to a full block. 279Fragment reallocation can be minimized 280if the user program writes a full block at a time, 281except for a partial block at the end of the file. 282Since file systems with different block sizes may reside on 283the same system, 284the file system interface has been extended to provide 285application programs the optimal size for a read or write. --- 54 unchanged lines hidden (view full) --- 340in Table 1. 341Thus, the percentage of waste in 342an old 1024 byte UNIX file system is roughly 343comparable to a new 4096/512 byte file system 344with the free space reserve set at 5%. 345(Compare 11.8% wasted with the old file system 346to 6.9% waste + 5% reserved space in the 347new file system.) | 280fragmented block expands to a full block. 281Fragment reallocation can be minimized 282if the user program writes a full block at a time, 283except for a partial block at the end of the file. 284Since file systems with different block sizes may reside on 285the same system, 286the file system interface has been extended to provide 287application programs the optimal size for a read or write. --- 54 unchanged lines hidden (view full) --- 342in Table 1. 343Thus, the percentage of waste in 344an old 1024 byte UNIX file system is roughly 345comparable to a new 4096/512 byte file system 346with the free space reserve set at 5%. 347(Compare 11.8% wasted with the old file system 348to 6.9% waste + 5% reserved space in the 349new file system.) |
348.NH 2 | 350.NH 2 |
349File system parameterization 350.PP 351Except for the initial creation of the free list, 352the old file system ignores the parameters of the underlying hardware. 353It has no information about either the physical characteristics 354of the mass storage device, 355or the hardware that interacts with it. | 351File system parameterization 352.PP 353Except for the initial creation of the free list, 354the old file system ignores the parameters of the underlying hardware. 355It has no information about either the physical characteristics 356of the mass storage device, 357or the hardware that interacts with it. |
356A goal of the new file system is to parameterize the | 358A goal of the new file system is to parameterize the |
357processor capabilities and 358mass storage characteristics 359so that blocks can be allocated in an | 359processor capabilities and 360mass storage characteristics 361so that blocks can be allocated in an |
360optimum configuration-dependent way. | 362optimum configuration-dependent way. |
361Parameters used include the speed of the processor, 362the hardware support for mass storage transfers, 363and the characteristics of the mass storage devices. 364Disk technology is constantly improving and 365a given installation can have several different disk technologies 366running on a single processor. 367Each file system is parameterized so that it can be 368adapted to the characteristics of the disk on which 369it is placed. 370.PP 371For mass storage devices such as disks, 372the new file system tries to allocate new blocks | 363Parameters used include the speed of the processor, 364the hardware support for mass storage transfers, 365and the characteristics of the mass storage devices. 366Disk technology is constantly improving and 367a given installation can have several different disk technologies 368running on a single processor. 369Each file system is parameterized so that it can be 370adapted to the characteristics of the disk on which 371it is placed. 372.PP 373For mass storage devices such as disks, 374the new file system tries to allocate new blocks |
373on the same cylinder as the previous block in the same file. 374Optimally, these new blocks will also be | 375on the same cylinder as the previous block in the same file. 376Optimally, these new blocks will also be |
375rotationally well positioned. 376The distance between ``rotationally optimal'' blocks varies greatly; 377it can be a consecutive block 378or a rotationally delayed block 379depending on system characteristics. 380On a processor with an input/output channel that does not require 381any processor intervention between mass storage transfer requests, 382two consecutive disk blocks can often be accessed --- 51 unchanged lines hidden (view full) --- 434can be changed at any time, 435even when the file system is mounted and active. 436If a file system is parameterized to lay out blocks with 437a rotational separation of 2 milliseconds, 438and the disk pack is then moved to a system that has a 439processor requiring 4 milliseconds to schedule a disk operation, 440the throughput will drop precipitously because of lost disk revolutions 441on nearly every block. | 377rotationally well positioned. 378The distance between ``rotationally optimal'' blocks varies greatly; 379it can be a consecutive block 380or a rotationally delayed block 381depending on system characteristics. 382On a processor with an input/output channel that does not require 383any processor intervention between mass storage transfer requests, 384two consecutive disk blocks can often be accessed --- 51 unchanged lines hidden (view full) --- 436can be changed at any time, 437even when the file system is mounted and active. 438If a file system is parameterized to lay out blocks with 439a rotational separation of 2 milliseconds, 440and the disk pack is then moved to a system that has a 441processor requiring 4 milliseconds to schedule a disk operation, 442the throughput will drop precipitously because of lost disk revolutions 443on nearly every block. |
442If the eventual target machine is known, | 444If the eventual target machine is known, |
443the file system can be parameterized for it 444even though it is initially created on a different processor. 445Even if the move is not known in advance, 446the rotational layout delay can be reconfigured after the disk is moved 447so that all further allocation is done based on the 448characteristics of the new host. 449.NH 2 450Layout policies --- 8 unchanged lines hidden (view full) --- 459and decide when to force a long seek to a new cylinder group 460because there are insufficient blocks left 461in the current cylinder group to do reasonable layouts. 462Below the global policy routines are 463the local allocation routines that use a locally optimal scheme to 464lay out data blocks. 465.PP 466Two methods for improving file system performance are to increase | 445the file system can be parameterized for it 446even though it is initially created on a different processor. 447Even if the move is not known in advance, 448the rotational layout delay can be reconfigured after the disk is moved 449so that all further allocation is done based on the 450characteristics of the new host. 451.NH 2 452Layout policies --- 8 unchanged lines hidden (view full) --- 461and decide when to force a long seek to a new cylinder group 462because there are insufficient blocks left 463in the current cylinder group to do reasonable layouts. 464Below the global policy routines are 465the local allocation routines that use a locally optimal scheme to 466lay out data blocks. 467.PP 468Two methods for improving file system performance are to increase |
467the locality of reference to minimize seek latency | 469the locality of reference to minimize seek latency |
468as described by [Trivedi80], and 469to improve the layout of data to make larger transfers possible 470as described by [Nevalainen77]. 471The global layout policies try to improve performance 472by clustering related information. 473They cannot attempt to localize all data references, 474but must also try to spread unrelated data 475among different cylinder groups. --- 5 unchanged lines hidden (view full) --- 481resembling the old file system. 482The global policies try to balance the two conflicting 483goals of localizing data that is concurrently accessed 484while spreading out unrelated data. 485.PP 486One allocatable resource is inodes. 487Inodes are used to describe both files and directories. 488Inodes of files in the same directory are frequently accessed together. | 470as described by [Trivedi80], and 471to improve the layout of data to make larger transfers possible 472as described by [Nevalainen77]. 473The global layout policies try to improve performance 474by clustering related information. 475They cannot attempt to localize all data references, 476but must also try to spread unrelated data 477among different cylinder groups. --- 5 unchanged lines hidden (view full) --- 483resembling the old file system. 484The global policies try to balance the two conflicting 485goals of localizing data that is concurrently accessed 486while spreading out unrelated data. 487.PP 488One allocatable resource is inodes. 489Inodes are used to describe both files and directories. 490Inodes of files in the same directory are frequently accessed together. |
489For example, the ``list directory'' command often accesses | 491For example, the ``list directory'' command often accesses |
490the inode for each file in a directory. 491The layout policy tries to place all the inodes of 492files in a directory in the same cylinder group. 493To ensure that files are distributed throughout the disk, 494a different policy is used for directory allocation. 495A new directory is placed in a cylinder group that has a greater 496than average number of free inodes, 497and the smallest number of directories already in it. --- 44 unchanged lines hidden (view full) --- 542.FE 543The newly chosen cylinder group is selected from those cylinder 544groups that have a greater than average number of free blocks left. 545Although big files tend to be spread out over the disk, 546a megabyte of data is typically accessible before 547a long seek must be performed, 548and the cost of one long seek per megabyte is small. 549.PP | 492the inode for each file in a directory. 493The layout policy tries to place all the inodes of 494files in a directory in the same cylinder group. 495To ensure that files are distributed throughout the disk, 496a different policy is used for directory allocation. 497A new directory is placed in a cylinder group that has a greater 498than average number of free inodes, 499and the smallest number of directories already in it. --- 44 unchanged lines hidden (view full) --- 544.FE 545The newly chosen cylinder group is selected from those cylinder 546groups that have a greater than average number of free blocks left. 547Although big files tend to be spread out over the disk, 548a megabyte of data is typically accessible before 549a long seek must be performed, 550and the cost of one long seek per megabyte is small. 551.PP |
550The global policy routines call local allocation routines with | 552The global policy routines call local allocation routines with |
551requests for specific blocks. 552The local allocation routines will | 553requests for specific blocks. 554The local allocation routines will |
553always allocate the requested block | 555always allocate the requested block |
554if it is free, otherwise it 555allocates a free block of the requested size that is 556rotationally closest to the requested block. 557If the global layout policies had complete information, 558they could always request unused blocks and 559the allocation routines would be reduced to simple bookkeeping. 560However, maintaining complete information is costly; | 556if it is free, otherwise it 557allocates a free block of the requested size that is 558rotationally closest to the requested block. 559If the global layout policies had complete information, 560they could always request unused blocks and 561the allocation routines would be reduced to simple bookkeeping. 562However, maintaining complete information is costly; |
561thus the implementation of the global layout policy | 563thus the implementation of the global layout policy |
562uses heuristics that employ only partial information. 563.PP 564If a requested block is not available, the local allocator uses 565a four level allocation strategy: 566.IP 1) 567Use the next available block rotationally closest 568to the requested block on the same cylinder. It is assumed | 564uses heuristics that employ only partial information. 565.PP 566If a requested block is not available, the local allocator uses 567a four level allocation strategy: 568.IP 1) 569Use the next available block rotationally closest 570to the requested block on the same cylinder. It is assumed |
569here that head switching time is zero. On disk | 571here that head switching time is zero. On disk |
570controllers where this is not the case, it may be possible 571to incorporate the time required to switch between disk platters 572when constructing the rotational layout tables. This, however, 573has not yet been tried. 574.IP 2) 575If there are no blocks available on the same cylinder, 576use a block within the same cylinder group. 577.IP 3) | 572controllers where this is not the case, it may be possible 573to incorporate the time required to switch between disk platters 574when constructing the rotational layout tables. This, however, 575has not yet been tried. 576.IP 2) 577If there are no blocks available on the same cylinder, 578use a block within the same cylinder group. 579.IP 3) |
578If that cylinder group is entirely full, | 580If that cylinder group is entirely full, |
579quadratically hash the cylinder group number to choose 580another cylinder group to look for a free block. 581.IP 4) 582Finally if the hash fails, apply an exhaustive search 583to all cylinder groups. 584.PP 585Quadratic hash is used because of its speed in finding 586unused slots in nearly full hash tables [Knuth75]. 587File systems that are parameterized to maintain at least 58810% free space rarely use this strategy. 589File systems that are run without maintaining any free 590space typically have so few free blocks that almost any 591allocation is random; 592the most important characteristic of 593the strategy used under such conditions is that the strategy be fast. 594.ds RH Performance 595.sp 2 596.ne 1i | 581quadratically hash the cylinder group number to choose 582another cylinder group to look for a free block. 583.IP 4) 584Finally if the hash fails, apply an exhaustive search 585to all cylinder groups. 586.PP 587Quadratic hash is used because of its speed in finding 588unused slots in nearly full hash tables [Knuth75]. 589File systems that are parameterized to maintain at least 59010% free space rarely use this strategy. 591File systems that are run without maintaining any free 592space typically have so few free blocks that almost any 593allocation is random; 594the most important characteristic of 595the strategy used under such conditions is that the strategy be fast. 596.ds RH Performance 597.sp 2 598.ne 1i |