.\" Copyright (c) 1986, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)3.t	8.1 (Berkeley) 6/8/93
.\"
.\" $FreeBSD: head/share/doc/smm/05.fastfs/3.t 108533 2003-01-01 18:49:04Z schweikh $
.\"
.ds RH New file system
.NH
New file system organization
.PP
In the new file system organization (as in the
old file system organization),
each disk drive contains one or more file systems.
A file system is described by its super-block,
located at the beginning of the file system's disk partition.
Because the super-block contains critical data,
it is replicated to protect against catastrophic loss.
This is done when the file system is created;
since the super-block data does not change,
the copies need not be referenced unless a head crash
or other hard disk error causes the default super-block
to be unusable.
.PP
To ensure that it is possible to create files as large as
.if n 2 ** 32
.if t $2 sup 32$
bytes with only two levels of indirection,
the minimum size of a file system block is 4096 bytes.
The size of file system blocks can be any power of two
greater than or equal to 4096.
The block size of a file system is recorded in the
file system's super-block
so it is possible for file systems with different block sizes
to be simultaneously accessible on the same system.
The block size must be decided at the time that
the file system is created;
it cannot be subsequently changed without rebuilding the file system.
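.PP
The arithmetic behind the 4096 byte minimum can be sketched as follows;
the 4-byte disk address is an assumption about the implementation
rather than a figure stated above.
.DS
/*
 * Illustrative only: with 4-byte disk addresses, a 4096 byte block
 * holds 1024 addresses, so the double indirect block alone reaches
 * 1024 * 1024 * 4096 = 2^32 bytes.
 */
#define MINBSIZE  4096
#define NINDIR    (MINBSIZE / 4)    /* disk addresses per indirect block */

static const unsigned long long double_indirect_bytes =
    (unsigned long long)NINDIR * NINDIR * MINBSIZE;    /* 4294967296 */
.DE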
.PP
The new file system organization divides a disk partition
into one or more areas called
.I "cylinder groups".
A cylinder group is composed of one or more consecutive
cylinders on a disk.
Associated with each cylinder group is some bookkeeping information
that includes a redundant copy of the super-block,
space for inodes,
a bit map describing available blocks in the cylinder group,
and summary information describing the usage of data blocks
within the cylinder group.
The bit map of available blocks in the cylinder group replaces
the traditional file system's free list.
For each cylinder group a static number of inodes
is allocated at file system creation time.
The default policy is to allocate one inode for each 2048
bytes of space in the cylinder group, expecting this
to be far more than will ever be needed.
.PP
All the cylinder group bookkeeping information could be
placed at the beginning of each cylinder group.
However, if this approach were used,
all the redundant information would be on the top platter.
A single hardware failure that destroyed the top platter
could cause the loss of all redundant copies of the super-block.
Thus the cylinder group bookkeeping information
begins at a varying offset from the beginning of the cylinder group.
The offset for each successive cylinder group is calculated to be
about one track further from the beginning of the cylinder group
than the preceding cylinder group.
In this way the redundant
information spirals down into the pack so that any single track, cylinder,
or platter can be lost without losing all copies of the super-block.
Except for the first cylinder group,
the space between the beginning of the cylinder group
and the beginning of the cylinder group information
is used for data blocks.\(dg
.FS
\(dg While it appears that the first cylinder group could be laid
out with its super-block at the ``known'' location,
this would not work for file systems
with block sizes of 16 kilobytes or greater.
This is because of a requirement that the first 8 kilobytes of the disk
be reserved for a bootstrap program and a separate requirement that
the cylinder group information begin on a file system block boundary.
To start the cylinder group on a file system block boundary,
file systems with block sizes larger than 8 kilobytes
would have to leave an empty space between the end of
the boot block and the beginning of the cylinder group.
Without knowing the size of the file system blocks,
the system would not know what roundup function to use
to find the beginning of the first cylinder group.
.FE
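.PP
The varying offset can be illustrated with a small calculation;
the function and parameter names below are invented for illustration
and do not correspond to the actual layout code.
.DS
/*
 * Hypothetical sketch of the spiral: each cylinder group places its
 * bookkeeping information roughly one track further in than its
 * predecessor, wrapping around within the cylinder group.
 */
long
cg_bookkeeping_offset(int cg, long sectors_per_track, long sectors_per_cg)
{
    return (((long)cg * sectors_per_track) % sectors_per_cg);
}
.DE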
.NH 2
Optimizing storage utilization
.PP
Data is laid out so that larger blocks can be transferred
in a single disk transaction, greatly increasing file system throughput.
As an example, consider a file in the new file system
composed of 4096 byte data blocks.
In the old file system this file would be composed of 1024 byte blocks.
By increasing the block size, disk accesses in the new file
system may transfer up to four times as much information per
disk transaction.
In large files, several
4096 byte blocks may be allocated from the same cylinder so that
even larger data transfers are possible before requiring a seek.
.PP
The main problem with
larger blocks is that most UNIX
file systems are composed of many small files.
A uniformly large block size wastes space.
Table 1 shows the effect of file system
block size on the amount of wasted space in the file system.
The files measured to obtain these figures reside on
one of our time sharing
systems that has roughly 1.2 gigabytes of on-line storage.
The measurements are based on the active user file systems containing
about 920 megabytes of formatted space.
.KF
.DS B
.TS
box;
l|l|l
a|n|l.
Space used	% waste	Organization
_
775.2 Mb	0.0	Data only, no separation between files
807.8 Mb	4.2	Data only, each file starts on 512 byte boundary
828.7 Mb	6.9	Data + inodes, 512 byte block UNIX file system
866.5 Mb	11.8	Data + inodes, 1024 byte block UNIX file system
948.5 Mb	22.4	Data + inodes, 2048 byte block UNIX file system
1128.3 Mb	45.6	Data + inodes, 4096 byte block UNIX file system
.TE
Table 1 \- Amount of wasted space as a function of block size.
.DE
.KE
The space wasted is calculated to be the percentage of space
on the disk not containing user data.
As the block size on the disk
increases, the waste rises quickly, to an intolerable
45.6% waste with 4096 byte file system blocks.
.PP
To be able to use large blocks without undue waste,
small files must be stored in a more efficient way.
The new file system accomplishes this goal by allowing the division
of a single file system block into one or more
.I "fragments".
The file system fragment size is specified
at the time that the file system is created;
each file system block can optionally be broken into
2, 4, or 8 fragments, each of which is addressable.
The lower bound on the size of these fragments is constrained
by the disk sector size,
typically 512 bytes.
The block map associated with each cylinder group
records the space available in a cylinder group
at the fragment level;
to determine if a block is available, aligned fragments are examined.
Figure 1 shows a piece of a map from a 4096/1024 file system.
.KF
.DS B
.TS
box;
l|c c c c.
Bits in map	XXXX	XXOO	OOXX	OOOO
Fragment numbers	0-3	4-7	8-11	12-15
Block numbers	0	1	2	3
.TE
Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system.
.DE
.KE
Each bit in the map records the status of a fragment;
an ``X'' shows that the fragment is in use,
while an ``O'' shows that the fragment is available for allocation.
In this example,
fragments 0\-5, 10, and 11 are in use,
while fragments 6\-9 and 12\-15 are free.
Fragments of adjoining blocks cannot be used as a full block,
even if they are large enough.
In this example,
fragments 6\-9 cannot be allocated as a full block;
only fragments 12\-15 can be coalesced into a full block.
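.PP
A minimal sketch of the aligned-fragment test follows;
the map layout (one bit per fragment, set when the fragment is free)
and the function name are assumptions made for illustration only.
.DS
/*
 * A block is allocatable only if every fragment in its aligned
 * group is free.
 */
int
block_is_free(unsigned char *fragmap, int frag_per_block, int blockno)
{
    int frag, first = blockno * frag_per_block;

    for (frag = first; frag < first + frag_per_block; frag++)
        if ((fragmap[frag / 8] & (1 << (frag % 8))) == 0)
            return (0);    /* a fragment of this block is in use */
    return (1);
}
.DE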
.PP
On a file system with a block size of 4096 bytes
and a fragment size of 1024 bytes,
a file is represented by zero or more 4096 byte blocks of data,
and possibly a single fragmented block.
If a file system block must be fragmented to obtain
space for a small amount of data,
the remaining fragments of the block are made
available for allocation to other files.
As an example, consider an 11000 byte file stored on
a 4096/1024 byte file system.
This file would use two full size blocks and one
three-fragment portion of another block.
If no block with three aligned fragments is
available at the time the file is created,
a full size block is split, yielding the necessary
fragments and a single unused fragment.
This remaining fragment can be allocated to another file as needed.
.PP
Space is allocated to a file when a program does a \fIwrite\fP
system call.
Each time data is written to a file, the system checks to see if
the size of the file has increased*.
.FS
* A program may be overwriting data in the middle of an existing file,
in which case space would already have been allocated.
.FE
If the file needs to be expanded to hold the new data,
one of three conditions exists (a sketch of this logic follows the list):
.IP 1)
There is enough space left in an already allocated
block or fragment to hold the new data.
The new data is written into the available space.
.IP 2)
The file contains no fragmented blocks (and the last
block in the file
contains insufficient space to hold the new data).
If space exists in a block already allocated,
the space is filled with new data.
If the remainder of the new data contains more than
a full block of data, a full block is allocated and
the first full block of new data is written there.
This process is repeated until less than a full block
of new data remains.
If the remaining new data to be written will
fit in less than a full block,
a block with the necessary fragments is located;
otherwise a full block is located.
The remaining new data is written into the located space.
.IP 3)
The file contains one or more fragments (and the
fragments contain insufficient space to hold the new data).
If the size of the new data plus the size of the data
already in the fragments exceeds the size of a full block,
a new block is allocated.
The contents of the fragments are copied
to the beginning of the block
and the remainder of the block is filled with new data.
The process then continues as in (2) above.
Otherwise, if the new data to be written will
fit in less than a full block,
a block with the necessary fragments is located;
otherwise a full block is located.
The contents of the existing fragments
appended with the new data
are written into the allocated space.
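.PP
The choice among the three cases can be summarized by the following
sketch; it is a simplification with invented names, not the actual
allocation code.
.DS
/*
 * avail:     bytes still free in the last allocated block or fragment
 * fragbytes: bytes of the file currently held in a fragmented block
 * newbytes:  bytes being appended by the write
 */
enum grow { USE_EXISTING, ALLOC_BLOCKS_THEN_FRAG, UPGRADE_FRAGMENTS };

enum grow
grow_file(long avail, long fragbytes, long newbytes)
{
    if (newbytes <= avail)
        return (USE_EXISTING);           /* case 1: fits in place */
    if (fragbytes == 0)
        return (ALLOC_BLOCKS_THEN_FRAG); /* case 2: full blocks, then a best-fit fragment */
    return (UPGRADE_FRAGMENTS);          /* case 3: copy fragments to a larger allocation */
}
.DE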
.PP
The problem with expanding a file one fragment at a
time is that data may be copied many times as a
fragmented block expands to a full block.
Fragment reallocation can be minimized
if the user program writes a full block at a time,
except for a partial block at the end of the file.
Since file systems with different block sizes may reside on
the same system,
the file system interface has been extended to provide
application programs the optimal size for a read or write.
For files the optimal size is the block size of the file system
on which the file is being accessed.
For other objects, such as pipes and sockets,
the optimal size is the underlying buffer size.
This feature is used by the Standard
Input/Output Library,
a package used by most user programs.
This feature is also used by
certain system utilities such as archivers and loaders
that do their own input and output management
and need the highest possible file system bandwidth.
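.PP
This extension is visible to applications as the optimal transfer
size reported by the \fIstat\fP system call;
the sketch below sizes an input/output buffer from it.
.DS
#include <sys/stat.h>
#include <stdlib.h>

/*
 * Allocate an I/O buffer of the optimal transfer size for the open
 * file "fd" (the file system block size for plain files).
 */
char *
alloc_io_buffer(int fd, size_t *sizep)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return (NULL);
    *sizep = (size_t)st.st_blksize;
    return (malloc(*sizep));
}
.DE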
.PP
The amount of wasted space in the 4096/1024 byte new file system
organization is empirically observed to be about the same as in the
1024 byte old file system organization.
A file system with 4096 byte blocks and 512 byte fragments
has about the same amount of wasted space as the 512 byte
block UNIX file system.
The new file system uses less space
than the 512 byte or 1024 byte
file systems for indexing information for
large files and the same amount of space
for small files.
These savings are offset by the need to use
more space for keeping track of available free blocks.
The net result is about the same disk utilization
when a new file system's fragment size
equals an old file system's block size.
.PP
In order for the layout policies to be effective,
a file system cannot be kept completely full.
For each file system there is a parameter, termed
the free space reserve, that
gives the minimum acceptable percentage of file system
blocks that should be free.
If the number of free blocks drops below this level,
only the system administrator can continue to allocate blocks.
The value of this parameter may be changed at any time,
even when the file system is mounted and active.
The transfer rates that appear in section 4 were measured on file
systems kept less than 90% full (a reserve of 10%).
If the number of free blocks falls to zero,
the file system throughput tends to be cut in half,
because of the inability of the file system to localize
blocks in a file.
If a file system's performance degrades because
of overfilling, it may be restored by removing
files until the amount of free space once again
reaches the minimum acceptable level.
Access rates for files created during periods of little
free space may be restored by moving their data once enough
space is available.
The free space reserve must be added to the
percentage of waste when comparing the organizations given
in Table 1.
Thus, the percentage of waste in
an old 1024 byte UNIX file system is roughly
comparable to a new 4096/512 byte file system
with the free space reserve set at 5%.
(Compare 11.8% wasted with the old file system
to 6.9% waste + 5% reserved space in the
new file system.)
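.PP
A sketch of the reserve check follows; the parameter and function names
are invented for illustration and do not reflect the actual super-block layout.
.DS
/*
 * Only the super-user may allocate once the number of free blocks
 * falls to or below the reserve.
 */
int
may_allocate(long freeblocks, long totalblocks, int reserve_pct, int superuser)
{
    long reserve = totalblocks * reserve_pct / 100;

    if (freeblocks > reserve)
        return (1);
    return (superuser);
}
.DE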
.NH 2
File system parameterization
.PP
Except for the initial creation of the free list,
the old file system ignores the parameters of the underlying hardware.
It has no information about either the physical characteristics
of the mass storage device,
or the hardware that interacts with it.
A goal of the new file system is to parameterize the
processor capabilities and
mass storage characteristics
so that blocks can be allocated in an
optimum configuration-dependent way.
Parameters used include the speed of the processor,
the hardware support for mass storage transfers,
and the characteristics of the mass storage devices.
Disk technology is constantly improving and
a given installation can have several different disk technologies
running on a single processor.
Each file system is parameterized so that it can be
adapted to the characteristics of the disk on which
it is placed.
.PP
For mass storage devices such as disks,
the new file system tries to allocate new blocks
on the same cylinder as the previous block in the same file.
Optimally, these new blocks will also be
rotationally well positioned.
The distance between ``rotationally optimal'' blocks varies greatly;
it can be a consecutive block
or a rotationally delayed block
depending on system characteristics.
On a processor with an input/output channel that does not require
any processor intervention between mass storage transfer requests,
two consecutive disk blocks can often be accessed
without suffering lost time because of an intervening disk revolution.
For processors without input/output channels,
the main processor must field an interrupt and
prepare for a new disk transfer.
The expected time to service this interrupt and
schedule a new disk transfer depends on the
speed of the main processor.
.PP
The physical characteristics of each disk include
the number of blocks per track and the rate at which
the disk spins.
The allocation routines use this information to calculate
the number of milliseconds required to skip over a block.
The characteristics of the processor include
the expected time to service an interrupt and schedule a
new disk transfer.
Given a block allocated to a file,
the allocation routines calculate the number of blocks to
skip over so that the next block in the file will
come into position under the disk head in the expected
amount of time that it takes to start a new
disk transfer operation.
For programs that sequentially access large amounts of data,
this strategy minimizes the amount of time spent waiting for
the disk to position itself.
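.PP
The calculation can be sketched as follows; the names are invented,
and the allocator applies the result through its layout tables rather
than calling such a routine for every block.
.DS
#include <math.h>

/*
 * How many blocks to skip so that the next block rotates under the
 * head no sooner than the processor can start the next transfer.
 */
int
blocks_to_skip(int rpm, int blocks_per_track, double interrupt_ms)
{
    double ms_per_rev = 60000.0 / rpm;      /* about 16.7 ms at 3600 rpm */
    double ms_per_block = ms_per_rev / blocks_per_track;

    return ((int)ceil(interrupt_ms / ms_per_block));
}
.DE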
.PP
To ease the task of finding rotationally optimal blocks,
the cylinder group summary information includes
a count of the available blocks in a cylinder
group at different rotational positions.
Eight rotational positions are distinguished,
so the resolution of the
summary information is 2 milliseconds for a typical 3600
revolution per minute drive.
The super-block contains a vector of lists called
.I "rotational layout tables".
The vector is indexed by rotational position.
Each component of the vector
lists the index into the block map for every data block contained
in its rotational position.
When looking for an allocatable block,
the system first looks through the summary counts for a rotational
position with a non-zero block count.
It then uses the index of the rotational position to find the appropriate
list to use to index through
only the relevant parts of the block map to find a free block.
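.PP
The lookup can be sketched with a much simplified, hypothetical
structure; the actual super-block tables are laid out differently.
.DS
#define NRPOS   8       /* rotational positions distinguished */

struct rotlayout {
    int  count[NRPOS];  /* summary: free blocks at each position */
    int  nblk[NRPOS];   /* entries in each index list */
    int  *index[NRPOS]; /* block map indices at each position */
};

int
find_block(struct rotlayout *r, char *blkfree, int want)
{
    int p, i, pos;

    for (p = 0; p < NRPOS; p++) {
        pos = (want + p) % NRPOS;   /* start at the preferred position */
        if (r->count[pos] == 0)
            continue;               /* summary says nothing free here */
        for (i = 0; i < r->nblk[pos]; i++)
            if (blkfree[r->index[pos][i]])
                return (r->index[pos][i]);
    }
    return (-1);
}
.DE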
.PP
The parameter that defines the
minimum number of milliseconds between the completion of a data
transfer and the initiation of
another data transfer on the same cylinder
can be changed at any time,
even when the file system is mounted and active.
If a file system is parameterized to lay out blocks with
a rotational separation of 2 milliseconds,
and the disk pack is then moved to a system that has a
processor requiring 4 milliseconds to schedule a disk operation,
the throughput will drop precipitously because of lost disk revolutions
on nearly every block.
If the eventual target machine is known,
the file system can be parameterized for it
even though it is initially created on a different processor.
Even if the move is not known in advance,
the rotational layout delay can be reconfigured after the disk is moved
so that all further allocation is done based on the
characteristics of the new host.
.NH 2
Layout policies
.PP
The file system layout policies are divided into two distinct parts.
At the top level are global policies that use file system-wide
summary information to make decisions regarding
the placement of new inodes and data blocks.
These routines are responsible for deciding the
placement of new directories and files.
They also calculate rotationally optimal block layouts,
and decide when to force a long seek to a new cylinder group
because there are insufficient blocks left
in the current cylinder group to do reasonable layouts.
Below the global policy routines are
the local allocation routines that use a locally optimal scheme to
lay out data blocks.
.PP
Two methods for improving file system performance are to increase
the locality of reference to minimize seek latency
as described by [Trivedi80], and
to improve the layout of data to make larger transfers possible
as described by [Nevalainen77].
The global layout policies try to improve performance
by clustering related information.
They cannot attempt to localize all data references,
but must also try to spread unrelated data
among different cylinder groups.
If too much localization is attempted,
the local cylinder group may run out of space,
forcing the data to be scattered to non-local cylinder groups.
Taken to an extreme,
total localization can result in a single huge cluster of data
resembling the old file system.
The global policies try to balance the two conflicting
goals of localizing data that is concurrently accessed
while spreading out unrelated data.
.PP
One allocatable resource is inodes.
Inodes are used to describe both files and directories.
Inodes of files in the same directory are frequently accessed together.
For example, the ``list directory'' command often accesses
the inode for each file in a directory.
The layout policy tries to place all the inodes of
files in a directory in the same cylinder group.
To ensure that files are distributed throughout the disk,
a different policy is used for directory allocation.
A new directory is placed in a cylinder group that has a greater
than average number of free inodes,
and the smallest number of directories already in it.
The intent of this policy is to allow the inode clustering policy
to succeed most of the time.
The allocation of inodes within a cylinder group is done using a
next free strategy.
Although this allocates the inodes randomly within a cylinder group,
all the inodes for a particular cylinder group can be read with
8 to 16 disk transfers.
(At most 16 disk transfers are required because a cylinder
group may have no more than 2048 inodes.)
This puts a small and constant upper bound on the number of
disk transfers required to access the inodes
for all the files in a directory.
In contrast, the old file system typically requires
one disk transfer to fetch the inode for each file in a directory.
.PP
The other major resource is data blocks.
Since data blocks for a file are typically accessed together,
the policy routines try to place all data
blocks for a file in the same cylinder group,
preferably at rotationally optimal positions in the same cylinder.
The problem with allocating all the data blocks
in the same cylinder group is that large files will
quickly use up available space in the cylinder group,
forcing a spillover to other areas.
Further, using all the space in a cylinder group
causes future allocations for any file in the cylinder group
to also spill to other areas.
Ideally, none of the cylinder groups should ever become completely full.
The heuristic solution chosen is to
redirect block allocation
to a different cylinder group
when a file exceeds 48 kilobytes,
and at every megabyte thereafter.*
.FS
* The first spillover point at 48 kilobytes is the point
at which a file on a 4096 byte block file system first
requires a single indirect block. This appears to be
a natural first point at which to redirect block allocation.
The other spillover points are chosen with the intent of
forcing block allocation to be redirected when a
file has used about 25% of the data blocks in a cylinder group.
In observing the new file system in day-to-day use, the heuristics appear
to work well in minimizing the number of completely filled
cylinder groups.
.FE
The newly chosen cylinder group is selected from those cylinder
groups that have a greater than average number of free blocks left.
Although big files tend to be spread out over the disk,
a megabyte of data is typically accessible before
a long seek must be performed,
and the cost of one long seek per megabyte is small.
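.PP
One plausible reading of the redirection rule is sketched below;
the function is illustrative only and the exact spill points used by
the implementation may differ.
.DS
/*
 * Redirect when the file reaches 48 kilobytes and at each megabyte
 * boundary thereafter.
 */
int
should_redirect(long filesize)
{
    if (filesize < 48 * 1024L)
        return (0);
    if (filesize == 48 * 1024L)
        return (1);
    return ((filesize - 48 * 1024L) % (1024L * 1024L) == 0);
}
.DE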
.PP
The global policy routines call local allocation routines with
requests for specific blocks.
The local allocation routines will
always allocate the requested block
if it is free; otherwise they allocate
a free block of the requested size that is
rotationally closest to the requested block.
If the global layout policies had complete information,
they could always request unused blocks and
the allocation routines would be reduced to simple bookkeeping.
However, maintaining complete information is costly;
thus the implementation of the global layout policy
uses heuristics that employ only partial information.
.PP
If a requested block is not available, the local allocator uses
a four-level allocation strategy (the last two levels are sketched after the list):
.IP 1)
Use the next available block rotationally closest
to the requested block on the same cylinder. It is assumed
here that head switching time is zero. On disk
controllers where this is not the case, it may be possible
to incorporate the time required to switch between disk platters
when constructing the rotational layout tables. This, however,
has not yet been tried.
.IP 2)
If there are no blocks available on the same cylinder,
use a block within the same cylinder group.
.IP 3)
If that cylinder group is entirely full,
quadratically hash the cylinder group number to choose
another cylinder group to look for a free block.
.IP 4)
Finally, if the hash fails, apply an exhaustive search
to all cylinder groups.
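.PP
The cylinder group search in levels 3 and 4 can be sketched as follows;
\fIcg_has_space\fP() is a hypothetical predicate standing in for the
free block summary checks.
.DS
int
find_cg(int prefcg, int ncg, int (*cg_has_space)(int))
{
    int i, cg;

    cg = prefcg;
    for (i = 1; i < ncg; i *= 2) {  /* level 3: quadratic rehash */
        cg = (cg + i) % ncg;
        if (cg_has_space(cg))
            return (cg);
    }
    for (cg = 0; cg < ncg; cg++)    /* level 4: exhaustive search */
        if (cg_has_space(cg))
            return (cg);
    return (-1);
}
.DE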
.PP
Quadratic hash is used because of its speed in finding
unused slots in nearly full hash tables [Knuth75].
File systems that are parameterized to maintain at least
10% free space rarely use this strategy.
File systems that are run without maintaining any free
space typically have so few free blocks that almost any
allocation is random;
the most important characteristic of
the strategy used under such conditions is that the strategy be fast.
.ds RH Performance
.sp 2
.ne 1i