Cross Reference: /freebsd-9.3-release/share/doc/papers/sysperf/4.t

Deleted Added

sdiff udiff text old ( 21477 ) new ( 108533 )

full compact

4.t (21477)	4.t (108533)
1.\" Copyright (c) 1985 The Regents of the University of California. 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. --- 17 unchanged lines hidden (view full) --- 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" @(#)4.t 5.1 (Berkeley) 4/17/91 33.\"	1.\" Copyright (c) 1985 The Regents of the University of California. 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. --- 17 unchanged lines hidden (view full) --- 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" @(#)4.t 5.1 (Berkeley) 4/17/91 33.\"
	34.\" $FreeBSD: head/share/doc/papers/sysperf/4.t 108533 2003-01-01 18:49:04Z schweikh $ 35.\"
34.ds RH Performance Improvements 35.NH 36Performance Improvements 37.PP	36.ds RH Performance Improvements 37.NH 38Performance Improvements 39.PP
38This section outlines the changes made to the system	40This section outlines the changes made to the system
39since the 4.2BSD distribution. 40The changes reported here were made in response 41to the problems described in Section 3. 42The improvements fall into two major classes; 43changes to the kernel that are described in this section, 44and changes to the system libraries and utilities that are 45described in the following section. 46.NH 2 --- 27 unchanged lines hidden (view full) --- 74.FE 75An inspection of \fInamei\fP shows that 76it consists of two nested loops. 77The outer loop is traversed once per pathname component. 78The inner loop performs a linear search through a directory looking 79for a particular pathname component. 80.PP 81Our first idea was to reduce the number of iterations	41since the 4.2BSD distribution. 42The changes reported here were made in response 43to the problems described in Section 3. 44The improvements fall into two major classes; 45changes to the kernel that are described in this section, 46and changes to the system libraries and utilities that are 47described in the following section. 48.NH 2 --- 27 unchanged lines hidden (view full) --- 76.FE 77An inspection of \fInamei\fP shows that 78it consists of two nested loops. 79The outer loop is traversed once per pathname component. 80The inner loop performs a linear search through a directory looking 81for a particular pathname component. 82.PP 83Our first idea was to reduce the number of iterations
82around the inner loop of \fInamei\fP by observing that many programs	84around the inner loop of \fInamei\fP by observing that many programs
83step through a directory performing an operation on each entry in turn. 84To improve performance for processes doing directory scans, 85the system keeps track of the directory offset of the last component of the 86most recently translated path name for each process. 87If the next name the process requests is in the same directory, 88the search is started from the offset that the previous name was found 89(instead of from the beginning of the directory). 90Changing directories invalidates the cache, as 91does modifying the directory. 92For programs that step sequentially through a directory with 93.EQ 94delim $$ 95.EN 96$N$ files, search time decreases from $O ( N sup 2 )$ to $O(N)$. 97.EQ 98delim off 99.EN 100.PP 101The cost of the cache is about 20 lines of code	85step through a directory performing an operation on each entry in turn. 86To improve performance for processes doing directory scans, 87the system keeps track of the directory offset of the last component of the 88most recently translated path name for each process. 89If the next name the process requests is in the same directory, 90the search is started from the offset that the previous name was found 91(instead of from the beginning of the directory). 92Changing directories invalidates the cache, as 93does modifying the directory. 94For programs that step sequentially through a directory with 95.EQ 96delim $$ 97.EN 98$N$ files, search time decreases from $O ( N sup 2 )$ to $O(N)$. 99.EQ 100delim off 101.EN 102.PP 103The cost of the cache is about 20 lines of code
102(about 0.2 kilobytes)	104(about 0.2 kilobytes)
103and 16 bytes per process, with the cached data 104stored in a process's \fIuser\fP vector. 105.PP 106As a quick benchmark to verify the maximum effectiveness of the 107cache we ran ``ls \-l'' 108on a directory containing 600 files. 109Before the per-process cache this command 110used 22.3 seconds of system time. --- 46 unchanged lines hidden (view full) --- 157entry in the inode table. 158.FE 159This has the effect of eliminating the inner loop of \fInamei\fP. 160For each path name component, 161\fInamei\fP first looks in its cache of recent translations 162for the needed name. 163If it exists, the directory search can be completely eliminated. 164.PP	105and 16 bytes per process, with the cached data 106stored in a process's \fIuser\fP vector. 107.PP 108As a quick benchmark to verify the maximum effectiveness of the 109cache we ran ``ls \-l'' 110on a directory containing 600 files. 111Before the per-process cache this command 112used 22.3 seconds of system time. --- 46 unchanged lines hidden (view full) --- 159entry in the inode table. 160.FE 161This has the effect of eliminating the inner loop of \fInamei\fP. 162For each path name component, 163\fInamei\fP first looks in its cache of recent translations 164for the needed name. 165If it exists, the directory search can be completely eliminated. 166.PP
165The system already maintained a cache of recently accessed inodes,	167The system already maintained a cache of recently accessed inodes,
166so the initial name cache 167maintained a simple name-inode association that was used to 168check each component of a path name during name translations. 169We considered implementing the cache by tagging each inode 170with its most recently translated name, 171but eventually decided to have a separate data structure that 172kept names with pointers to the inode table. 173Tagging inodes has two drawbacks; 174many inodes such as those associated with login ports remain in	168so the initial name cache 169maintained a simple name-inode association that was used to 170check each component of a path name during name translations. 171We considered implementing the cache by tagging each inode 172with its most recently translated name, 173but eventually decided to have a separate data structure that 174kept names with pointers to the inode table. 175Tagging inodes has two drawbacks; 176many inodes such as those associated with login ports remain in
175the inode table for a long period of time, but are never looked	177the inode table for a long period of time, but are never looked
176up by name. 177Other inodes, such as those describing directories are looked up 178frequently by many different names (\fIe.g.\fP ``..''). 179By keeping a separate table of names, the cache can 180truly reflect the most recently used names. 181An added benefit is that the table can be sized independently 182of the inode table, so that machines with small amounts of memory 183can reduce the size of the cache (or even eliminate it) 184without modifying the inode table structure. 185.PP	178up by name. 179Other inodes, such as those describing directories are looked up 180frequently by many different names (\fIe.g.\fP ``..''). 181By keeping a separate table of names, the cache can 182truly reflect the most recently used names. 183An added benefit is that the table can be sized independently 184of the inode table, so that machines with small amounts of memory 185can reduce the size of the cache (or even eliminate it) 186without modifying the inode table structure. 187.PP
186Another issue to be considered is how the name cache should	188Another issue to be considered is how the name cache should
187hold references to the inode table. 188Normally processes hold ``hard references'' by incrementing the 189reference count in the inode they reference. 190Since the system reuses only inodes with zero reference counts, 191a hard reference insures that the inode pointer will remain valid. 192However, if the name cache holds hard references, 193it is limited to some fraction of the size of the inode table, 194since some inodes must be left free for new files. --- 19 unchanged lines hidden (view full) --- 214Since the name cache holds only soft references, 215it may be sized independent of the size of the inode table. 216A final benefit of using capabilities is that all 217cached names for an inode may be invalidated without 218searching through the entire cache; 219instead all you need to do is assign a new capability to the inode. 220.PP 221The cost of the name cache is about 200 lines of code	189hold references to the inode table. 190Normally processes hold ``hard references'' by incrementing the 191reference count in the inode they reference. 192Since the system reuses only inodes with zero reference counts, 193a hard reference insures that the inode pointer will remain valid. 194However, if the name cache holds hard references, 195it is limited to some fraction of the size of the inode table, 196since some inodes must be left free for new files. --- 19 unchanged lines hidden (view full) --- 216Since the name cache holds only soft references, 217it may be sized independent of the size of the inode table. 218A final benefit of using capabilities is that all 219cached names for an inode may be invalidated without 220searching through the entire cache; 221instead all you need to do is assign a new capability to the inode. 222.PP 223The cost of the name cache is about 200 lines of code
222(about 1.2 kilobytes)	224(about 1.2 kilobytes)
223and 48 bytes per cache entry. 224Depending on the size of the system, 225about 200 to 1000 entries will normally be configured, 226using 10-50 kilobytes of physical memory. 227The name cache is resident in memory at all times. 228.PP 229After adding the system wide name cache we reran ``ls \-l'' 230on the same directory. --- 68 unchanged lines hidden (view full) --- 299It is more efficient to use a device's silo input mode, 300since this generates fewer interrupts than handling each character 301as a separate interrupt. 302Since a given dialup port may switch between uucp logins and user logins, 303it is impossible to statically select the most efficient input mode to use. 304.PP 305We therefore changed the terminal multiplexor handlers 306to dynamically choose between the use of the silo and the use of	225and 48 bytes per cache entry. 226Depending on the size of the system, 227about 200 to 1000 entries will normally be configured, 228using 10-50 kilobytes of physical memory. 229The name cache is resident in memory at all times. 230.PP 231After adding the system wide name cache we reran ``ls \-l'' 232on the same directory. --- 68 unchanged lines hidden (view full) --- 301It is more efficient to use a device's silo input mode, 302since this generates fewer interrupts than handling each character 303as a separate interrupt. 304Since a given dialup port may switch between uucp logins and user logins, 305it is impossible to statically select the most efficient input mode to use. 306.PP 307We therefore changed the terminal multiplexor handlers 308to dynamically choose between the use of the silo and the use of
307per-character interrupts.	309per-character interrupts.
308At low input rates the handler processes characters on an 309interrupt basis, avoiding the overhead 310of checking each interface on each clock interrupt. 311During periods of sustained input, the handler enables the silo 312and starts a timer to drain input. 313This timer runs less frequently than the clock interrupts, 314and is used only when there is a substantial amount of input. 315The transition from using silos to an interrupt per character is --- 68 unchanged lines hidden (view full) --- 384Clock Handling 385.PP 386The hardware clock interrupts the processor 100 times per second 387at high priority. 388As most of the clock-based events need not be done at high priority, 389the system schedules a lower priority software interrupt to do the less 390time-critical events such as cpu scheduling and timeout processing. 391Often there are no such events, and the software interrupt handler	310At low input rates the handler processes characters on an 311interrupt basis, avoiding the overhead 312of checking each interface on each clock interrupt. 313During periods of sustained input, the handler enables the silo 314and starts a timer to drain input. 315This timer runs less frequently than the clock interrupts, 316and is used only when there is a substantial amount of input. 317The transition from using silos to an interrupt per character is --- 68 unchanged lines hidden (view full) --- 386Clock Handling 387.PP 388The hardware clock interrupts the processor 100 times per second 389at high priority. 390As most of the clock-based events need not be done at high priority, 391the system schedules a lower priority software interrupt to do the less 392time-critical events such as cpu scheduling and timeout processing. 393Often there are no such events, and the software interrupt handler
392finds nothing to do and returns.	394finds nothing to do and returns.
393The high priority event now checks to see if there are low priority 394events to process; 395if there is nothing to do, the software interrupt is not requested. 396Often, the high priority interrupt occurs during a period when the 397machine had been running at low priority. 398Rather than posting a software interrupt that would occur as 399soon as it returns, 400the hardware clock interrupt handler simply lowers the processor priority --- 127 unchanged lines hidden (view full) --- 528.PP 529Another optimization usually done by optimizing compilers 530is inline expansion of small or frequently used routines. 531In past Berkeley systems this has been done by using \fIsed\fP to 532run over the assembly language and replace calls to small 533routines with the code for the body of the routine, often 534a single VAX instruction. 535While this optimization eliminated the cost of the subroutine	395The high priority event now checks to see if there are low priority 396events to process; 397if there is nothing to do, the software interrupt is not requested. 398Often, the high priority interrupt occurs during a period when the 399machine had been running at low priority. 400Rather than posting a software interrupt that would occur as 401soon as it returns, 402the hardware clock interrupt handler simply lowers the processor priority --- 127 unchanged lines hidden (view full) --- 530.PP 531Another optimization usually done by optimizing compilers 532is inline expansion of small or frequently used routines. 533In past Berkeley systems this has been done by using \fIsed\fP to 534run over the assembly language and replace calls to small 535routines with the code for the body of the routine, often 536a single VAX instruction. 537While this optimization eliminated the cost of the subroutine
536call and return,	538call and return,
537it did not eliminate the pushing and popping of several arguments 538to the routine. 539The \fIsed\fP script has been replaced by a more intelligent expander, 540\fIinline\fP, that merges the pushes and pops into moves to registers. 541For example, if the C code 542.DS 543if (scanc(map[i], 1, 47, i - 63)) 544.DE --- 37 unchanged lines hidden (view full) --- 582Most systems today have 200 to 1000 buffers. 583Consequently, most of the hash chains contained 584ten to a hundred buffers each! 585The running time of the low level buffer management primitives was 586dramatically improved simply by enlarging the size of the hash table. 587.NH 2 588Improvements to Libraries and Utilities 589.PP	539it did not eliminate the pushing and popping of several arguments 540to the routine. 541The \fIsed\fP script has been replaced by a more intelligent expander, 542\fIinline\fP, that merges the pushes and pops into moves to registers. 543For example, if the C code 544.DS 545if (scanc(map[i], 1, 47, i - 63)) 546.DE --- 37 unchanged lines hidden (view full) --- 584Most systems today have 200 to 1000 buffers. 585Consequently, most of the hash chains contained 586ten to a hundred buffers each! 587The running time of the low level buffer management primitives was 588dramatically improved simply by enlarging the size of the hash table. 589.NH 2 590Improvements to Libraries and Utilities 591.PP
590Intuitively, changes to the kernel would seem to have the greatest	592Intuitively, changes to the kernel would seem to have the greatest
591payoff since they affect all programs that run on the system. 592However, the kernel has been tuned many times before, so the 593opportunity for significant improvement was small. 594By contrast, many of the libraries and utilities had never been tuned. 595For example, we found utilities that spent 90% of their 596running time doing single character read system calls. 597Changing the utility to use the standard I/O library cut the 598running time by a factor of five! --- 57 unchanged lines hidden (view full) --- 656.NH 3 657Mail System 658.PP 659The problems discussed in section 3.1.1 prompted significant work 660on the entire mail system. The first problem identified was a bug 661in the \fIsyslog\fP program. The mail delivery program, \fIsendmail\fP 662logs all mail transactions through this process with the 4.2BSD interprocess 663communication facilities. \fISyslog\fP then records the information in	593payoff since they affect all programs that run on the system. 594However, the kernel has been tuned many times before, so the 595opportunity for significant improvement was small. 596By contrast, many of the libraries and utilities had never been tuned. 597For example, we found utilities that spent 90% of their 598running time doing single character read system calls. 599Changing the utility to use the standard I/O library cut the 600running time by a factor of five! --- 57 unchanged lines hidden (view full) --- 658.NH 3 659Mail System 660.PP 661The problems discussed in section 3.1.1 prompted significant work 662on the entire mail system. The first problem identified was a bug 663in the \fIsyslog\fP program. The mail delivery program, \fIsendmail\fP 664logs all mail transactions through this process with the 4.2BSD interprocess 665communication facilities. \fISyslog\fP then records the information in
664a log file. Unfortunately, \fIsyslog\fP was performing a \fIsync\fP	666a log file. Unfortunately, \fIsyslog\fP was performing a \fIsync\fP
665operation after each message it received, whether it was logged to a file 666or not. This wreaked havoc on the effectiveness of the 667buffer cache and explained, to a large 668extent, why sending mail to large distribution lists generated such a 669heavy load on the system (one syslog message was generated for each 670message recipient causing almost a continuous sequence of sync operations). 671.PP 672The hashed data base files were	667operation after each message it received, whether it was logged to a file 668or not. This wreaked havoc on the effectiveness of the 669buffer cache and explained, to a large 670extent, why sending mail to large distribution lists generated such a 671heavy load on the system (one syslog message was generated for each 672message recipient causing almost a continuous sequence of sync operations). 673.PP 674The hashed data base files were
673installed in all mail programs, resulting in a order of magnitude	675installed in all mail programs, resulting in an order of magnitude
674speedup on large distribution lists. The code in \fI/bin/mail\fP 675that notifies the \fIcomsat\fP program when mail has been delivered to 676a user was changed to cache host table lookups, resulting in a similar	676speedup on large distribution lists. The code in \fI/bin/mail\fP 677that notifies the \fIcomsat\fP program when mail has been delivered to 678a user was changed to cache host table lookups, resulting in a similar
677speedup on large distribution lists.	679speedup on large distribution lists.
678.PP 679Next, the file locking facilities 680provided in 4.2BSD, \fIflock\fP\\|(2), were used in place of the old	680.PP 681Next, the file locking facilities 682provided in 4.2BSD, \fIflock\fP\\|(2), were used in place of the old
681locking mechanism.	683locking mechanism.
682The mail system previously used \fIlink\fP and \fIunlink\fP in	684The mail system previously used \fIlink\fP and \fIunlink\fP in
683implementing file locking primitives.	685implementing file locking primitives.
684Because these operations usually modify the contents of directories 685they require synchronous disk operations and cannot take 686advantage of the name cache maintained by the system. 687Unlink requires that the entry be found in the directory so that	686Because these operations usually modify the contents of directories 687they require synchronous disk operations and cannot take 688advantage of the name cache maintained by the system. 689Unlink requires that the entry be found in the directory so that
688it can be removed;	690it can be removed;
689link requires that the directory be scanned to insure that the name 690does not already exist. 691By contrast the advisory locking facility in 4.2BSD is 692efficient because it is all done with in-memory tables. 693Thus, the mail system was modified to use the file locking primitives. 694This yielded another 10% cut in the basic overhead of delivering mail. 695Extensive profiling and tuning of \fIsendmail\fP and 696compiling it without debugging code reduced the overhead by another 20%. 697.NH 3 698Network Servers 699.PP 700With the introduction of the network facilities in 4.2BSD,	691link requires that the directory be scanned to insure that the name 692does not already exist. 693By contrast the advisory locking facility in 4.2BSD is 694efficient because it is all done with in-memory tables. 695Thus, the mail system was modified to use the file locking primitives. 696This yielded another 10% cut in the basic overhead of delivering mail. 697Extensive profiling and tuning of \fIsendmail\fP and 698compiling it without debugging code reduced the overhead by another 20%. 699.NH 3 700Network Servers 701.PP 702With the introduction of the network facilities in 4.2BSD,
701a myriad of services became available, each of which	703a myriad of services became available, each of which
702required its own daemon process. 703Many of these daemons were rarely if ever used, 704yet they lay asleep in the process table consuming 705system resources and generally slowing down response. 706Rather than having many servers started at boot time, a single server, 707\fIinetd\fP was substituted. 708This process reads a simple configuration file 709that specifies the services the system is willing to support --- 65 unchanged lines hidden ---	704required its own daemon process. 705Many of these daemons were rarely if ever used, 706yet they lay asleep in the process table consuming 707system resources and generally slowing down response. 708Rather than having many servers started at boot time, a single server, 709\fIinetd\fP was substituted. 710This process reads a simple configuration file 711that specifies the services the system is willing to support --- 65 unchanged lines hidden ---