Deleted Added
full compact
4.t (21477) 4.t (108533)
1.\" Copyright (c) 1985 The Regents of the University of California.
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\" notice, this list of conditions and the following disclaimer.

--- 17 unchanged lines hidden (view full) ---

26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\" @(#)4.t 5.1 (Berkeley) 4/17/91
33.\"
1.\" Copyright (c) 1985 The Regents of the University of California.
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\" notice, this list of conditions and the following disclaimer.

--- 17 unchanged lines hidden (view full) ---

26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\" @(#)4.t 5.1 (Berkeley) 4/17/91
33.\"
34.\" $FreeBSD: head/share/doc/papers/sysperf/4.t 108533 2003-01-01 18:49:04Z schweikh $
35.\"
34.ds RH Performance Improvements
35.NH
36Performance Improvements
37.PP
36.ds RH Performance Improvements
37.NH
38Performance Improvements
39.PP
38This section outlines the changes made to the system
40This section outlines the changes made to the system
39since the 4.2BSD distribution.
40The changes reported here were made in response
41to the problems described in Section 3.
42The improvements fall into two major classes;
43changes to the kernel that are described in this section,
44and changes to the system libraries and utilities that are
45described in the following section.
46.NH 2

--- 27 unchanged lines hidden (view full) ---

74.FE
75An inspection of \fInamei\fP shows that
76it consists of two nested loops.
77The outer loop is traversed once per pathname component.
78The inner loop performs a linear search through a directory looking
79for a particular pathname component.
80.PP
81Our first idea was to reduce the number of iterations
41since the 4.2BSD distribution.
42The changes reported here were made in response
43to the problems described in Section 3.
44The improvements fall into two major classes;
45changes to the kernel that are described in this section,
46and changes to the system libraries and utilities that are
47described in the following section.
48.NH 2

--- 27 unchanged lines hidden (view full) ---

76.FE
77An inspection of \fInamei\fP shows that
78it consists of two nested loops.
79The outer loop is traversed once per pathname component.
80The inner loop performs a linear search through a directory looking
81for a particular pathname component.
82.PP
83Our first idea was to reduce the number of iterations
82around the inner loop of \fInamei\fP by observing that many programs
84around the inner loop of \fInamei\fP by observing that many programs
83step through a directory performing an operation on each entry in turn.
84To improve performance for processes doing directory scans,
85the system keeps track of the directory offset of the last component of the
86most recently translated path name for each process.
87If the next name the process requests is in the same directory,
88the search is started from the offset that the previous name was found
89(instead of from the beginning of the directory).
90Changing directories invalidates the cache, as
91does modifying the directory.
92For programs that step sequentially through a directory with
93.EQ
94delim $$
95.EN
96$N$ files, search time decreases from $O ( N sup 2 )$ to $O(N)$.
97.EQ
98delim off
99.EN
100.PP
101The cost of the cache is about 20 lines of code
85step through a directory performing an operation on each entry in turn.
86To improve performance for processes doing directory scans,
87the system keeps track of the directory offset of the last component of the
88most recently translated path name for each process.
89If the next name the process requests is in the same directory,
90the search is started from the offset that the previous name was found
91(instead of from the beginning of the directory).
92Changing directories invalidates the cache, as
93does modifying the directory.
94For programs that step sequentially through a directory with
95.EQ
96delim $$
97.EN
98$N$ files, search time decreases from $O ( N sup 2 )$ to $O(N)$.
99.EQ
100delim off
101.EN
102.PP
103The cost of the cache is about 20 lines of code
102(about 0.2 kilobytes)
104(about 0.2 kilobytes)
103and 16 bytes per process, with the cached data
104stored in a process's \fIuser\fP vector.
105.PP
106As a quick benchmark to verify the maximum effectiveness of the
107cache we ran ``ls \-l''
108on a directory containing 600 files.
109Before the per-process cache this command
110used 22.3 seconds of system time.

--- 46 unchanged lines hidden (view full) ---

157entry in the inode table.
158.FE
159This has the effect of eliminating the inner loop of \fInamei\fP.
160For each path name component,
161\fInamei\fP first looks in its cache of recent translations
162for the needed name.
163If it exists, the directory search can be completely eliminated.
164.PP
105and 16 bytes per process, with the cached data
106stored in a process's \fIuser\fP vector.
107.PP
108As a quick benchmark to verify the maximum effectiveness of the
109cache we ran ``ls \-l''
110on a directory containing 600 files.
111Before the per-process cache this command
112used 22.3 seconds of system time.

--- 46 unchanged lines hidden (view full) ---

159entry in the inode table.
160.FE
161This has the effect of eliminating the inner loop of \fInamei\fP.
162For each path name component,
163\fInamei\fP first looks in its cache of recent translations
164for the needed name.
165If it exists, the directory search can be completely eliminated.
166.PP
165The system already maintained a cache of recently accessed inodes,
167The system already maintained a cache of recently accessed inodes,
166so the initial name cache
167maintained a simple name-inode association that was used to
168check each component of a path name during name translations.
169We considered implementing the cache by tagging each inode
170with its most recently translated name,
171but eventually decided to have a separate data structure that
172kept names with pointers to the inode table.
173Tagging inodes has two drawbacks;
174many inodes such as those associated with login ports remain in
168so the initial name cache
169maintained a simple name-inode association that was used to
170check each component of a path name during name translations.
171We considered implementing the cache by tagging each inode
172with its most recently translated name,
173but eventually decided to have a separate data structure that
174kept names with pointers to the inode table.
175Tagging inodes has two drawbacks;
176many inodes such as those associated with login ports remain in
175the inode table for a long period of time, but are never looked
177the inode table for a long period of time, but are never looked
176up by name.
177Other inodes, such as those describing directories are looked up
178frequently by many different names (\fIe.g.\fP ``..'').
179By keeping a separate table of names, the cache can
180truly reflect the most recently used names.
181An added benefit is that the table can be sized independently
182of the inode table, so that machines with small amounts of memory
183can reduce the size of the cache (or even eliminate it)
184without modifying the inode table structure.
185.PP
178up by name.
179Other inodes, such as those describing directories are looked up
180frequently by many different names (\fIe.g.\fP ``..'').
181By keeping a separate table of names, the cache can
182truly reflect the most recently used names.
183An added benefit is that the table can be sized independently
184of the inode table, so that machines with small amounts of memory
185can reduce the size of the cache (or even eliminate it)
186without modifying the inode table structure.
187.PP
186Another issue to be considered is how the name cache should
188Another issue to be considered is how the name cache should
187hold references to the inode table.
188Normally processes hold ``hard references'' by incrementing the
189reference count in the inode they reference.
190Since the system reuses only inodes with zero reference counts,
191a hard reference insures that the inode pointer will remain valid.
192However, if the name cache holds hard references,
193it is limited to some fraction of the size of the inode table,
194since some inodes must be left free for new files.

--- 19 unchanged lines hidden (view full) ---

214Since the name cache holds only soft references,
215it may be sized independent of the size of the inode table.
216A final benefit of using capabilities is that all
217cached names for an inode may be invalidated without
218searching through the entire cache;
219instead all you need to do is assign a new capability to the inode.
220.PP
221The cost of the name cache is about 200 lines of code
189hold references to the inode table.
190Normally processes hold ``hard references'' by incrementing the
191reference count in the inode they reference.
192Since the system reuses only inodes with zero reference counts,
193a hard reference insures that the inode pointer will remain valid.
194However, if the name cache holds hard references,
195it is limited to some fraction of the size of the inode table,
196since some inodes must be left free for new files.

--- 19 unchanged lines hidden (view full) ---

216Since the name cache holds only soft references,
217it may be sized independent of the size of the inode table.
218A final benefit of using capabilities is that all
219cached names for an inode may be invalidated without
220searching through the entire cache;
221instead all you need to do is assign a new capability to the inode.
222.PP
223The cost of the name cache is about 200 lines of code
222(about 1.2 kilobytes)
224(about 1.2 kilobytes)
223and 48 bytes per cache entry.
224Depending on the size of the system,
225about 200 to 1000 entries will normally be configured,
226using 10-50 kilobytes of physical memory.
227The name cache is resident in memory at all times.
228.PP
229After adding the system wide name cache we reran ``ls \-l''
230on the same directory.

--- 68 unchanged lines hidden (view full) ---

299It is more efficient to use a device's silo input mode,
300since this generates fewer interrupts than handling each character
301as a separate interrupt.
302Since a given dialup port may switch between uucp logins and user logins,
303it is impossible to statically select the most efficient input mode to use.
304.PP
305We therefore changed the terminal multiplexor handlers
306to dynamically choose between the use of the silo and the use of
225and 48 bytes per cache entry.
226Depending on the size of the system,
227about 200 to 1000 entries will normally be configured,
228using 10-50 kilobytes of physical memory.
229The name cache is resident in memory at all times.
230.PP
231After adding the system wide name cache we reran ``ls \-l''
232on the same directory.

--- 68 unchanged lines hidden (view full) ---

301It is more efficient to use a device's silo input mode,
302since this generates fewer interrupts than handling each character
303as a separate interrupt.
304Since a given dialup port may switch between uucp logins and user logins,
305it is impossible to statically select the most efficient input mode to use.
306.PP
307We therefore changed the terminal multiplexor handlers
308to dynamically choose between the use of the silo and the use of
307per-character interrupts.
309per-character interrupts.
308At low input rates the handler processes characters on an
309interrupt basis, avoiding the overhead
310of checking each interface on each clock interrupt.
311During periods of sustained input, the handler enables the silo
312and starts a timer to drain input.
313This timer runs less frequently than the clock interrupts,
314and is used only when there is a substantial amount of input.
315The transition from using silos to an interrupt per character is

--- 68 unchanged lines hidden (view full) ---

384Clock Handling
385.PP
386The hardware clock interrupts the processor 100 times per second
387at high priority.
388As most of the clock-based events need not be done at high priority,
389the system schedules a lower priority software interrupt to do the less
390time-critical events such as cpu scheduling and timeout processing.
391Often there are no such events, and the software interrupt handler
310At low input rates the handler processes characters on an
311interrupt basis, avoiding the overhead
312of checking each interface on each clock interrupt.
313During periods of sustained input, the handler enables the silo
314and starts a timer to drain input.
315This timer runs less frequently than the clock interrupts,
316and is used only when there is a substantial amount of input.
317The transition from using silos to an interrupt per character is

--- 68 unchanged lines hidden (view full) ---

386Clock Handling
387.PP
388The hardware clock interrupts the processor 100 times per second
389at high priority.
390As most of the clock-based events need not be done at high priority,
391the system schedules a lower priority software interrupt to do the less
392time-critical events such as cpu scheduling and timeout processing.
393Often there are no such events, and the software interrupt handler
392finds nothing to do and returns.
394finds nothing to do and returns.
393The high priority event now checks to see if there are low priority
394events to process;
395if there is nothing to do, the software interrupt is not requested.
396Often, the high priority interrupt occurs during a period when the
397machine had been running at low priority.
398Rather than posting a software interrupt that would occur as
399soon as it returns,
400the hardware clock interrupt handler simply lowers the processor priority

--- 127 unchanged lines hidden (view full) ---

528.PP
529Another optimization usually done by optimizing compilers
530is inline expansion of small or frequently used routines.
531In past Berkeley systems this has been done by using \fIsed\fP to
532run over the assembly language and replace calls to small
533routines with the code for the body of the routine, often
534a single VAX instruction.
535While this optimization eliminated the cost of the subroutine
395The high priority event now checks to see if there are low priority
396events to process;
397if there is nothing to do, the software interrupt is not requested.
398Often, the high priority interrupt occurs during a period when the
399machine had been running at low priority.
400Rather than posting a software interrupt that would occur as
401soon as it returns,
402the hardware clock interrupt handler simply lowers the processor priority

--- 127 unchanged lines hidden (view full) ---

530.PP
531Another optimization usually done by optimizing compilers
532is inline expansion of small or frequently used routines.
533In past Berkeley systems this has been done by using \fIsed\fP to
534run over the assembly language and replace calls to small
535routines with the code for the body of the routine, often
536a single VAX instruction.
537While this optimization eliminated the cost of the subroutine
536call and return,
538call and return,
537it did not eliminate the pushing and popping of several arguments
538to the routine.
539The \fIsed\fP script has been replaced by a more intelligent expander,
540\fIinline\fP, that merges the pushes and pops into moves to registers.
541For example, if the C code
542.DS
543if (scanc(map[i], 1, 47, i - 63))
544.DE

--- 37 unchanged lines hidden (view full) ---

582Most systems today have 200 to 1000 buffers.
583Consequently, most of the hash chains contained
584ten to a hundred buffers each!
585The running time of the low level buffer management primitives was
586dramatically improved simply by enlarging the size of the hash table.
587.NH 2
588Improvements to Libraries and Utilities
589.PP
539it did not eliminate the pushing and popping of several arguments
540to the routine.
541The \fIsed\fP script has been replaced by a more intelligent expander,
542\fIinline\fP, that merges the pushes and pops into moves to registers.
543For example, if the C code
544.DS
545if (scanc(map[i], 1, 47, i - 63))
546.DE

--- 37 unchanged lines hidden (view full) ---

584Most systems today have 200 to 1000 buffers.
585Consequently, most of the hash chains contained
586ten to a hundred buffers each!
587The running time of the low level buffer management primitives was
588dramatically improved simply by enlarging the size of the hash table.
589.NH 2
590Improvements to Libraries and Utilities
591.PP
590Intuitively, changes to the kernel would seem to have the greatest
592Intuitively, changes to the kernel would seem to have the greatest
591payoff since they affect all programs that run on the system.
592However, the kernel has been tuned many times before, so the
593opportunity for significant improvement was small.
594By contrast, many of the libraries and utilities had never been tuned.
595For example, we found utilities that spent 90% of their
596running time doing single character read system calls.
597Changing the utility to use the standard I/O library cut the
598running time by a factor of five!

--- 57 unchanged lines hidden (view full) ---

656.NH 3
657Mail System
658.PP
659The problems discussed in section 3.1.1 prompted significant work
660on the entire mail system. The first problem identified was a bug
661in the \fIsyslog\fP program. The mail delivery program, \fIsendmail\fP
662logs all mail transactions through this process with the 4.2BSD interprocess
663communication facilities. \fISyslog\fP then records the information in
593payoff since they affect all programs that run on the system.
594However, the kernel has been tuned many times before, so the
595opportunity for significant improvement was small.
596By contrast, many of the libraries and utilities had never been tuned.
597For example, we found utilities that spent 90% of their
598running time doing single character read system calls.
599Changing the utility to use the standard I/O library cut the
600running time by a factor of five!

--- 57 unchanged lines hidden (view full) ---

658.NH 3
659Mail System
660.PP
661The problems discussed in section 3.1.1 prompted significant work
662on the entire mail system. The first problem identified was a bug
663in the \fIsyslog\fP program. The mail delivery program, \fIsendmail\fP
664logs all mail transactions through this process with the 4.2BSD interprocess
665communication facilities. \fISyslog\fP then records the information in
664a log file. Unfortunately, \fIsyslog\fP was performing a \fIsync\fP
666a log file. Unfortunately, \fIsyslog\fP was performing a \fIsync\fP
665operation after each message it received, whether it was logged to a file
666or not. This wreaked havoc on the effectiveness of the
667buffer cache and explained, to a large
668extent, why sending mail to large distribution lists generated such a
669heavy load on the system (one syslog message was generated for each
670message recipient causing almost a continuous sequence of sync operations).
671.PP
672The hashed data base files were
667operation after each message it received, whether it was logged to a file
668or not. This wreaked havoc on the effectiveness of the
669buffer cache and explained, to a large
670extent, why sending mail to large distribution lists generated such a
671heavy load on the system (one syslog message was generated for each
672message recipient causing almost a continuous sequence of sync operations).
673.PP
674The hashed data base files were
673installed in all mail programs, resulting in a order of magnitude
675installed in all mail programs, resulting in an order of magnitude
674speedup on large distribution lists. The code in \fI/bin/mail\fP
675that notifies the \fIcomsat\fP program when mail has been delivered to
676a user was changed to cache host table lookups, resulting in a similar
676speedup on large distribution lists. The code in \fI/bin/mail\fP
677that notifies the \fIcomsat\fP program when mail has been delivered to
678a user was changed to cache host table lookups, resulting in a similar
677speedup on large distribution lists.
679speedup on large distribution lists.
678.PP
679Next, the file locking facilities
680provided in 4.2BSD, \fIflock\fP\|(2), were used in place of the old
680.PP
681Next, the file locking facilities
682provided in 4.2BSD, \fIflock\fP\|(2), were used in place of the old
681locking mechanism.
683locking mechanism.
682The mail system previously used \fIlink\fP and \fIunlink\fP in
684The mail system previously used \fIlink\fP and \fIunlink\fP in
683implementing file locking primitives.
685implementing file locking primitives.
684Because these operations usually modify the contents of directories
685they require synchronous disk operations and cannot take
686advantage of the name cache maintained by the system.
687Unlink requires that the entry be found in the directory so that
686Because these operations usually modify the contents of directories
687they require synchronous disk operations and cannot take
688advantage of the name cache maintained by the system.
689Unlink requires that the entry be found in the directory so that
688it can be removed;
690it can be removed;
689link requires that the directory be scanned to insure that the name
690does not already exist.
691By contrast the advisory locking facility in 4.2BSD is
692efficient because it is all done with in-memory tables.
693Thus, the mail system was modified to use the file locking primitives.
694This yielded another 10% cut in the basic overhead of delivering mail.
695Extensive profiling and tuning of \fIsendmail\fP and
696compiling it without debugging code reduced the overhead by another 20%.
697.NH 3
698Network Servers
699.PP
700With the introduction of the network facilities in 4.2BSD,
691link requires that the directory be scanned to insure that the name
692does not already exist.
693By contrast the advisory locking facility in 4.2BSD is
694efficient because it is all done with in-memory tables.
695Thus, the mail system was modified to use the file locking primitives.
696This yielded another 10% cut in the basic overhead of delivering mail.
697Extensive profiling and tuning of \fIsendmail\fP and
698compiling it without debugging code reduced the overhead by another 20%.
699.NH 3
700Network Servers
701.PP
702With the introduction of the network facilities in 4.2BSD,
701a myriad of services became available, each of which
703a myriad of services became available, each of which
702required its own daemon process.
703Many of these daemons were rarely if ever used,
704yet they lay asleep in the process table consuming
705system resources and generally slowing down response.
706Rather than having many servers started at boot time, a single server,
707\fIinetd\fP was substituted.
708This process reads a simple configuration file
709that specifies the services the system is willing to support

--- 65 unchanged lines hidden ---
704required its own daemon process.
705Many of these daemons were rarely if ever used,
706yet they lay asleep in the process table consuming
707system resources and generally slowing down response.
708Rather than having many servers started at boot time, a single server,
709\fIinetd\fP was substituted.
710This process reads a simple configuration file
711that specifies the services the system is willing to support

--- 65 unchanged lines hidden ---