Deleted Added
sdiff udiff text old ( 21477 ) new ( 108533 )
full compact
1.\" Copyright (c) 1985 The Regents of the University of California.
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\" notice, this list of conditions and the following disclaimer.

--- 17 unchanged lines hidden (view full) ---

26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\" @(#)4.t 5.1 (Berkeley) 4/17/91
33.\"
34.\" $FreeBSD: head/share/doc/papers/sysperf/4.t 108533 2003-01-01 18:49:04Z schweikh $
35.\"
36.ds RH Performance Improvements
37.NH
38Performance Improvements
39.PP
40This section outlines the changes made to the system
41since the 4.2BSD distribution.
42The changes reported here were made in response
43to the problems described in Section 3.
44The improvements fall into two major classes;
45changes to the kernel that are described in this section,
46and changes to the system libraries and utilities that are
47described in the following section.
48.NH 2

--- 27 unchanged lines hidden (view full) ---

76.FE
77An inspection of \fInamei\fP shows that
78it consists of two nested loops.
79The outer loop is traversed once per pathname component.
80The inner loop performs a linear search through a directory looking
81for a particular pathname component.
82.PP
83Our first idea was to reduce the number of iterations
84around the inner loop of \fInamei\fP by observing that many programs
85step through a directory performing an operation on each entry in turn.
86To improve performance for processes doing directory scans,
87the system keeps track of the directory offset of the last component of the
88most recently translated path name for each process.
89If the next name the process requests is in the same directory,
90the search is started from the offset that the previous name was found
91(instead of from the beginning of the directory).
92Changing directories invalidates the cache, as
93does modifying the directory.
94For programs that step sequentially through a directory with
95.EQ
96delim $$
97.EN
98$N$ files, search time decreases from $O ( N sup 2 )$ to $O(N)$.
99.EQ
100delim off
101.EN
102.PP
103The cost of the cache is about 20 lines of code
104(about 0.2 kilobytes)
105and 16 bytes per process, with the cached data
106stored in a process's \fIuser\fP vector.
107.PP
108As a quick benchmark to verify the maximum effectiveness of the
109cache we ran ``ls \-l''
110on a directory containing 600 files.
111Before the per-process cache this command
112used 22.3 seconds of system time.

--- 46 unchanged lines hidden (view full) ---

159entry in the inode table.
160.FE
161This has the effect of eliminating the inner loop of \fInamei\fP.
162For each path name component,
163\fInamei\fP first looks in its cache of recent translations
164for the needed name.
165If it exists, the directory search can be completely eliminated.
166.PP
167The system already maintained a cache of recently accessed inodes,
168so the initial name cache
169maintained a simple name-inode association that was used to
170check each component of a path name during name translations.
171We considered implementing the cache by tagging each inode
172with its most recently translated name,
173but eventually decided to have a separate data structure that
174kept names with pointers to the inode table.
175Tagging inodes has two drawbacks;
176many inodes such as those associated with login ports remain in
177the inode table for a long period of time, but are never looked
178up by name.
179Other inodes, such as those describing directories are looked up
180frequently by many different names (\fIe.g.\fP ``..'').
181By keeping a separate table of names, the cache can
182truly reflect the most recently used names.
183An added benefit is that the table can be sized independently
184of the inode table, so that machines with small amounts of memory
185can reduce the size of the cache (or even eliminate it)
186without modifying the inode table structure.
187.PP
188Another issue to be considered is how the name cache should
189hold references to the inode table.
190Normally processes hold ``hard references'' by incrementing the
191reference count in the inode they reference.
192Since the system reuses only inodes with zero reference counts,
193a hard reference insures that the inode pointer will remain valid.
194However, if the name cache holds hard references,
195it is limited to some fraction of the size of the inode table,
196since some inodes must be left free for new files.

--- 19 unchanged lines hidden (view full) ---

216Since the name cache holds only soft references,
217it may be sized independent of the size of the inode table.
218A final benefit of using capabilities is that all
219cached names for an inode may be invalidated without
220searching through the entire cache;
221instead all you need to do is assign a new capability to the inode.
222.PP
223The cost of the name cache is about 200 lines of code
224(about 1.2 kilobytes)
225and 48 bytes per cache entry.
226Depending on the size of the system,
227about 200 to 1000 entries will normally be configured,
228using 10-50 kilobytes of physical memory.
229The name cache is resident in memory at all times.
230.PP
231After adding the system wide name cache we reran ``ls \-l''
232on the same directory.

--- 68 unchanged lines hidden (view full) ---

301It is more efficient to use a device's silo input mode,
302since this generates fewer interrupts than handling each character
303as a separate interrupt.
304Since a given dialup port may switch between uucp logins and user logins,
305it is impossible to statically select the most efficient input mode to use.
306.PP
307We therefore changed the terminal multiplexor handlers
308to dynamically choose between the use of the silo and the use of
309per-character interrupts.
310At low input rates the handler processes characters on an
311interrupt basis, avoiding the overhead
312of checking each interface on each clock interrupt.
313During periods of sustained input, the handler enables the silo
314and starts a timer to drain input.
315This timer runs less frequently than the clock interrupts,
316and is used only when there is a substantial amount of input.
317The transition from using silos to an interrupt per character is

--- 68 unchanged lines hidden (view full) ---

386Clock Handling
387.PP
388The hardware clock interrupts the processor 100 times per second
389at high priority.
390As most of the clock-based events need not be done at high priority,
391the system schedules a lower priority software interrupt to do the less
392time-critical events such as cpu scheduling and timeout processing.
393Often there are no such events, and the software interrupt handler
394finds nothing to do and returns.
395The high priority event now checks to see if there are low priority
396events to process;
397if there is nothing to do, the software interrupt is not requested.
398Often, the high priority interrupt occurs during a period when the
399machine had been running at low priority.
400Rather than posting a software interrupt that would occur as
401soon as it returns,
402the hardware clock interrupt handler simply lowers the processor priority

--- 127 unchanged lines hidden (view full) ---

530.PP
531Another optimization usually done by optimizing compilers
532is inline expansion of small or frequently used routines.
533In past Berkeley systems this has been done by using \fIsed\fP to
534run over the assembly language and replace calls to small
535routines with the code for the body of the routine, often
536a single VAX instruction.
537While this optimization eliminated the cost of the subroutine
538call and return,
539it did not eliminate the pushing and popping of several arguments
540to the routine.
541The \fIsed\fP script has been replaced by a more intelligent expander,
542\fIinline\fP, that merges the pushes and pops into moves to registers.
543For example, if the C code
544.DS
545if (scanc(map[i], 1, 47, i - 63))
546.DE

--- 37 unchanged lines hidden (view full) ---

584Most systems today have 200 to 1000 buffers.
585Consequently, most of the hash chains contained
586ten to a hundred buffers each!
587The running time of the low level buffer management primitives was
588dramatically improved simply by enlarging the size of the hash table.
589.NH 2
590Improvements to Libraries and Utilities
591.PP
592Intuitively, changes to the kernel would seem to have the greatest
593payoff since they affect all programs that run on the system.
594However, the kernel has been tuned many times before, so the
595opportunity for significant improvement was small.
596By contrast, many of the libraries and utilities had never been tuned.
597For example, we found utilities that spent 90% of their
598running time doing single character read system calls.
599Changing the utility to use the standard I/O library cut the
600running time by a factor of five!

--- 57 unchanged lines hidden (view full) ---

658.NH 3
659Mail System
660.PP
661The problems discussed in section 3.1.1 prompted significant work
662on the entire mail system. The first problem identified was a bug
663in the \fIsyslog\fP program. The mail delivery program, \fIsendmail\fP
664logs all mail transactions through this process with the 4.2BSD interprocess
665communication facilities. \fISyslog\fP then records the information in
666a log file. Unfortunately, \fIsyslog\fP was performing a \fIsync\fP
667operation after each message it received, whether it was logged to a file
668or not. This wreaked havoc on the effectiveness of the
669buffer cache and explained, to a large
670extent, why sending mail to large distribution lists generated such a
671heavy load on the system (one syslog message was generated for each
672message recipient causing almost a continuous sequence of sync operations).
673.PP
674The hashed data base files were
675installed in all mail programs, resulting in an order of magnitude
676speedup on large distribution lists. The code in \fI/bin/mail\fP
677that notifies the \fIcomsat\fP program when mail has been delivered to
678a user was changed to cache host table lookups, resulting in a similar
679speedup on large distribution lists.
680.PP
681Next, the file locking facilities
682provided in 4.2BSD, \fIflock\fP\|(2), were used in place of the old
683locking mechanism.
684The mail system previously used \fIlink\fP and \fIunlink\fP in
685implementing file locking primitives.
686Because these operations usually modify the contents of directories
687they require synchronous disk operations and cannot take
688advantage of the name cache maintained by the system.
689Unlink requires that the entry be found in the directory so that
690it can be removed;
691link requires that the directory be scanned to insure that the name
692does not already exist.
693By contrast the advisory locking facility in 4.2BSD is
694efficient because it is all done with in-memory tables.
695Thus, the mail system was modified to use the file locking primitives.
696This yielded another 10% cut in the basic overhead of delivering mail.
697Extensive profiling and tuning of \fIsendmail\fP and
698compiling it without debugging code reduced the overhead by another 20%.
699.NH 3
700Network Servers
701.PP
702With the introduction of the network facilities in 4.2BSD,
703a myriad of services became available, each of which
704required its own daemon process.
705Many of these daemons were rarely if ever used,
706yet they lay asleep in the process table consuming
707system resources and generally slowing down response.
708Rather than having many servers started at boot time, a single server,
709\fIinetd\fP was substituted.
710This process reads a simple configuration file
711that specifies the services the system is willing to support

--- 65 unchanged lines hidden ---