<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>

<book id="LKLockingGuide">
 <bookinfo>
  <title>Unreliable Guide To Locking</title>

  <authorgroup>
   <author>
    <firstname>Paul</firstname>
    <othername>Rusty</othername>
    <surname>Russell</surname>
    <affiliation>
     <address>
      <email>rusty@rustcorp.com.au</email>
     </address>
    </affiliation>
   </author>
  </authorgroup>

  <copyright>
   <year>2000</year>
   <holder>Paul Russell</holder>
  </copyright>

  <legalnotice>
   <para>
     This documentation is free software; you can redistribute
     it and/or modify it under the terms of the GNU General Public
     License as published by the Free Software Foundation; either
     version 2 of the License, or (at your option) any later
     version.
   </para>

   <para>
     This program is distributed in the hope that it will be
     useful, but WITHOUT ANY WARRANTY; without even the implied
     warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
     See the GNU General Public License for more details.
   </para>

   <para>
     You should have received a copy of the GNU General Public
     License along with this program; if not, write to the Free
     Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
     MA 02111-1307 USA
   </para>

   <para>
     For more details see the file COPYING in the source
     distribution of Linux.
   </para>
  </legalnotice>
 </bookinfo>

 <toc></toc>
  <chapter id="intro">
   <title>Introduction</title>
   <para>
     Welcome to Rusty's Remarkably Unreliable Guide to Kernel
     Locking issues.  This document describes the locking systems in
     the Linux Kernel as we approach 2.4.
   </para>
   <para>
     It looks like <firstterm linkend="gloss-smp"><acronym>SMP</acronym>
     </firstterm> is here to stay; so everyone hacking on the kernel
     these days needs to know the fundamentals of concurrency and locking
     for SMP.
   </para>

   <sect1 id="races">
    <title>The Problem With Concurrency</title>
    <para>
      (Skip this if you know what a Race Condition is).
    </para>
    <para>
      In a normal program, you can increment a counter like so:
    </para>
    <programlisting>
      very_important_count++;
    </programlisting>

    <para>
      This is what you would expect to happen:
    </para>

    <table>
     <title>Expected Results</title>

     <tgroup cols=2 align=left>

      <thead>
       <row>
        <entry>Instance 1</entry>
        <entry>Instance 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>read very_important_count (5)</entry>
        <entry></entry>
       </row>
       <row>
        <entry>add 1 (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry>write very_important_count (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>read very_important_count (6)</entry>
       </row>
       <row>
        <entry></entry>
        <entry>add 1 (7)</entry>
       </row>
       <row>
        <entry></entry>
        <entry>write very_important_count (7)</entry>
       </row>
      </tbody>

     </tgroup>
    </table>

    <para>
     This is what might happen:
    </para>

    <table>
     <title>Possible Results</title>

     <tgroup cols=2 align=left>
      <thead>
       <row>
        <entry>Instance 1</entry>
        <entry>Instance 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>read very_important_count (5)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>read very_important_count (5)</entry>
       </row>
       <row>
        <entry>add 1 (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>add 1 (6)</entry>
       </row>
       <row>
        <entry>write very_important_count (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>write very_important_count (6)</entry>
       </row>
      </tbody>
     </tgroup>
    </table>

    <para>
      This overlap, where what actually happens depends on the
      relative timing of multiple tasks, is called a race condition.
      The piece of code containing the concurrency issue is called a
      critical region.  And especially since Linux started running
      on SMP machines, race conditions have become one of the major
      issues in kernel design and implementation.
    </para>
    <para>
      The solution is to recognize when these simultaneous accesses
      occur, and use locks to make sure that only one instance can
      enter the critical region at any time.  There are many
      friendly primitives in the Linux kernel to help you do this.
      And then there are the unfriendly primitives, but I'll pretend
      they don't exist.
    </para>
   </sect1>
  </chapter>

  <chapter id="locks">
   <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>

   <para>
     There are two main types of kernel locks.  The fundamental type
     is the spinlock
     (<filename class=headerfile>include/asm/spinlock.h</filename>),
     which is a very simple single-holder lock: if you can't get the
     spinlock, you keep trying (spinning) until you can.  Spinlocks are
     very small and fast, and can be used anywhere.
   </para>
   <para>
     The second type is a semaphore
     (<filename class=headerfile>include/asm/semaphore.h</filename>): it
     can have more than one holder at any time (the number decided at
     initialization time), although it is most commonly used as a
     single-holder lock (a mutex).  If you can't get a semaphore,
     your task will put itself on the queue, and be woken up when the
     semaphore is released.  This means the CPU will do something
     else while you are waiting, but there are many cases when you
     simply can't sleep (see <xref linkend="sleeping-things">), and so
     have to use a spinlock instead.
   </para>
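   <para>
     As a minimal sketch of both (the lock and semaphore names here
     are invented for illustration):
   </para>

   <programlisting>
        static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
        static DECLARE_MUTEX(my_sem);   /* a semaphore initialized to 1 */

        /* Usable from almost any context: spins until the lock is free. */
        spin_lock(&amp;my_lock);
        /* ... critical region ... */
        spin_unlock(&amp;my_lock);

        /* User context only: sleeps until the semaphore is free. */
        down(&amp;my_sem);
        /* ... critical region ... */
        up(&amp;my_sem);
   </programlisting>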
   <para>
     Neither type of lock is recursive: see
     <xref linkend="techniques-deadlocks">.
   </para>

   <sect1 id="uniprocessor">
    <title>Locks and Uniprocessor Kernels</title>

    <para>
      For kernels compiled without <symbol>CONFIG_SMP</symbol>, spinlocks
      do not exist at all.  This is an excellent design decision: when
      no-one else can run at the same time, there is no reason to
      have a lock at all.
    </para>

    <para>
      You should always test your locking code with <symbol>CONFIG_SMP</symbol>
      enabled, even if you don't have an SMP test box, because it
      will still catch some (simple) kinds of deadlock.
    </para>

    <para>
      Semaphores still exist, because they are required for
      synchronization between <firstterm linkend="gloss-usercontext">user
      contexts</firstterm>, as we will see below.
    </para>
   </sect1>

   <sect1 id="rwlocks">
    <title>Read/Write Lock Variants</title>

    <para>
      Both spinlocks and semaphores have read/write variants:
      <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
      These divide users into two classes: the readers and the writers.  If
      you are only reading the data, you can get a read lock, but to write to
      the data you need the write lock.  Many people can hold a read lock,
      but a writer must be the sole holder.
    </para>

    <para>
      This means much smoother locking if your code divides up
      neatly along reader/writer lines.  All the discussions below
      also apply to read/write variants.
    </para>
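    <para>
      As a sketch, the spinlock variant looks like this (the lock name
      is invented; the semaphore variant uses
      <function>down_read()</function>/<function>up_read()</function> and
      <function>down_write()</function>/<function>up_write()</function>
      analogously):
    </para>

    <programlisting>
        static rwlock_t foo_lock = RW_LOCK_UNLOCKED;

        /* Many readers can hold this at once... */
        read_lock(&amp;foo_lock);
        /* ... examine the data ... */
        read_unlock(&amp;foo_lock);

        /* ...but a writer must be the sole holder. */
        write_lock(&amp;foo_lock);
        /* ... modify the data ... */
        write_unlock(&amp;foo_lock);
    </programlisting>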
   </sect1>

    <sect1 id="usercontextlocking">
     <title>Locking Only In User Context</title>

     <para>
       If you have a data structure which is only ever accessed from
       user context, then you can use a simple semaphore
       (<filename class=headerfile>include/asm/semaphore.h</filename>) to
       protect it.  This is the most trivial case: you initialize the
       semaphore to the number of resources available (usually 1), and call
       <function>down_interruptible()</function> to grab the semaphore, and
       <function>up()</function> to release it.  There is also a
       <function>down()</function>, which should be avoided, because it
       will not return if a signal is received.
     </para>
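     <para>
       A sketch of the usual pattern (the semaphore name and the
       function are hypothetical):
     </para>

     <programlisting>
        static DECLARE_MUTEX(foo_sem);

        int foo_operation(void)
        {
                /* Returns non-zero if a signal arrived while waiting. */
                if (down_interruptible(&amp;foo_sem))
                        return -EINTR;

                /* ... manipulate the protected data ... */

                up(&amp;foo_sem);
                return 0;
        }
     </programlisting>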

     <para>
       Example: <filename>linux/net/core/netfilter.c</filename> allows
       registration of new <function>setsockopt()</function> and
       <function>getsockopt()</function> calls, with
       <function>nf_register_sockopt()</function>.  Registration and
       de-registration are only done on module load and unload (and boot
       time, where there is no concurrency), and the list of registrations
       is only consulted for an unknown <function>setsockopt()</function>
       or <function>getsockopt()</function> system call.  The
       <varname>nf_sockopt_mutex</varname> is perfect to protect this,
       especially since the setsockopt and getsockopt calls may well
       sleep.
     </para>
   </sect1>

   <sect1 id="lock-user-bh">
    <title>Locking Between User Context and BHs</title>

    <para>
      If a <firstterm linkend="gloss-bh">bottom half</firstterm> shares
      data with user context, you have two problems.  Firstly, the current
      user context can be interrupted by a bottom half, and secondly, the
      critical region could be entered from another CPU.  This is where
      <function>spin_lock_bh()</function>
      (<filename class=headerfile>include/linux/spinlock.h</filename>) is
      used.  It disables bottom halves on that CPU, then grabs the lock.
      <function>spin_unlock_bh()</function> does the reverse.
    </para>
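    <para>
      A sketch of both sides (the lock name is invented):
    </para>

    <programlisting>
        static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;

        /* User context: block bottom halves locally, then lock. */
        spin_lock_bh(&amp;foo_lock);
        /* ... touch the shared data ... */
        spin_unlock_bh(&amp;foo_lock);

        /* In the bottom half itself a plain spin_lock() is enough:
           it only has to keep out other CPUs. */
        spin_lock(&amp;foo_lock);
        /* ... touch the shared data ... */
        spin_unlock(&amp;foo_lock);
    </programlisting>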

    <para>
      This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
      </acronym></firstterm> as well: the spin lock vanishes, and this macro
      simply becomes <function>local_bh_disable()</function>
      (<filename class=headerfile>include/asm/softirq.h</filename>), which
      protects you from the bottom half being run.
    </para>
   </sect1>

   <sect1 id="lock-user-tasklet">
    <title>Locking Between User Context and Tasklets/Soft IRQs</title>

    <para>
      This is exactly the same as above, because
      <function>local_bh_disable()</function> actually disables all
      softirqs and <firstterm linkend="gloss-tasklet">tasklets</firstterm>
      on that CPU as well.  It should really be
      called `local_softirq_disable()', but the name has been preserved
      for historical reasons.  Similarly,
      <function>spin_lock_bh()</function> would now be called
      `spin_lock_softirq()' in a perfect world.
    </para>
   </sect1>

   <sect1 id="lock-bh">
    <title>Locking Between Bottom Halves</title>

    <para>
      Sometimes a bottom half might want to share data with
      another bottom half (especially remember that timers are run
      off a bottom half).
    </para>

    <sect2 id="lock-bh-same">
     <title>The Same BH</title>

     <para>
       Since a bottom half is never run on two CPUs at once, you
       don't need to worry about your bottom half being run twice
       at once, even on SMP.
     </para>
    </sect2>

    <sect2 id="lock-bh-different">
     <title>Different BHs</title>

     <para>
       Since only one bottom half ever runs at a time, you
       don't need to worry about race conditions with other bottom
       halves.  Beware that things might change under you, however,
       if someone changes your bottom half to a tasklet.  If you
       want to make your code future-proof, pretend you're already
       running from a tasklet (see below), and doing the extra
       locking.  Of course, if it's five years before that happens,
       you're gonna look like a damn fool.
     </para>
    </sect2>
   </sect1>

   <sect1 id="lock-tasklets">
    <title>Locking Between Tasklets</title>

    <para>
      Sometimes a tasklet might want to share data with another
      tasklet, or a bottom half.
    </para>

    <sect2 id="lock-tasklets-same">
     <title>The Same Tasklet</title>
     <para>
       Since a tasklet is never run on two CPUs at once, you don't
       need to worry about your tasklet being reentrant (running
       twice at once), even on SMP.
     </para>
    </sect2>

    <sect2 id="lock-tasklets-different">
     <title>Different Tasklets</title>
     <para>
       If another tasklet (or bottom half, such as a timer) wants
       to share data with your tasklet, you will both need to use
       <function>spin_lock()</function> and
       <function>spin_unlock()</function> calls.
       <function>spin_lock_bh()</function> is
       unnecessary here, as you are already in a tasklet, and
       none will be run on the same CPU.
     </para>
    </sect2>
   </sect1>

   <sect1 id="lock-softirqs">
    <title>Locking Between Softirqs</title>

    <para>
      Often a <firstterm linkend="gloss-softirq">softirq</firstterm> might
      want to share data with itself, a tasklet, or a bottom half.
    </para>

    <sect2 id="lock-softirqs-same">
     <title>The Same Softirq</title>

     <para>
       The same softirq can run on other CPUs: you can use a
       per-CPU array (see <xref linkend="per-cpu">) for better
       performance.  If you're going so far as to use a softirq,
       you probably care about scalable performance enough
       to justify the extra complexity.
     </para>

     <para>
       You'll need to use <function>spin_lock()</function> and
       <function>spin_unlock()</function> for shared data.
     </para>
    </sect2>

    <sect2 id="lock-softirqs-different">
     <title>Different Softirqs</title>

     <para>
       You'll need to use <function>spin_lock()</function> and
       <function>spin_unlock()</function> for shared data, whether it
       be a timer (which can be running on a different CPU), bottom half,
       tasklet or the same or another softirq.
     </para>
    </sect2>
   </sect1>
  </chapter>

  <chapter id="hardirq-context">
   <title>Hard IRQ Context</title>

   <para>
     Hardware interrupts usually communicate with a bottom half,
     tasklet or softirq.  Frequently this involves putting work in a
     queue, which the BH/softirq will take out.
   </para>

   <sect1 id="hardirq-softirq">
    <title>Locking Between Hard IRQ and Softirqs/Tasklets/BHs</title>

    <para>
      If a hardware irq handler shares data with a softirq, you have
      two concerns.  Firstly, the softirq processing can be
      interrupted by a hardware interrupt, and secondly, the
      critical region could be entered by a hardware interrupt on
      another CPU.  This is where <function>spin_lock_irq()</function> is
      used.  It is defined to disable interrupts on that CPU, then grab
      the lock. <function>spin_unlock_irq()</function> does the reverse.
    </para>

    <para>
      This works perfectly for UP as well: the spin lock vanishes,
      and this macro simply becomes <function>local_irq_disable()</function>
      (<filename class=headerfile>include/asm/smp.h</filename>), which
      protects you from the softirq/tasklet/BH being run.
    </para>

    <para>
      <function>spin_lock_irqsave()</function>
      (<filename>include/linux/spinlock.h</filename>) is a variant
      which saves whether interrupts were on or off in a flags word,
      which is passed to <function>spin_unlock_irqrestore()</function>.  This
      means that the same code can be used inside a hard irq handler (where
      interrupts are already off) and in softirqs (where the irq
      disabling is required).
    </para>
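    <para>
      A sketch of the flags-saving variant (the lock is hypothetical):
    </para>

    <programlisting>
        static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;
        unsigned long flags;

        spin_lock_irqsave(&amp;foo_lock, flags);
        /* ... critical region, safe against the irq handler ... */
        spin_unlock_irqrestore(&amp;foo_lock, flags);
    </programlisting>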
   </sect1>
  </chapter>

  <chapter id="common-techniques">
   <title>Common Techniques</title>

   <para>
     This section lists some common dilemmas and the standard
     solutions used in the Linux kernel code.  If you use these,
     people will find your code simpler to understand.
   </para>

   <para>
     If I could give you one piece of advice: never sleep with anyone
     crazier than yourself.  But if I had to give you advice on
     locking: <emphasis>keep it simple</emphasis>.
   </para>

   <para>
     Lock data, not code.
   </para>

   <para>
     Be reluctant to introduce new locks.
   </para>

   <para>
     Strangely enough, this is the exact reverse of my advice when
     you <emphasis>have</emphasis> slept with someone crazier than yourself.
   </para>

   <sect1 id="techniques-no-writers">
    <title>No Writers in Interrupt Context</title>

    <para>
      There is a fairly common case where an interrupt handler needs
      access to a critical region, but does not need write access.
      In this case, you do not need to use
      <function>read_lock_irq()</function>, but only
      <function>read_lock()</function> everywhere (since if an interrupt
      occurs, the irq handler will only try to grab a read lock, which
      won't deadlock).  You will still need to use
      <function>write_lock_irq()</function>.
    </para>

    <para>
      Similar logic applies to locking between softirqs/tasklets/BHs
      which never need a write lock, and user context:
      <function>read_lock()</function> and
      <function>write_lock_bh()</function>.
    </para>
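    <para>
      A sketch of the first arrangement (names invented):
    </para>

    <programlisting>
        static rwlock_t foo_lock = RW_LOCK_UNLOCKED;

        /* Interrupt handler: read-only, so no need to block irqs. */
        read_lock(&amp;foo_lock);
        /* ... look something up ... */
        read_unlock(&amp;foo_lock);

        /* User context, writing: must keep the irq handler out. */
        write_lock_irq(&amp;foo_lock);
        /* ... modify the data ... */
        write_unlock_irq(&amp;foo_lock);
    </programlisting>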
   </sect1>

   <sect1 id="techniques-deadlocks">
    <title>Deadlock: Simple and Advanced</title>

    <para>
      There is a coding bug where a piece of code tries to grab a
      spinlock twice: it will spin forever, waiting for the lock to
      be released (spinlocks, rwlocks and semaphores are not
      recursive in Linux).  This is trivial to diagnose: not a
      stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
      problem.
    </para>

    <para>
      For a slightly more complex case, imagine you have a region
      shared by a bottom half and user context.  If you use a
      <function>spin_lock()</function> call to protect it, it is
      possible that the user context will be interrupted by the bottom
      half while it holds the lock, and the bottom half will then spin
      forever trying to get the same lock.
    </para>

    <para>
      Both of these are called deadlock, and as shown above, it can
      occur even with a single CPU (although not on UP compiles,
      since spinlocks vanish on kernel compiles with
      <symbol>CONFIG_SMP</symbol>=n; you'll still get data corruption
      in the second example).
    </para>

    <para>
      This complete lockup is easy to diagnose: on SMP boxes the
      watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
      (<filename>include/linux/spinlock.h</filename>) will show this up
      immediately when it happens.
    </para>

    <para>
      A more complex problem is the so-called `deadly embrace',
      involving two or more locks.  Say you have a hash table: each
      entry in the table is a spinlock, and a chain of hashed
      objects.  Inside a softirq handler, you sometimes want to
      alter an object from one place in the hash to another: you
      grab the spinlock of the old hash chain and the spinlock of
      the new hash chain, and delete the object from the old one,
      and insert it in the new one.
    </para>

    <para>
      There are two problems here.  First, if your code ever
      tries to move the object to the same chain, it will deadlock
      with itself as it tries to lock it twice.  Secondly, if the
      same softirq on another CPU is trying to move another object
      in the reverse direction, the following could happen:
    </para>

    <table>
     <title>Consequences</title>

     <tgroup cols=2 align=left>

      <thead>
       <row>
        <entry>CPU 1</entry>
        <entry>CPU 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>Grab lock A -&gt; OK</entry>
        <entry>Grab lock B -&gt; OK</entry>
       </row>
       <row>
        <entry>Grab lock B -&gt; spin</entry>
        <entry>Grab lock A -&gt; spin</entry>
       </row>
      </tbody>
     </tgroup>
    </table>

    <para>
      The two CPUs will spin forever, waiting for the other to give up
      their lock.  It will look, smell, and feel like a crash.
    </para>

    <sect2 id="techs-deadlock-prevent">
     <title>Preventing Deadlock</title>

     <para>
       Textbooks will tell you that if you always lock in the same
       order, you will never get this kind of deadlock.  Practice
       will tell you that this approach doesn't scale: when I
       create a new lock, I don't understand enough of the kernel
       to figure out where in the 5000 lock hierarchy it will fit.
     </para>

     <para>
       The best locks are encapsulated: they never get exposed in
       headers, and are never held around calls to non-trivial
       functions outside the same file.  You can read through this
       code and see that it will never deadlock, because it never
       tries to grab another lock while it has that one.  People
       using your code don't even need to know you are using a
       lock.
     </para>

     <para>
       A classic problem here is when you provide callbacks or
       hooks: if you call these with the lock held, you risk simple
       deadlock, or a deadly embrace (who knows what the callback
       will do?).  Remember, the other programmers are out to get
       you, so don't do this.
     </para>
    </sect2>

    <sect2 id="techs-deadlock-overprevent">
     <title>Overzealous Prevention Of Deadlocks</title>

     <para>
       Deadlocks are problematic, but not as bad as data
       corruption.  Code which grabs a read lock, searches a list,
       fails to find what it wants, drops the read lock, grabs a
       write lock and inserts the object has a race condition.
     </para>

     <para>
       If you don't see why, please stay the fuck away from my code.
     </para>
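     <para>
       For the terminally curious, the broken shape is (a sketch, with
       hypothetical list helpers):
     </para>

     <programlisting>
        read_lock(&amp;list_lock);
        obj = search_list(&amp;list, key);
        read_unlock(&amp;list_lock);
        /* RACE: another CPU can insert `key' right here. */
        if (!obj) {
                write_lock(&amp;list_lock);
                insert_list(&amp;list, new_obj(key));
                write_unlock(&amp;list_lock);
        }
     </programlisting>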
    </sect2>
   </sect1>

   <sect1 id="per-cpu">
    <title>Per-CPU Data</title>

    <para>
      A great technique for avoiding locking which is used fairly
      widely is to duplicate information for each CPU.  For example,
      if you wanted to keep a count of a common condition, you could
      use a spin lock and a single counter.  Nice and simple.
    </para>

    <para>
      If that was too slow [it's probably not], you could instead
      use a counter for each CPU [don't], then none of them need an
      exclusive lock [you're wasting your time here].  To make sure
      the CPUs don't have to synchronize caches all the time, align
      the counters to cache boundaries by appending
      `__cacheline_aligned' to the declaration
      (<filename class=headerfile>include/linux/cache.h</filename>).
      [Can't you think of anything better to do?]
    </para>

    <para>
      Each CPU will still need a read lock to access its own counter,
      however.  That way you can use a write lock to grant exclusive
      access to all of the counters at once, to tally them up.
    </para>
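    <para>
      A sketch of the arrangement just described (all names invented):
    </para>

    <programlisting>
        struct foo_count {
                long counter;
        } __cacheline_aligned;          /* one cache line per counter */

        static struct foo_count foo_counts[NR_CPUS];
        static rwlock_t foo_counts_lock = RW_LOCK_UNLOCKED;

        /* Each CPU bumps its own counter under the read lock: */
        read_lock(&amp;foo_counts_lock);
        foo_counts[smp_processor_id()].counter++;
        read_unlock(&amp;foo_counts_lock);

        /* Tallying takes the write lock, freezing them all: */
        {
                long total = 0;
                int i;

                write_lock(&amp;foo_counts_lock);
                for (i = 0; i &lt; NR_CPUS; i++)
                        total += foo_counts[i].counter;
                write_unlock(&amp;foo_counts_lock);
        }
    </programlisting>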
   </sect1>

   <sect1 id="brlock">
    <title>Big Reader Locks</title>

    <para>
      A classic example of per-CPU information is Ingo's `big
      reader' locks
      (<filename class=headerfile>linux/include/brlock.h</filename>).  These
      use the Per-CPU Data techniques described above to create a lock which
      is very fast to get a read lock, but agonizingly slow for a write
      lock.
    </para>

    <para>
      Fortunately, there are a limited number of these locks
      available: you have to go through a strict interview process
      to get one.
    </para>
   </sect1>

   <sect1 id="lock-avoidance-rw">
    <title>Avoiding Locks: Read And Write Ordering</title>

    <para>
      Sometimes it is possible to avoid locking.  Consider the
      following case from the 2.2 firewall code, which inserted an
      element into a singly-linked list in user context:
    </para>

    <programlisting>
        new-&gt;next = i-&gt;next;
        i-&gt;next = new;
    </programlisting>


    <para>
      Here the author (Alan Cox, who knows what he's doing) assumes
      that the pointer assignments are atomic.  This is important,
      because networking packets would traverse this list on bottom
      halves without a lock.  Depending on their exact timing, they
      would either see the new element in the list with a valid
      <structfield>next</structfield> pointer, or it would not be in the
      list yet.  A lock is still required against other CPUs inserting
      or deleting from the list, of course.
    </para>

    <para>
      Of course, the writes <emphasis>must</emphasis> be in this
      order, otherwise the new element appears in the list with an
      invalid <structfield>next</structfield> pointer, and any other
      CPU iterating at the wrong time will jump through it into garbage.
      Because modern CPUs reorder, Alan's code actually reads as follows:
    </para>

    <programlisting>
        new-&gt;next = i-&gt;next;
        wmb();
        i-&gt;next = new;
    </programlisting>

    <para>
      The <function>wmb()</function> is a write memory barrier
      (<filename class=headerfile>include/asm/system.h</filename>): neither
      the compiler nor the CPU will allow any writes to memory after the
      <function>wmb()</function> to be visible to other hardware
      before any of the writes before the <function>wmb()</function>.
    </para>

    <para>
      As i386 does not do write reordering, this bug would never
      show up on that platform.  On other SMP platforms, however, it
      will.
    </para>

754
755    <para>
756      There is also <function>rmb()</function> for read ordering: to ensure 
757      any previous variable reads occur before following reads.  The simple
758      <function>mb()</function> macro combines both 
759      <function>rmb()</function> and <function>wmb()</function>.
760    </para>
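    <para>
      A classic pairing of the two, as a sketch (the variables and
      helpers here are hypothetical): the writer publishes a flag after
      the data, and the reader checks the flag before touching the data.
    </para>

    <programlisting>
        /* Writer: */
        shared_data = compute_it();
        wmb();                  /* data must be visible before flag */
        data_ready = 1;

        /* Reader: */
        if (data_ready) {
                rmb();          /* read flag before reading data */
                use(shared_data);
        }
    </programlisting>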

    <para>
      Some atomic operations are defined to act as a memory barrier
      (ie. as per the <function>mb()</function> macro), but if in
      doubt, be explicit.
      <!-- Rusty Russell 2 May 2001, 2.4.4 -->
      Also,
      spinlock operations act as partial barriers: operations after
      gaining a spinlock will never be moved to precede the
      <function>spin_lock()</function> call, and operations before
      releasing a spinlock will never be moved after the
      <function>spin_unlock()</function> call.
      <!-- Manfred Spraul <manfreds@colorfullife.com>
           24 May 2000 2.3.99-pre9 -->
    </para>
   </sect1>

   <sect1 id="lock-avoidance-atomic-ops">
    <title>Avoiding Locks: Atomic Operations</title>

    <para>
      There are a number of atomic operations defined in
      <filename class=headerfile>include/asm/atomic.h</filename>: these
      are guaranteed to be seen atomically from all CPUs in the system, thus
      avoiding races. If your shared data consists of a single counter, say,
      these operations might be simpler than using spinlocks (although for
      anything non-trivial using spinlocks is clearer).
    </para>
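    <para>
      A sketch of such a counter (the name is invented):
    </para>

    <programlisting>
        static atomic_t foo_count = ATOMIC_INIT(0);

        atomic_inc(&amp;foo_count);         /* safe from any context */

        if (atomic_dec_and_test(&amp;foo_count))
                printk("last reference gone\n");

        printk("count is now %d\n", atomic_read(&amp;foo_count));
    </programlisting>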

    <para>
      Note that atomic operations do not, in general, act as memory
      barriers. Instead you can add a memory barrier before or
      after <function>atomic_inc()</function> or
      <function>atomic_dec()</function> with
      <function>smp_mb__before_atomic_inc()</function>,
      <function>smp_mb__after_atomic_inc()</function>,
      <function>smp_mb__before_atomic_dec()</function> or
      <function>smp_mb__after_atomic_dec()</function>
      respectively. The advantage of using those macros instead of
      <function>smp_mb()</function> is that they are cheaper on some
      platforms.
      <!-- Sebastian Wilhelmi <seppi@seppi.de> 2002-03-04 -->
    </para>
   </sect1>

   <sect1 id="ref-counts">
    <title>Protecting A Collection of Objects: Reference Counts</title>

    <para>
      Locking a collection of objects is fairly easy: you get a
      single spinlock, and you make sure you grab it before
      searching, adding or deleting an object.
    </para>

    <para>
      The purpose of this lock is not to protect the individual
      objects: you might have a separate lock inside each one for
      that.  It is to protect the <emphasis>data structure
      containing the objects</emphasis> from race conditions.  Often
      the same lock is used to protect the contents of all the
      objects as well, for simplicity, but they are inherently
      orthogonal (and many other big words designed to confuse).
    </para>

    <para>
      Changing this to a read-write lock will often help markedly if
      reads are far more common than writes.  If not, there is
      another approach you can use to reduce the time the lock is
      held: reference counts.
    </para>

    <para>
      In this approach, an object has an owner, who sets the
      reference count to one.  Whenever you get a pointer to the
      object, you increment the reference count (a `get' operation).
      Whenever you relinquish a pointer, you decrement the reference
      count (a `put' operation).  When the owner wants to destroy
      it, they mark it dead, and do a put.
    </para>

    <para>
      Whoever drops the reference count to zero (usually implemented
      with <function>atomic_dec_and_test()</function>) actually cleans
      up and frees the object.
    </para>

    <para>
      This means that you are guaranteed that the object won't
      vanish underneath you, even though you no longer have a lock
      for the collection.
    </para>

    <para>
      Here's some skeleton code:
    </para>

    <programlisting>
        void create_foo(struct foo *x)
        {
                atomic_set(&amp;x-&gt;use, 1);
                spin_lock_bh(&amp;list_lock);
                ... insert in list ...
                spin_unlock_bh(&amp;list_lock);
        }

        struct foo *get_foo(int desc)
        {
                struct foo *ret;

                spin_lock_bh(&amp;list_lock);
                ... find in list ...
                if (ret) atomic_inc(&amp;ret-&gt;use);
                spin_unlock_bh(&amp;list_lock);

                return ret;
        }

        void put_foo(struct foo *x)
        {
                /* Last reference gone: free it. */
                if (atomic_dec_and_test(&amp;x-&gt;use))
                        kfree(x);
        }

        void destroy_foo(struct foo *x)
        {
                spin_lock_bh(&amp;list_lock);
                ... remove from list ...
                spin_unlock_bh(&amp;list_lock);

                put_foo(x);
        }
    </programlisting>

    <sect2 id="helpful-macros">
     <title>Macros To Help You</title>
     <para>
       There is a set of debugging macros tucked inside
       <filename class=headerfile>include/linux/netfilter_ipv4/lockhelp.h</filename>
       and <filename class=headerfile>listhelp.h</filename>: these are very
       useful for ensuring that locks are held in the right places to protect
       infrastructure.
     </para>
    </sect2>
   </sect1>

   <sect1 id="sleeping-things">
    <title>Things Which Sleep</title>

    <para>
      You can never call the following routines while holding a
      spinlock, as they may sleep.  This also means you need to be in
      user context.
    </para>

    <itemizedlist>
     <listitem>
      <para>
        Accesses to
        <firstterm linkend="gloss-userspace">userspace</firstterm>:
      </para>
      <itemizedlist>
       <listitem>
        <para>
          <function>copy_from_user()</function>
        </para>
       </listitem>
       <listitem>
        <para>
          <function>copy_to_user()</function>
        </para>
       </listitem>
       <listitem>
        <para>
          <function>get_user()</function>
        </para>
       </listitem>
       <listitem>
        <para>
          <function>put_user()</function>
        </para>
       </listitem>
      </itemizedlist>
     </listitem>

     <listitem>
      <para>
        <function>kmalloc(GFP_KERNEL)</function>
      </para>
     </listitem>

     <listitem>
      <para>
      <function>down_interruptible()</function> and
      <function>down()</function>
      </para>
      <para>
       There is a <function>down_trylock()</function> which can be
       used inside interrupt context, as it will not sleep.
       <function>up()</function> will also never sleep.
      </para>
     </listitem>
    </itemizedlist>

    <para>
     <function>printk()</function> can be called in
     <emphasis>any</emphasis> context, interestingly enough.
    </para>
   </sect1>

   <sect1 id="sparc">
    <title>The Fucked Up Sparc</title>

    <para>
      Alan Cox says <quote>the irq disable/enable is in the register
      window on a sparc</quote>.  Andi Kleen says <quote>when you do
      restore_flags in a different function you mess up all the
      register windows</quote>.
    </para>

    <para>
      So never pass the flags word set by
      <function>spin_lock_irqsave()</function> and brethren to another
      function (unless it's declared <type>inline</type>).  Usually no-one
      does this, but now you've been warned.  Dave Miller can never do
      anything in a straightforward manner (I can say that, because I have
      pictures of him and a certain PowerPC maintainer in a compromising
      position).
    </para>
   </sect1>

   <sect1 id="racing-timers">
    <title>Racing Timers: A Kernel Pastime</title>

    <para>
      Timers can produce their own special problems with races.
      Consider a collection of objects (list, hash, etc) where each
      object has a timer which is due to destroy it.
    </para>

    <para>
      If you want to destroy the entire collection (say on module
      removal), you might do the following:
    </para>

    <programlisting>
        /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
           HUNGARIAN NOTATION */
        spin_lock_bh(&amp;list_lock);

        while (list) {
                struct foo *next = list-&gt;next;
                del_timer(&amp;list-&gt;timer);
                kfree(list);
                list = next;
        }

        spin_unlock_bh(&amp;list_lock);
    </programlisting>

    <para>
      Sooner or later, this will crash on SMP, because a timer can
      have just gone off before the <function>spin_lock_bh()</function>,
      and it will only get the lock after we
      <function>spin_unlock_bh()</function>, and then try to free
      the element (which has already been freed!).
    </para>

    <para>
      This can be avoided by checking the result of
      <function>del_timer()</function>: if it returns
      <returnvalue>1</returnvalue>, the timer has been deleted.
      If <returnvalue>0</returnvalue>, it means (in this
      case) that it is currently running, so we can do:
    </para>

1035
1036    <programlisting>
1037        retry:  
1038                spin_lock_bh(&amp;list_lock);
1039
1040                while (list) {
1041                        struct foo *next = list-&gt;next;
1042                        if (!del_timer(&amp;list-&gt;timer)) {
1043                                /* Give timer a chance to delete this */
1044                                spin_unlock_bh(&amp;list_lock);
1045                                goto retry;
1046                        }
1047                        kfree(list);
1048                        list = next;
1049                }
1050
1051                spin_unlock_bh(&amp;list_lock);
1052    </programlisting>
1053
    <para>
      Another common problem is deleting timers which restart
      themselves (by calling <function>add_timer()</function> at the end
      of their timer function).  Because this is a fairly common case
      which is prone to races, you can put a call to
      <function>timer_exit()</function> at the very end of your timer
      function, and use <function>del_timer_sync()</function>
      (<filename class=headerfile>include/linux/timer.h</filename>)
      to handle this case.  It returns the number of times the timer
      had to be deleted before we finally stopped it from adding itself back
      in.
    </para>
   </sect1>
  </chapter>

  <chapter id="references">
   <title>Further reading</title>

   <itemizedlist>
    <listitem>
     <para>
       <filename>Documentation/spinlocks.txt</filename>:
       Linus Torvalds' spinlocking tutorial in the kernel sources.
     </para>
    </listitem>

    <listitem>
     <para>
       Unix Systems for Modern Architectures: Symmetric
       Multiprocessing and Caching for Kernel Programmers:
     </para>

     <para>
       Curt Schimmel's very good introduction to kernel level
       locking (not written for Linux, but nearly everything
       applies).  The book is expensive, but really worth every
       penny to understand SMP locking. [ISBN: 0201633388]
     </para>
    </listitem>
   </itemizedlist>
  </chapter>

  <chapter id="thanks">
    <title>Thanks</title>

    <para>
      Thanks to Telsa Gwynne for DocBooking, neatening and adding
      style.
    </para>

    <para>
      Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
      Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul and Tim
      Waugh for proofreading, correcting, flaming, commenting.
    </para>

    <para>
      Thanks to the cabal for having no influence on this document.
    </para>
  </chapter>

  <glossary id="glossary">
   <title>Glossary</title>

   <glossentry id="gloss-bh">
    <glossterm>bh</glossterm>
    <glossdef>
     <para>
       Bottom Half: for historical reasons, functions with
       `_bh' in them often now refer to any software interrupt, e.g.
       <function>spin_lock_bh()</function> blocks any software interrupt
       on the current CPU.  Bottom halves are deprecated, and will
       eventually be replaced by tasklets.  Only one bottom half will be
       running at any time.
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-hwinterrupt">
    <glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
    <glossdef>
     <para>
       Hardware interrupt request.  <function>in_irq()</function> returns
       <returnvalue>true</returnvalue> in a hardware interrupt handler (it
       also returns true when interrupts are blocked).
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-interruptcontext">
    <glossterm>Interrupt Context</glossterm>
    <glossdef>
     <para>
       Not user context: processing a hardware irq or software irq.
       Indicated by the <function>in_interrupt()</function> macro
       returning <returnvalue>true</returnvalue> (although it also
       returns true when interrupts or BHs are blocked).
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-smp">
    <glossterm><acronym>SMP</acronym></glossterm>
    <glossdef>
     <para>
       Symmetric Multi-Processor: kernels compiled for multiple-CPU
       machines.  (CONFIG_SMP=y).
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-softirq">
    <glossterm>softirq</glossterm>
    <glossdef>
     <para>
       Strictly speaking, one of up to 32 enumerated software
       interrupts which can run on multiple CPUs at once.
       Sometimes used to refer to tasklets and bottom halves as
       well (ie. all software interrupts).
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-swinterrupt">
    <glossterm>Software Interrupt / Software IRQ</glossterm>
    <glossdef>
     <para>
       Software interrupt handler.  <function>in_irq()</function> returns
       <returnvalue>false</returnvalue>; <function>in_softirq()</function>
       returns <returnvalue>true</returnvalue>.  Tasklets, softirqs and
       bottom halves all fall into the category of `software interrupts'.
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-tasklet">
    <glossterm>tasklet</glossterm>
    <glossdef>
     <para>
       A dynamically-registrable software interrupt,
       which is guaranteed to only run on one CPU at a time.
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-up">
    <glossterm><acronym>UP</acronym></glossterm>
    <glossdef>
     <para>
       Uni-Processor: Non-SMP.  (CONFIG_SMP=n).
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-usercontext">
    <glossterm>User Context</glossterm>
    <glossdef>
     <para>
       The kernel executing on behalf of a particular
       process or kernel thread (given by the <function>current()</function>
       macro).  Not to be confused with userspace.  Can be interrupted by
       software or hardware interrupts.
     </para>
    </glossdef>
   </glossentry>

   <glossentry id="gloss-userspace">
    <glossterm>Userspace</glossterm>
    <glossdef>
     <para>
       A process executing its own code outside the kernel.
     </para>
    </glossdef>
   </glossentry>

  </glossary>
</book>