<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>

<book id="lk-hacking-guide">
 <bookinfo>
  <title>Unreliable Guide To Hacking The Linux Kernel</title>

  <authorgroup>
   <author>
    <firstname>Paul</firstname>
    <othername>Rusty</othername>
    <surname>Russell</surname>
    <affiliation>
     <address>
      <email>rusty@rustcorp.com.au</email>
     </address>
    </affiliation>
   </author>
  </authorgroup>

  <copyright>
   <year>2001</year>
   <holder>Rusty Russell</holder>
  </copyright>

  <legalnotice>
   <para>
    This documentation is free software; you can redistribute
    it and/or modify it under the terms of the GNU General Public
    License as published by the Free Software Foundation; either
    version 2 of the License, or (at your option) any later
    version.
   </para>

   <para>
    This program is distributed in the hope that it will be
    useful, but WITHOUT ANY WARRANTY; without even the implied
    warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    See the GNU General Public License for more details.
   </para>

   <para>
    You should have received a copy of the GNU General Public
    License along with this program; if not, write to the Free
    Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
    MA 02111-1307 USA
   </para>

   <para>
    For more details see the file COPYING in the source
    distribution of Linux.
   </para>
  </legalnotice>

  <releaseinfo>
   This is the first release of this document as part of the kernel tarball.
  </releaseinfo>

 </bookinfo>

 <toc></toc>

 <chapter id="introduction">
  <title>Introduction</title>
  <para>
   Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
   Kernel Hacking. This document describes the common routines and
   general requirements for kernel code: its goal is to serve as a
   primer for Linux kernel development for experienced C
   programmers.
   I avoid implementation details: that's what the
   code is for, and I ignore whole tracts of useful routines.
  </para>
  <para>
   Before you read this, please understand that I never wanted to
   write this document, being grossly under-qualified, but I always
   wanted to read it, and this was the only way. I hope it will
   grow into a compendium of best practice, common starting points
   and random information.
  </para>
 </chapter>

 <chapter id="basic-players">
  <title>The Players</title>

  <para>
   At any time each of the CPUs in a system can be:
  </para>

  <itemizedlist>
   <listitem>
    <para>
     not associated with any process, serving a hardware interrupt;
    </para>
   </listitem>

   <listitem>
    <para>
     not associated with any process, serving a softirq, tasklet or bh;
    </para>
   </listitem>

   <listitem>
    <para>
     running in kernel space, associated with a process;
    </para>
   </listitem>

   <listitem>
    <para>
     running a process in user space.
    </para>
   </listitem>
  </itemizedlist>

  <para>
   There is a strict ordering between these: other than the last
   category (userspace) each can only be pre-empted by those above.
   For example, while a softirq is running on a CPU, no other
   softirq will pre-empt it, but a hardware interrupt can. However,
   any other CPUs in the system execute independently.
  </para>

  <para>
   We'll see a number of ways that the user context can block
   interrupts, to become truly non-preemptable.
  </para>

  <sect1 id="basics-usercontext">
   <title>User Context</title>

   <para>
    User context is when you are coming in from a system call or
    other trap: you can sleep, and you own the CPU (except for
    interrupts) until you call <function>schedule()</function>.
    In other words, user context (unlike userspace) is not pre-emptable.
   </para>

   <note>
    <para>
     You are always in user context on module load and unload,
     and on operations on the block device layer.
    </para>
   </note>

   <para>
    In user context, the <varname>current</varname> pointer (indicating
    the task we are currently executing) is valid, and
    <function>in_interrupt()</function>
    (<filename>include/asm/hardirq.h</filename>) is
    <returnvalue>false</returnvalue>.
   </para>

   <caution>
    <para>
     Beware that if you have interrupts or bottom halves disabled
     (see below), <function>in_interrupt()</function> will return a
     false positive.
    </para>
   </caution>
  </sect1>

  <sect1 id="basics-hardirqs">
   <title>Hardware Interrupts (Hard IRQs)</title>

   <para>
    Timer ticks, <hardware>network cards</hardware> and
    <hardware>keyboard</hardware> are examples of real
    hardware which produce interrupts at any time. The kernel runs
    interrupt handlers, which service the hardware. The kernel
    guarantees that a handler is never re-entered: if another
    interrupt arrives, it is queued (or dropped). Because it
    disables interrupts, the handler has to be fast: frequently it
    simply acknowledges the interrupt, marks a `software interrupt'
    for execution and exits.
   </para>

   <para>
    You can tell you are in a hardware interrupt, because
    <function>in_irq()</function> returns <returnvalue>true</returnvalue>.
   </para>
   <caution>
    <para>
     Beware that this will return a false positive if interrupts are disabled
     (see below).
    </para>
   </caution>
  </sect1>

  <sect1 id="basics-softirqs">
   <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>

   <para>
    Whenever a system call is about to return to userspace, or a
    hardware interrupt handler exits, any `software interrupts'
    which are marked pending (usually by hardware interrupts) are
    run (<filename>kernel/softirq.c</filename>).
   </para>

   <para>
    Much of the real interrupt handling work is done here. Early in
    the transition to <acronym>SMP</acronym>, there were only `bottom
    halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
    after we switched from wind-up computers made of match-sticks and snot,
    we abandoned this limitation.
   </para>

   <para>
    <filename class=headerfile>include/linux/interrupt.h</filename> lists the
    different BHs. No matter how many CPUs you have, no two BHs will run at
    the same time. This made the transition to SMP simpler, but sucks hard for
    scalable performance. A very important bottom half is the timer
    BH (<filename class=headerfile>include/linux/timer.h</filename>): you
    can register to have it call functions for you in a given length of time.
   </para>

   <para>
    2.3.43 introduced softirqs, and re-implemented the (now
    deprecated) BHs underneath them. Softirqs are fully-SMP
    versions of BHs: they can run on as many CPUs at once as
    required. This means they need to deal with any races in shared
    data using their own locks. A bitmask is used to keep track of
    which are enabled, so the 32 available softirqs should not be
    used up lightly. (<emphasis>Yes</emphasis>, people will
    notice).
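    As a rough picture of where the limit of 32 comes from, here is a
    userspace toy model (not kernel code) of a one-bit-per-softirq
    mask held in a 32-bit word. It tracks which slots are pending
    rather than which are enabled, but the same limit applies to any
    such mask. Every name beginning with
    <function>toy_</function> is hypothetical and exists only for
    this sketch.

<programlisting><![CDATA[
#include <stdint.h>

#define TOY_NR_SOFTIRQS 32      /* one bit per slot in a 32-bit word */

typedef void (*toy_handler_t)(void);

static uint32_t toy_pending;    /* bit n set: slot n wants to run */
static toy_handler_t toy_handlers[TOY_NR_SOFTIRQS];
static int toy_runs;            /* how many handlers have run so far */

/* Example handler: just counts invocations. */
static void toy_count(void)
{
        toy_runs++;
}

/* Mark slot nr pending; returns -1 if nr does not fit in the mask. */
static int toy_raise(unsigned int nr)
{
        if (nr >= TOY_NR_SOFTIRQS)
                return -1;
        toy_pending |= (uint32_t)1 << nr;
        return 0;
}

/* Run every pending slot once, lowest number first, clearing its bit. */
static void toy_run_pending(void)
{
        unsigned int nr;

        for (nr = 0; nr < TOY_NR_SOFTIRQS; nr++) {
                uint32_t bit = (uint32_t)1 << nr;

                if (!(toy_pending & bit))
                        continue;
                toy_pending &= ~bit;
                if (toy_handlers[nr])
                        toy_handlers[nr]();
        }
}
]]></programlisting>

    Raising slot 3 twice before calling
    <function>toy_run_pending()</function> still runs its handler
    only once, because the second <function>toy_raise()</function>
    finds the bit already set; a request for slot 32 is rejected
    because there is no bit left for it.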
   </para>

   <para>
    Tasklets (<filename class=headerfile>include/linux/interrupt.h</filename>)
    are like softirqs, except they are dynamically-registrable (meaning you
    can have as many as you want), and they also guarantee that any tasklet
    will only run on one CPU at any time, although different tasklets can
    run simultaneously (unlike different BHs).
   </para>
   <caution>
    <para>
     The name `tasklet' is misleading: they have nothing to do with `tasks',
     and probably more to do with some bad vodka Alexey Kuznetsov had at the
     time.
    </para>
   </caution>

   <para>
    You can tell you are in a softirq (or bottom half, or tasklet)
    using the <function>in_softirq()</function> macro
    (<filename class=headerfile>include/asm/softirq.h</filename>).
   </para>
   <caution>
    <para>
     Beware that this will return a false positive if a bh lock (see below)
     is held.
    </para>
   </caution>
  </sect1>
 </chapter>

 <chapter id="basic-rules">
  <title>Some Basic Rules</title>

  <variablelist>
   <varlistentry>
    <term>No memory protection</term>
    <listitem>
     <para>
      If you corrupt memory, whether in user context or
      interrupt context, the whole machine will crash. Are you
      sure you can't do what you want in userspace?
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>No floating point or <acronym>MMX</acronym></term>
    <listitem>
     <para>
      The <acronym>FPU</acronym> context is not saved; even in user
      context the <acronym>FPU</acronym> state probably won't
      correspond with the current process: you would mess with some
      user process' <acronym>FPU</acronym> state. If you really want
      to do this, you would have to explicitly save/restore the full
      <acronym>FPU</acronym> state (and avoid context switches). It
      is generally a bad idea; use fixed point arithmetic first.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>A rigid stack limit</term>
    <listitem>
     <para>
      The kernel stack is about 6K in 2.2 (for most
      architectures: it's about 14K on the Alpha), and shared
      with interrupts so you can't use it all. Avoid deep
      recursion and huge local arrays on the stack (allocate
      them dynamically instead).
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>The Linux kernel is portable</term>
    <listitem>
     <para>
      Let's keep it that way. Your code should be 64-bit clean,
      and endian-independent. You should also minimize CPU
      specific stuff, e.g. inline assembly should be cleanly
      encapsulated and minimized to ease porting. Generally it
      should be restricted to the architecture-dependent part of
      the kernel tree.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
 </chapter>

 <chapter id="ioctls">
  <title>ioctls: Not writing a new system call</title>

  <para>
   A system call generally looks like this:
  </para>

  <programlisting>
asmlinkage int sys_mycall(int arg)
{
        return 0;
}
  </programlisting>

  <para>
   First, in most cases you don't want to create a new system call.
   You create a character device and implement an appropriate ioctl
   for it. This is much more flexible than system calls, doesn't have
   to be entered in every architecture's
   <filename class=headerfile>include/asm/unistd.h</filename> and
   <filename>arch/kernel/entry.S</filename> file, and is much more
   likely to be accepted by Linus.
  </para>

  <para>
   If all your routine does is read or write some parameter, consider
   implementing a <function>sysctl</function> interface instead.
  </para>

  <para>
   Inside the ioctl you're in user context to a process.
   When an
   error occurs you return a negated errno (see
   <filename class=headerfile>include/linux/errno.h</filename>),
   otherwise you return <returnvalue>0</returnvalue>.
  </para>

  <para>
   After you have slept, you should check whether a signal occurred: the
   Unix/Linux way of handling signals is to temporarily exit the
   system call with the <constant>-ERESTARTSYS</constant> error. The
   system call entry code will switch back to user context, process
   the signal handler and then your system call will be restarted
   (unless the user disabled that). So you should be prepared to
   process the restart, e.g. if you're in the middle of manipulating
   some data structure.
  </para>

  <programlisting>
if (signal_pending(current))
        return -ERESTARTSYS;
  </programlisting>

  <para>
   If you're doing longer computations: first think userspace. If you
   <emphasis>really</emphasis> want to do it in kernel you should
   regularly check if you need to give up the CPU (remember there is
   cooperative multitasking per CPU). Idiom:
  </para>

  <programlisting>
if (current->need_resched)
        schedule(); /* Will sleep */
  </programlisting>

  <para>
   A short note on interface design: the UNIX system call motto is
   "Provide mechanism not policy".
  </para>
 </chapter>

 <chapter id="deadlock-recipes">
  <title>Recipes for Deadlock</title>

  <para>
   You cannot call any routines which may sleep, unless:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     You are in user context.
    </para>
   </listitem>

   <listitem>
    <para>
     You do not own any spinlocks.
    </para>
   </listitem>

   <listitem>
    <para>
     You have interrupts enabled (actually, Andi Kleen says
     that the scheduling code will enable them for you, but
     that's probably not what you wanted).
    </para>
   </listitem>
  </itemizedlist>

  <para>
   Note that some functions may sleep implicitly: common ones are
   the user space access functions (*_user) and memory allocation
   functions without <symbol>GFP_ATOMIC</symbol>.
  </para>

  <para>
   You will eventually lock up your box if you break these rules.
  </para>

  <para>
   Really.
  </para>
 </chapter>

 <chapter id="common-routines">
  <title>Common Routines</title>

  <sect1 id="routines-printk">
   <title>
    <function>printk()</function>
    <filename class=headerfile>include/linux/kernel.h</filename>
   </title>

   <para>
    <function>printk()</function> feeds kernel messages to the
    console, dmesg, and the syslog daemon. It is useful for debugging
    and reporting errors, and can be used inside interrupt context,
    but use with caution: a machine which has its console flooded with
    printk messages is unusable. It uses a format string mostly
    compatible with ANSI C printf, and C string concatenation to give
    it a first "priority" argument:
   </para>

   <programlisting>
printk(KERN_INFO "i = %u\n", i);
   </programlisting>

   <para>
    See <filename class=headerfile>include/linux/kernel.h</filename>
    for other KERN_ values; these are interpreted by syslog as the
    level. Special case: for printing an IP address use
   </para>

   <programlisting>
__u32 ipaddress;
printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
   </programlisting>

   <para>
    <function>printk()</function> internally uses a 1K buffer and does
    not catch overruns. Make sure that will be enough.
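    One defensive habit, sketched here in userspace C rather than
    kernel code, is to format into a buffer of the same size with a
    bounded routine and check for truncation before handing the text
    on. The helper name <function>fits_in_printk_buf()</function> and
    the 1024-byte constant are assumptions made for this illustration;
    as noted above, <function>printk()</function> itself performs no
    such check.

<programlisting><![CDATA[
#include <stdarg.h>
#include <stdio.h>

#define TOY_PRINTK_BUF 1024     /* assumed size, matching the "1K" above */

/*
 * Returns 1 if the formatted message (plus its terminating NUL) fits
 * in TOY_PRINTK_BUF bytes, 0 if it would be truncated or on error.
 */
static int fits_in_printk_buf(const char *fmt, ...)
{
        char buf[TOY_PRINTK_BUF];
        va_list args;
        int len;

        va_start(args, fmt);
        len = vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);

        /* vsnprintf() reports the length the full message would need. */
        return len >= 0 && (size_t)len < sizeof(buf);
}
]]></programlisting>

    A short message such as the <constant>KERN_INFO</constant> example
    above passes the check, while a format that expands to 2000
    characters does not.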
   </para>

   <note>
    <para>
     You will know when you are a real kernel hacker
     when you start typoing printf as printk in your user programs :)
    </para>
   </note>

   <!--- From the Lions book reader department -->

   <note>
    <para>
     Another sidenote: the original Unix Version 6 sources had a
     comment on top of its printf function: "Printf should not be
     used for chit-chat". You should follow that advice.
    </para>
   </note>
  </sect1>

  <sect1 id="routines-copy">
   <title>
    <function>copy_[to/from]_user()</function>
    /
    <function>get_user()</function>
    /
    <function>put_user()</function>
    <filename class=headerfile>include/asm/uaccess.h</filename>
   </title>

   <para>
    <emphasis>[SLEEPS]</emphasis>
   </para>

   <para>
    <function>put_user()</function> and <function>get_user()</function>
    are used to get and put single values (such as an int, char, or
    long) from and to userspace. A pointer into userspace should
    never be simply dereferenced: data should be copied using these
    routines. Both return <constant>-EFAULT</constant> or 0.
   </para>
   <para>
    <function>copy_to_user()</function> and
    <function>copy_from_user()</function> are more general: they copy
    an arbitrary amount of data to and from userspace.
    <caution>
     <para>
      Unlike <function>put_user()</function> and
      <function>get_user()</function>, they return the amount of
      uncopied data (i.e. <returnvalue>0</returnvalue> still means
      success).
     </para>
    </caution>
    [Yes, this moronic interface makes me cringe. Please submit a
    patch and become my hero --RR.]
   </para>
   <para>
    The functions may sleep implicitly. They should never be called
    outside user context (it makes no sense), with interrupts
    disabled, or a spinlock held.
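    To make the two return conventions concrete, the following
    userspace mock-up (not kernel code) mimics them:
    <function>mock_put_user()</function> returns 0 or
    <constant>-EFAULT</constant>, while
    <function>mock_copy_to_user()</function> returns the number of
    bytes it could not copy. All three functions, and the idea of
    modelling a bad user pointer as NULL, are inventions for this
    sketch. A caller usually collapses a non-zero result from the
    copy routine into <constant>-EFAULT</constant>, as
    <function>mock_ioctl_reply()</function> does.

<programlisting><![CDATA[
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Mock: a NULL destination stands in for a faulting user pointer. */
static int mock_put_user(int val, int *uptr)
{
        if (!uptr)
                return -EFAULT;
        *uptr = val;
        return 0;
}

/* Mock: returns the number of bytes NOT copied (0 means success). */
static unsigned long mock_copy_to_user(void *to, const void *from,
                                       unsigned long n)
{
        if (!to)
                return n;
        memcpy(to, from, n);
        return 0;
}

/* Caller idiom: turn "bytes left over" into a negated errno. */
static int mock_ioctl_reply(void *ubuf, const void *kbuf, unsigned long n)
{
        if (mock_copy_to_user(ubuf, kbuf, n))
                return -EFAULT;
        return 0;
}
]]></programlisting>

    Testing the copy routine's result with a plain
    <literal>if</literal> is enough here, since any non-zero count
    means part of the data did not arrive.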
   </para>
  </sect1>

  <sect1 id="routines-kmalloc">
   <title><function>kmalloc()</function>/<function>kfree()</function>
    <filename class=headerfile>include/linux/slab.h</filename></title>

   <para>
    <emphasis>[MAY SLEEP: SEE BELOW]</emphasis>
   </para>

   <para>
    These routines are used to dynamically request pointer-aligned
    chunks of memory, like malloc and free do in userspace, but
    <function>kmalloc()</function> takes an extra flag word.
    Important values:
   </para>

   <variablelist>
    <varlistentry>
     <term>
      <constant>
       GFP_KERNEL
      </constant>
     </term>
     <listitem>
      <para>
       May sleep and swap to free memory. Only allowed in user
       context, but is the most reliable way to allocate memory.
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>
      <constant>
       GFP_ATOMIC
      </constant>
     </term>
     <listitem>
      <para>
       Don't sleep. Less reliable than <constant>GFP_KERNEL</constant>,
       but may be called from interrupt context. You should
       <emphasis>really</emphasis> have a good out-of-memory
       error-handling strategy.
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>
      <constant>
       GFP_DMA
      </constant>
     </term>
     <listitem>
      <para>
       Allocate memory usable for ISA DMA (below 16MB). If you don't
       know what that is you don't need it. Very unreliable.
      </para>
     </listitem>
    </varlistentry>
   </variablelist>

   <para>
    If you see a <errorname>kmem_grow: Called nonatomically from int
    </errorname> warning message you called a memory allocation function
    from interrupt context without <constant>GFP_ATOMIC</constant>.
    You should really fix that. Run, don't walk.
   </para>

   <para>
    If you are allocating at least <constant>PAGE_SIZE</constant>
    (<filename class=headerfile>include/asm/page.h</filename>) bytes,
    consider using <function>__get_free_pages()</function>
    (<filename class=headerfile>include/linux/mm.h</filename>). It
    takes an order argument (0 for page sized, 1 for double page, 2
    for four pages etc.) and the same memory priority flag word as
    above.
   </para>

   <para>
    If you are allocating more than a page worth of bytes you can use
    <function>vmalloc()</function>. It'll allocate virtual memory in
    the kernel map. This block is not contiguous in physical memory,
    but the <acronym>MMU</acronym> makes it look like it is for you
    (so it'll only look contiguous to the CPUs, not to external device
    drivers). If you really need large physically contiguous memory
    for some weird device, you have a problem: it is poorly supported
    in Linux because after some time memory fragmentation in a running
    kernel makes it hard. The best way is to allocate the block early
    in the boot process via the <function>alloc_bootmem()</function>
    routine.
   </para>

   <para>
    Before inventing your own cache of often-used objects consider
    using a slab cache in
    <filename class=headerfile>include/linux/slab.h</filename>.
   </para>
  </sect1>

  <sect1 id="routines-current">
   <title><function>current</function>
    <filename class=headerfile>include/asm/current.h</filename></title>

   <para>
    This global variable (really a macro) contains a pointer to
    the current task structure, so is only valid in user context.
    For example, when a process makes a system call, this will
    point to the task structure of the calling process. It is
    <emphasis>not NULL</emphasis> in interrupt context.
   </para>
  </sect1>

  <sect1 id="routines-udelay">
   <title><function>udelay()</function>/<function>mdelay()</function>
    <filename class=headerfile>include/asm/delay.h</filename>
    <filename class=headerfile>include/linux/delay.h</filename>
   </title>

   <para>
    The <function>udelay()</function> function can be used for small pauses.
    Do not use large values with <function>udelay()</function> as you risk
    overflow - the helper function <function>mdelay()</function> is useful
    here, or even consider <function>schedule_timeout()</function>.
   </para>
  </sect1>

  <sect1 id="routines-endian">
   <title><function>cpu_to_be32()</function>/<function>be32_to_cpu()</function>/<function>cpu_to_le32()</function>/<function>le32_to_cpu()</function>
    <filename class=headerfile>include/asm/byteorder.h</filename>
   </title>

   <para>
    The <function>cpu_to_be32()</function> family (where the "32" can
    be replaced by 64 or 16, and the "be" can be replaced by "le") are
    the general way to do endian conversions in the kernel: they
    return the converted value. All variations supply the reverse as
    well: <function>be32_to_cpu()</function>, etc.
   </para>

   <para>
    There are two major variations of these functions: the pointer
    variation, such as <function>cpu_to_be32p()</function>, which takes
    a pointer to the given type and returns the converted value. The
    other variation is the "in-situ" family, such as
    <function>cpu_to_be32s()</function>, which converts the value
    referred to by the pointer in place and returns void.
   </para>
  </sect1>

  <sect1 id="routines-local-irqs">
   <title><function>local_irq_save()</function>/<function>local_irq_restore()</function>
    <filename class=headerfile>include/asm/system.h</filename>
   </title>

   <para>
    These routines disable hard interrupts on the local CPU, and
    restore them.
    They are reentrant, saving the previous state in
    their one <varname>unsigned long flags</varname> argument. If you
    know that interrupts are enabled, you can simply use
    <function>local_irq_disable()</function> and
    <function>local_irq_enable()</function>.
   </para>
  </sect1>

  <sect1 id="routines-softirqs">
   <title><function>local_bh_disable()</function>/<function>local_bh_enable()</function>
    <filename class=headerfile>include/asm/softirq.h</filename></title>

   <para>
    These routines disable soft interrupts on the local CPU, and
    restore them. They are reentrant; if soft interrupts were
    disabled before, they will still be disabled after this pair
    of functions has been called. They prevent softirqs, tasklets
    and bottom halves from running on the current CPU.
   </para>
  </sect1>

  <sect1 id="routines-processorids">
   <title><function>smp_processor_id</function>()/<function>cpu_[number/logical]_map()</function>
    <filename class=headerfile>include/asm/smp.h</filename></title>

   <para>
    <function>smp_processor_id()</function> returns the current
    processor number, between 0 and <symbol>NR_CPUS</symbol> (the
    maximum number of CPUs supported by Linux, currently 32). These
    values are not necessarily contiguous: to get a number between 0
    and <function>smp_num_cpus()</function> (the number of actual
    processors in this machine), the
    <function>cpu_number_map()</function> function is used to map the
    processor id to a logical number.
    <function>cpu_logical_map()</function> does the reverse.
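    A userspace toy model may help picture the two mappings. It
    assumes a hypothetical machine whose two processors report ids 0
    and 2; the table and every <function>toy_</function> name below
    are made up for this sketch and do not reflect how any
    architecture actually stores its maps.

<programlisting><![CDATA[
#define TOY_NR_CPUS  32         /* compile-time maximum, as in the text */
#define TOY_NUM_CPUS 2          /* processors actually present (assumed) */

/* logical number -> processor id; ids 0 and 2 are an assumed layout */
static const int toy_logical_to_id[TOY_NUM_CPUS] = { 0, 2 };

/* processor id -> logical number, or -1 if no such processor */
static int toy_cpu_number_map(int id)
{
        int i;

        if (id < 0 || id >= TOY_NR_CPUS)
                return -1;
        for (i = 0; i < TOY_NUM_CPUS; i++)
                if (toy_logical_to_id[i] == id)
                        return i;
        return -1;
}

/* logical number -> processor id, or -1 if out of range */
static int toy_cpu_logical_map(int logical)
{
        if (logical < 0 || logical >= TOY_NUM_CPUS)
                return -1;
        return toy_logical_to_id[logical];
}
]]></programlisting>

    In this model processor id 2 has logical number 1, and id 1
    corresponds to no processor at all, which is the kind of gap the
    logical numbering hides.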
   </para>
  </sect1>

  <sect1 id="routines-init">
   <title><type>__init</type>/<type>__exit</type>/<type>__initdata</type>
    <filename class=headerfile>include/linux/init.h</filename></title>

   <para>
    After boot, the kernel frees up a special section; functions
    marked with <type>__init</type> and data structures marked with
    <type>__initdata</type> are dropped after boot is complete (within
    modules this directive is currently ignored). <type>__exit</type>
    is used to declare a function which is only required on exit: the
    function will be dropped if this file is not compiled as a module.
    See the header file for use. Note that it makes no sense for a function
    marked with <type>__init</type> to be exported to modules with
    <function>EXPORT_SYMBOL()</function> - this will break.
   </para>
   <para>
    Static data structures marked as <type>__initdata</type> must be initialised
    (as opposed to ordinary static data which is zeroed BSS) and cannot be
    <type>const</type>.
   </para>

  </sect1>

  <sect1 id="routines-init-again">
   <title><function>__initcall()</function>/<function>module_init()</function>
    <filename class=headerfile>include/linux/init.h</filename></title>
   <para>
    Many parts of the kernel are well served as a module
    (dynamically-loadable parts of the kernel). Using the
    <function>module_init()</function> and
    <function>module_exit()</function> macros it is easy to write code
    without #ifdefs which can operate both as a module or built into
    the kernel.
   </para>

   <para>
    The <function>module_init()</function> macro defines which
    function is to be called at module insertion time (if the file is
    compiled as a module), or at boot time: if the file is not
    compiled as a module the <function>module_init()</function> macro
    becomes equivalent to <function>__initcall()</function>, which
    through linker magic ensures that the function is called on boot.
   </para>

   <para>
    The function can return a negative error number to cause
    module loading to fail (unfortunately, this has no effect if
    the module is compiled into the kernel). For modules, this is
    called in user context, with interrupts enabled, and the
    kernel lock held, so it can sleep.
   </para>
  </sect1>

  <sect1 id="routines-moduleexit">
   <title> <function>module_exit()</function>
    <filename class=headerfile>include/linux/init.h</filename> </title>

   <para>
    This macro defines the function to be called at module removal
    time (or never, in the case of the file compiled into the
    kernel). It will only be called if the module usage count has
    reached zero. This function can also sleep, but cannot fail:
    everything must be cleaned up by the time it returns.
   </para>
  </sect1>

  <sect1 id="routines-module-use-counters">
   <title> <function>MOD_INC_USE_COUNT</function>/<function>MOD_DEC_USE_COUNT</function>
    <filename class=headerfile>include/linux/module.h</filename></title>

   <para>
    These manipulate the module usage count, to protect against
    removal (a module also can't be removed if another module uses
    one of its exported symbols: see below). Every reference to
    the module from user context should be reflected by this
    counter (e.g. for every data structure or socket) before the
    function sleeps. To quote Tim Waugh:
   </para>

   <programlisting>
/* THIS IS BAD */
foo_open (...)
{
        stuff..
        if (fail)
                return -EBUSY;
        sleep.. (might get unloaded here)
        stuff..
        MOD_INC_USE_COUNT;
        return 0;
}

        if (idx >= __BR_END)
                __br_lock_usage_bug();

        read_lock(&__brlock_array[smp_processor_id()][idx]);
}
   </programlisting>

   <para>
    <filename>include/linux/fs.h</filename>:
   </para>
   <programlisting>
/*
 * Kernel pointers have redundant information, so we can use a
 * scheme where we can return either an error code or a dentry
 * pointer with the same return value.
 *
 * This should be a per-architecture thing, to allow different
 * error and pointer decisions.
 */
 #define ERR_PTR(err)    ((void *)((long)(err)))
 #define PTR_ERR(ptr)    ((long)(ptr))
 #define IS_ERR(ptr)     ((unsigned long)(ptr) > (unsigned long)(-1000))
</programlisting>

   <para>
    <filename>include/asm-i386/uaccess.h:</filename>
   </para>

   <programlisting>
#define copy_to_user(to,from,n)                         \
        (__builtin_constant_p(n) ?                      \
         __constant_copy_to_user((to),(from),(n)) :     \
         __generic_copy_to_user((to),(from),(n)))
   </programlisting>

   <para>
    <filename>arch/sparc/kernel/head.S:</filename>
   </para>

   <programlisting>
/*
 * Sun people can't spell worth damn. "compatability" indeed.
 * At least we *know* we can't spell, and use a spell-checker.
 */

/* Uh, actually Linus it is I who cannot spell. Too much murky
 * Sparc assembly will do this to ya.
 */
C_LABEL(cputypvar):
        .asciz "compatability"

/* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */
        .align 4
C_LABEL(cputypvar_sun4m):
        .asciz "compatible"
   </programlisting>

   <para>
    <filename>arch/sparc/lib/checksum.S:</filename>
   </para>

   <programlisting>
        /* Sun, you just can't beat me, you just can't. Stop trying,
         * give up. I'm serious, I am going to kick the living shit
         * out of you, game over, lights out.
         */
   </programlisting>
 </chapter>

 <chapter id="credits">
  <title>Thanks</title>

  <para>
   Thanks to Andi Kleen for the idea, answering my questions, fixing
   my mistakes, filling content, etc. Philipp Rumpf for more spelling
   and clarity fixes, and some excellent non-obvious points. Werner
   Almesberger for giving me a great summary of
   <function>disable_irq()</function>, and Jes Sorensen and Andrea
   Arcangeli added caveats. Michael Elizabeth Chastain for checking
   and adding to the Configure section. <!-- Rusty insisted on this
   bit; I didn't do it! --> Telsa Gwynne for teaching me DocBook.
  </para>
 </chapter>
</book>