HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still a work in progress so it may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17     27622       7     6823
  128    758150     464364        21      9301      10     7738
  256    445632     774646        42     15507      21    12906
  512    232666     994445    241292     19147  241192     1062
 1024    119061    1000003    872519     19258  872511        0
 1440     85193    1000003    946576     19505  946569        0


Legend:
"Psize" == packet size in bytes.
"Ipps" == input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more times we couldn't clean up the rx ring.

Observe that when the NIC receives 890Kpackets/sec, only 17 rx interrupts
are generated; a single rx interrupt ends up covering tens of thousands of
packets. The system can't handle the processing at 1 interrupt/packet at
that load level. At lower rates, on the other hand, rx interrupts go up
and therefore the interrupt/packet ratio goes up (as observable from the
table). So there is a possibility that under low enough input, you get one
poll call for each input packet caused by a single interrupt each time.
And if the system can't handle an interrupt-per-packet ratio of 1, then it
will just have to chug along ....


0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts, or maybe other events that send
packets up the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their
status/event acknowledgement setup.
There are two types of event register ACK mechanisms:
	I) What is known as Clear-on-read (COR):
	when you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move everything to dev->poll().

	II) Clear-on-write (COW):
	 i) You clear the status by writing a 1 in the bit-location you want.
		These are the majority of the NICs, and they work best with
		NAPI. Put only receive events in dev->poll(); leave the rest
		in the old interrupt handler.
	 ii) Whatever you write in the status register clears everything ;->
		We can't seem to find any chip supported by Linux which does
		this. If someone knows of such a chip, please email us.
		Move everything to dev->poll().

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled; such a packet sneaks in during that period, and we only get to
know about it when the next new packet arrives and generates an interrupt
(refer to appendix 2).
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to more
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is
because only one CPU can pick up the initial interrupt and hence the
initial netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
This implies receive is totally lockless because of the guarantee that
only one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****Guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to
dev->poll()). For example, link/MII and tx-complete interrupts continue
functioning just the same old way. This improves the latency of processing
these events. It is also assumed that the receive interrupt is the largest
cause of noise. Note this might not always be true.
[According to Manfred Spraul, the winbond chip insists on sending one
tx-complete interrupt for each packet (although this can be mitigated)].
For such broken hardware, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Adds the device to the poll list for this CPU; assumes that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for the device, specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)

Removes the interface from the CPU poll list: it must be on the current
CPU's poll list. This primitive is called by dev->poll() when it completes
its work. The device cannot be out of the poll list at this call; if it
is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets on the current CPU before yielding to the network
subsystem (so other devices can also get an opportunity to send to the
stack).

The dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
Each driver is responsible for decrementing budget by the total number of
packets sent.
The total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver simply
sends up to the requested quantity of packets to the stack, if it can.

More on dev->poll() below, after the interrupt changes are explained.

2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method.
e.g.:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;
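
Putting the fragment above in context, a minimal sketch of a complete
probe-time setup might look as follows. This is only an illustration:
my_probe and my_start_xmit are invented names following the my_*
convention of this document, and all chip detection and error handling
is elided.

-------------------------------------------------------------------
static int __init my_probe(struct net_device *dev)
{
	/* ... chip detection, I/O region and IRQ setup, dev->priv
	 * allocation etc. would go here ... */

	/* the usual 2.4-style methods */
	dev->open = my_open;
	dev->hard_start_xmit = my_start_xmit;
	dev->stop = my_close;

	/* the two new NAPI additions */
	dev->poll = my_poll;	/* poll method from section 1 above */
	dev->weight = 16;	/* default weight/quanta; can be
				 * overridden in /proc */
	return 0;
}
-------------------------------------------------------------------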

3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;
	int work_count = my_work_count;
	int status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE; /* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED; /* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobuffs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) || more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;
	int status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE; /* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED; /* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP(); /* don't ack rx and rxnobuff here */
/************************ end note *********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobuffs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell the system we have work to be done */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note *********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) || more_work_to_be_done(status));
/************************ end note *********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------
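
How the rx interrupt sources actually get masked and unmasked is entirely
chip-specific; disable_rx_and_rxnobuff_ints() and its enable counterpart
are left abstract above. Purely as a minimal sketch, assuming a
hypothetical chip whose interrupt-enable mask is a single memory-mapped
register (IntrMaskReg, MaskRxDone, MaskRxNoBuff, and the intr_mask and
mmio_addr fields of my_private are invented names), they could look like
this, in which case the handler above would call
disable_rx_and_rxnobuff_ints(tp):

---------------------------------------------------------------------
/* mask/unmask only the interrupt sources caused by arriving packets,
 * leaving link/MII and tx-complete interrupts untouched */
static inline void disable_rx_and_rxnobuff_ints(struct my_private *tp)
{
	tp->intr_mask &= ~(MaskRxDone | MaskRxNoBuff);
	writel(tp->intr_mask, tp->mmio_addr + IntrMaskReg);
}

static inline void enable_rx_and_rxnobuf_ints(struct my_private *tp)
{
	tp->intr_mask |= (MaskRxDone | MaskRxNoBuff);
	writel(tp->intr_mask, tp->mmio_addr + IntrMaskReg);
}
---------------------------------------------------------------------

Note that only the interrupt-enable mask is touched here; the rx and
rxnobuff *status* bits are deliberately left alone, since they are
cleared where the real work is done, as explained next.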

We note several things from the above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are
the interrupt sources we wish to avoid. The two common ones are a) a
packet arriving (rxint), and b) a packet arriving and finding no DMA
buffers available (rxnobuff).
This also means acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI, i.e. in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in the running state
and was successfully added to the core poll list. If we get a zero value,
we can _almost_ assume we are already on the list (rather than that the
device is not running; this logic is based on the fact that you shouldn't
get an interrupt if the device is not running). We rectify this by
disabling rx and rxnobuff interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
These functionalities are actually still around; in fact,
receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to
my_poll().

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by the interrupt handler */
static void receive_packets(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer;
		 * ring_offset is the chip-specific offset of this entry's
		 * descriptor. The numbers 16 and 4 are just examples. */
		rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err (rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab a skb */
		skb = dev_alloc_skb (pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx (skb);
			.
			.
		} else {  /* OOM */
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring(dev);
	.
	.

}
-------------------------------------------------------------------
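
Both the old receive_packets() and the new my_poll() shown next delegate
buffer replenishment to refill_rx_ring(), which this document does not
spell out. Below is a minimal sketch only: PKT_BUF_SZ,
give_buffer_to_nic() and clear_rxnobuff_status() are invented,
chip-specific names. The NAPI-relevant detail (see note 1 after
my_poll() below) is that this is where the rxnobuff status bit finally
gets cleared, once buffers are available again.

-------------------------------------------------------------------
static void refill_rx_ring(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;

	/* re-arm every consumed ring slot with a fresh skb */
	for (; tp->cur_rx - tp->dirty_rx > 0; tp->dirty_rx++) {
		int entry = tp->dirty_rx % RX_RING_SIZE;

		if (tp->rx_buffers[entry].skb == NULL) {
			struct sk_buff *skb = dev_alloc_skb(PKT_BUF_SZ);

			if (skb == NULL)
				break;	/* OOM; the caller's oom path
					 * deals with this */
			skb->dev = dev;
			tp->rx_buffers[entry].skb = skb;
			/* hand the buffer back to the NIC (chip-specific) */
			give_buffer_to_nic(tp, entry, skb);
		}
	}

	/* only when buffers are available again is it safe to clear
	 * (ack) the rxnobuff status bit */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb != NULL)
		clear_rxnobuff_status(dev);
}
-------------------------------------------------------------------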

We now change receive_packets() to the new my_poll() below; note the
additional parameter in the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll(struct net_device *dev, int *budget)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int status;
	/* maximum packets to send to the stack */
/************************ note *********************************/
	int rx_work_limit = dev->quota;
/************************ end note *********************************/

	do {	/* outer loop starts here */

		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			u32 rx_status;
			unsigned int rx_size;
			unsigned int pkt_size;
			struct sk_buff *skb;
			/* read size+status of next frame from DMA ring
			 * buffer; the numbers 16 and 4 are just examples */
			rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			/* process errors */
			if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err (rx_status, dev, tp, ioaddr);
				return 1;
			}

/************************ note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */
				tp->cur_rx = cur_rx;

				/* Refill the Rx ring buffers if they are needed */
				refill_rx_ring(dev);
				goto not_done;
			}
/********************** end note **********************************/

			/* grab a skb */
			skb = dev_alloc_skb (pkt_size + 2);
			if (skb) {
				.
				.
/************************ note *********************************/
				netif_receive_skb (skb);
/********************** end note **********************************/
				.
				.
			} else {  /* OOM */
				/* seems very driver specific ... common is
				   to just pass whatever is on the ring
				   already. */
			}

			/* move to the next skb on the ring */
			entry = (++cur_rx) % RX_RING_SIZE;
			received++;

		}

		/* store current ring pointer state */
		tp->cur_rx = cur_rx;

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we
		   last checked */
		status = read_interrupt_status_reg();
		if (rx status is not set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}
		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on status
		   bits, they'll be caught by the while check and we go
		   back and clear them, since we haven't exceeded our
		   quota */
	} while (rx_status_is_set);

done:

/************************ note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If the RX ring is not full, we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuf_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in the irq handler (which
	 *    are done to schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq is raised after the beginning of the outer loop
	 *    (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too-late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0;   /* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1;  /* not_done */

oom:
	/* Start the timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0;  /* we'll take it from here, so tell the core "done" */

/************************ End note *********************************/
}
-------------------------------------------------------------------

From the above we note that:
0) rx_work_limit = dev->quota;
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work;
2) we have a done and a not_done state;
3) instead of netif_rx() we call netif_receive_skb() to pass the skb;
4) we have a new way of handling the OOM condition;
5) a new outer do-while loop has been added. This serves the purpose of
ensuring that, if a new packet has come in after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.


-----------------------------------------------------------
Poll timer code will need to do the following:

a)

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If the RX ring is not full we are still out of memory;
	   restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);  /* we are back on the poll list */
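
Wrapped up as a complete timer handler, the above could look as follows.
This is a sketch only: it assumes start_poll_timer()/restart_timer() arm
an ordinary struct timer_list whose data argument is the device, and
tp->oom_timer and POLL_TIMER_PERIOD are invented names.

-----------------------------------------------------------
static void my_poll_timer(unsigned long data)
{
	struct net_device *dev = (struct net_device *)data;
	struct my_private *tp = (struct my_private *)dev->priv;

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) {
		/* still out of memory: restart the timer */
		mod_timer(&tp->oom_timer, jiffies + POLL_TIMER_PERIOD);
	} else {
		/* back on the poll list; rx interrupts stay disabled
		 * until my_poll() completes its work */
		netif_rx_schedule(dev);
	}
}
-----------------------------------------------------------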

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about these; the top net layer takes
care of them.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx
buffers. Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then
theoretically there will only be one rx interrupt for all packets during
a given packet storm. Under low load, we might have a single interrupt
per packet.
FC should be programmed to apply in the case when the system can't pull
out packets fast enough, i.e. send a pause only when you run out of rx
buffers.
Note that FC in itself is a good solution, but we have found it not to be
much of a commodity feature (in both NICs and switches), and hence it
falls under the same category as using NIC-based mitigation. Also,
experiments indicated that it is much harder to resolve the resource
allocation issue (aka the lazy receiving that NAPI offers), and hence
quantifying its usefulness proved harder. In any case, FC works even
better with NAPI, but is not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here:

1) status/interrupt-enable pairs which honor level-triggered IRQs

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated.
However, as soon as the "interrupt-enable" bit is unmasked, an immediate
interrupt is generated [assuming the status bit was not turned off].
Generally, the concept of level-triggered IRQs in association with a
status and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be
turned off (but CSR5 will continue to be turned on with new packet
arrivals even if we clear it the first time).
Very important is the fact that if we turn the interrupt-enable bit on
while status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done", and then went on to do a few other things, then when we
enable interrupts there is a possibility that a new packet might sneak in
during this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new ones can arrive while we are doing this */
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                         * where the comments are ;-> an irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring ... you check
if the status bit says there are more packets just in ... it says none;
you then enable rx interrupts again; if a new packet just came in during
this check, we are counting on CSR5 being set in that small window of
opportunity, so that by re-enabling interrupts we actually trigger an
interrupt to register the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level-triggered IRQs. Normally,
irqs may be lost while being masked, and the only way to leave poll is to
do a double check for new input after netif_rx_complete() is invoked,
and to re-enable polling (after seeing this new input).

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
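	/* on this hardware an irq edge may have been lost while rx
	 * interrupts were masked; hence, after completing the poll and
	 * re-enabling interrupts, double-check the ring for input that
	 * sneaked in, and reschedule if any is found */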
	enable_rx_interrupts();
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs();
		goto restart_poll;
	}
---------

Basically, netif_rx_complete() removes us from the poll list; but because
a race makes it possible for a new packet to come in and never be
noticed, we double-check for new input and, if any arrived, attempt to
re-add ourselves to the poll list.




APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as
the general solution to schedule softirqs to run before the next
interrupt, putting them under scheduler control. This also prevents
consecutive softirqs from monopolizing the CPU. It has the side effect
that the priority of ksoftirqd needs to be considered when running very
CPU-intensive applications together with networking, to get the proper
softirq/user balance. Increasing ksoftirqd priority to 0 (or even higher)
is reported to cure problems with low network performance at high CPU
load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
=================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>