1 Goals, Design and Implementation of the 2 new ultra-scalable O(1) scheduler 3 4 5 This is an edited version of an email Ingo Molnar sent to 6 lkml on 4 Jan 2002. It describes the goals, design, and 7 implementation of Ingo's new ultra-scalable O(1) scheduler. 8 Last Updated: 18 April 2002. 9 10 11Goal 12==== 13 14The main goal of the new scheduler is to keep all the good things we know 15and love about the current Linux scheduler: 16 17 - good interactive performance even during high load: if the user 18 types or clicks then the system must react instantly and must execute 19 the user tasks smoothly, even during considerable background load. 20 21 - good scheduling/wakeup performance with 1-2 runnable processes. 22 23 - fairness: no process should stay without any timeslice for any 24 unreasonable amount of time. No process should get an unjustly high 25 amount of CPU time. 26 27 - priorities: less important tasks can be started with lower priority, 28 more important tasks with higher priority. 29 30 - SMP efficiency: no CPU should stay idle if there is work to do. 31 32 - SMP affinity: processes which run on one CPU should stay affine to 33 that CPU. Processes should not bounce between CPUs too frequently. 34 35 - plus additional scheduler features: RT scheduling, CPU binding. 36 37and the goal is also to add a few new things: 38 39 - fully O(1) scheduling. Are you tired of the recalculation loop 40 blowing the L1 cache away every now and then? Do you think the goodness 41 loop is taking a bit too long to finish if there are lots of runnable 42 processes? This new scheduler takes no prisoners: wakeup(), schedule(), 43 the timer interrupt are all O(1) algorithms. There is no recalculation 44 loop. There is no goodness loop either. 45 46 - 'perfect' SMP scalability. With the new scheduler there is no 'big' 47 runqueue_lock anymore - it's all per-CPU runqueues and locks - two 48 tasks on two separate CPUs can wake up, schedule and context-switch 49 completely in parallel, without any interlocking. All 50 scheduling-relevant data is structured for maximum scalability. 51 52 - better SMP affinity. The old scheduler has a particular weakness that 53 causes the random bouncing of tasks between CPUs if/when higher 54 priority/interactive tasks, this was observed and reported by many 55 people. The reason is that the timeslice recalculation loop first needs 56 every currently running task to consume its timeslice. But when this 57 happens on eg. an 8-way system, then this property starves an 58 increasing number of CPUs from executing any process. Once the last 59 task that has a timeslice left has finished using up that timeslice, 60 the recalculation loop is triggered and other CPUs can start executing 61 tasks again - after having idled around for a number of timer ticks. 62 The more CPUs, the worse this effect. 63 64 Furthermore, this same effect causes the bouncing effect as well: 65 whenever there is such a 'timeslice squeeze' of the global runqueue, 66 idle processors start executing tasks which are not affine to that CPU. 67 (because the affine tasks have finished off their timeslices already.) 68 69 The new scheduler solves this problem by distributing timeslices on a 70 per-CPU basis, without having any global synchronization or 71 recalculation. 72 73 - batch scheduling. A significant proportion of computing-intensive tasks 74 benefit from batch-scheduling, where timeslices are long and processes 75 are roundrobin scheduled. The new scheduler does such batch-scheduling 76 of the lowest priority tasks - so nice +19 jobs will get 77 'batch-scheduled' automatically. With this scheduler, nice +19 jobs are 78 in essence SCHED_IDLE, from an interactiveness point of view. 79 80 - handle extreme loads more smoothly, without breakdown and scheduling 81 storms. 82 83 - O(1) RT scheduling. For those RT folks who are paranoid about the 84 O(nr_running) property of the goodness loop and the recalculation loop. 85 86 - run fork()ed children before the parent. Andrea has pointed out the 87 advantages of this a few months ago, but patches for this feature 88 do not work with the old scheduler as well as they should, 89 because idle processes often steal the new child before the fork()ing 90 CPU gets to execute it. 91 92 93Design 94====== 95 96The core of the new scheduler contains the following mechanisms: 97 98 - *two* priority-ordered 'priority arrays' per CPU. There is an 'active' 99 array and an 'expired' array. The active array contains all tasks that 100 are affine to this CPU and have timeslices left. The expired array 101 contains all tasks which have used up their timeslices - but this array 102 is kept sorted as well. The active and expired array is not accessed 103 directly, it's accessed through two pointers in the per-CPU runqueue 104 structure. If all active tasks are used up then we 'switch' the two 105 pointers and from now on the ready-to-go (former-) expired array is the 106 active array - and the empty active array serves as the new collector 107 for expired tasks. 108 109 - there is a 64-bit bitmap cache for array indices. Finding the highest 110 priority task is thus a matter of two x86 BSFL bit-search instructions. 111 112the split-array solution enables us to have an arbitrary number of active 113and expired tasks, and the recalculation of timeslices can be done 114immediately when the timeslice expires. Because the arrays are always 115access through the pointers in the runqueue, switching the two arrays can 116be done very quickly. 117 118this is a hybride priority-list approach coupled with roundrobin 119scheduling and the array-switch method of distributing timeslices. 120 121 - there is a per-task 'load estimator'. 122 123one of the toughest things to get right is good interactive feel during 124heavy system load. While playing with various scheduler variants i found 125that the best interactive feel is achieved not by 'boosting' interactive 126tasks, but by 'punishing' tasks that want to use more CPU time than there 127is available. This method is also much easier to do in an O(1) fashion. 128 129to establish the actual 'load' the task contributes to the system, a 130complex-looking but pretty accurate method is used: there is a 4-entry 131'history' ringbuffer of the task's activities during the last 4 seconds. 132This ringbuffer is operated without much overhead. The entries tell the 133scheduler a pretty accurate load-history of the task: has it used up more 134CPU time or less during the past N seconds. [the size '4' and the interval 135of 4x 1 seconds was found by lots of experimentation - this part is 136flexible and can be changed in both directions.] 137 138the penalty a task gets for generating more load than the CPU can handle 139is a priority decrease - there is a maximum amount to this penalty 140relative to their static priority, so even fully CPU-bound tasks will 141observe each other's priorities, and will share the CPU accordingly. 142 143the SMP load-balancer can be extended/switched with additional parallel 144computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs 145can be supported easily by changing the load-balancer. Right now it's 146tuned for my SMP systems. 147 148i skipped the prev->mm == next->mm advantage - no workload i know of shows 149any sensitivity to this. It can be added back by sacrificing O(1) 150schedule() [the current and one-lower priority list can be searched for a 151that->mm == current->mm condition], but costs a fair number of cycles 152during a number of important workloads, so i wanted to avoid this as much 153as possible. 154 155- the SMP idle-task startup code was still racy and the new scheduler 156triggered this. So i streamlined the idle-setup code a bit. We do not call 157into schedule() before all processors have started up fully and all idle 158threads are in place. 159 160- the patch also cleans up a number of aspects of sched.c - moves code 161into other areas of the kernel where it's appropriate, and simplifies 162certain code paths and data constructs. As a result, the new scheduler's 163code is smaller than the old one. 164 165 Ingo 166