1Anticipatory IO scheduler 2------------------------- 3Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003 4 5Attention! Database servers, especially those using "TCQ" disks should 6investigate performance with the 'deadline' IO scheduler. Any system with high 7disk performance requirements should do so, in fact. 8 9If you see unusual performance characteristics of your disk systems, or you 10see big performance regressions versus the deadline scheduler, please email 11me. Database users don't bother unless you're willing to test a lot of patches 12from me ;) its a known issue. 13 14Also, users with hardware RAID controllers, doing striping, may find 15highly variable performance results with using the as-iosched. The 16as-iosched anticipatory implementation is based on the notion that a disk 17device has only one physical seeking head. A striped RAID controller 18actually has a head for each physical device in the logical RAID device. 19 20However, setting the antic_expire (see tunable parameters below) produces 21very similar behavior to the deadline IO scheduler. 22 23 24Selecting IO schedulers 25----------------------- 26To choose IO schedulers at boot time, use the argument 'elevator=deadline'. 27'noop', 'as' and 'cfq' (the default) are also available. IO schedulers are 28assigned globally at boot time only presently. It's also possible to change 29the IO scheduler for a determined device on the fly, as described in 30Documentation/block/switching-sched.txt. 31 32 33Anticipatory IO scheduler Policies 34---------------------------------- 35The as-iosched implementation implements several layers of policies 36to determine when an IO request is dispatched to the disk controller. 37Here are the policies outlined, in order of application. 38 391. one-way Elevator algorithm. 40 41The elevator algorithm is similar to that used in deadline scheduler, with 42the addition that it allows limited backward movement of the elevator 43(i.e. seeks backwards). A seek backwards can occur when choosing between 44two IO requests where one is behind the elevator's current position, and 45the other is in front of the elevator's position. If the seek distance to 46the request in back of the elevator is less than half the seek distance to 47the request in front of the elevator, then the request in back can be chosen. 48Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. 49This favors forward movement of the elevator, while allowing opportunistic 50"short" backward seeks. 51 522. FIFO expiration times for reads and for writes. 53 54This is again very similar to the deadline IO scheduler. The expiration 55times for requests on these lists is tunable using the parameters read_expire 56and write_expire discussed below. When a read or a write expires in this way, 57the IO scheduler will interrupt its current elevator sweep or read anticipation 58to service the expired request. 59 603. Read and write request batching 61 62A batch is a collection of read requests or a collection of write 63requests. The as scheduler alternates dispatching read and write batches 64to the driver. In the case a read batch, the scheduler submits read 65requests to the driver as long as there are read requests to submit, and 66the read batch time limit has not been exceeded (read_batch_expire). 67The read batch time limit begins counting down only when there are 68competing write requests pending. 69 70In the case of a write batch, the scheduler submits write requests to 71the driver as long as there are write requests available, and the 72write batch time limit has not been exceeded (write_batch_expire). 73However, the length of write batches will be gradually shortened 74when read batches frequently exceed their time limit. 75 76When changing between batch types, the scheduler waits for all requests 77from the previous batch to complete before scheduling requests for the 78next batch. 79 80The read and write fifo expiration times described in policy 2 above 81are checked only when in scheduling IO of a batch for the corresponding 82(read/write) type. So for example, the read FIFO timeout values are 83tested only during read batches. Likewise, the write FIFO timeout 84values are tested only during write batches. For this reason, 85it is generally not recommended for the read batch time 86to be longer than the write expiration time, nor for the write batch 87time to exceed the read expiration time (see tunable parameters below). 88 89When the IO scheduler changes from a read to a write batch, 90it begins the elevator from the request that is on the head of the 91write expiration FIFO. Likewise, when changing from a write batch to 92a read batch, scheduler begins the elevator from the first entry 93on the read expiration FIFO. 94 954. Read anticipation. 96 97Read anticipation occurs only when scheduling a read batch. 98This implementation of read anticipation allows only one read request 99to be dispatched to the disk controller at a time. In 100contrast, many write requests may be dispatched to the disk controller 101at a time during a write batch. It is this characteristic that can make 102the anticipatory scheduler perform anomalously with controllers supporting 103TCQ, or with hardware striped RAID devices. Setting the antic_expire 104queue parameter (see below) to zero disables this behavior, and the 105anticipatory scheduler behaves essentially like the deadline scheduler. 106 107When read anticipation is enabled (antic_expire is not zero), reads 108are dispatched to the disk controller one at a time. 109At the end of each read request, the IO scheduler examines its next 110candidate read request from its sorted read list. If that next request 111is from the same process as the request that just completed, 112or if the next request in the queue is "very close" to the 113just completed request, it is dispatched immediately. Otherwise, 114statistics (average think time, average seek distance) on the process 115that submitted the just completed request are examined. If it seems 116likely that that process will submit another request soon, and that 117request is likely to be near the just completed request, then the IO 118scheduler will stop dispatching more read requests for up time (antic_expire) 119milliseconds, hoping that process will submit a new request near the one 120that just completed. If such a request is made, then it is dispatched 121immediately. If the antic_expire wait time expires, then the IO scheduler 122will dispatch the next read request from the sorted read queue. 123 124To decide whether an anticipatory wait is worthwhile, the scheduler 125maintains statistics for each process that can be used to compute 126mean "think time" (the time between read requests), and mean seek 127distance for that process. One observation is that these statistics 128are associated with each process, but those statistics are not associated 129with a specific IO device. So for example, if a process is doing IO 130on several file systems on separate devices, the statistics will be 131a combination of IO behavior from all those devices. 132 133 134Tuning the anticipatory IO scheduler 135------------------------------------ 136When using 'as', the anticipatory IO scheduler there are 5 parameters under 137/sys/block/*/queue/iosched/. All are units of milliseconds. 138 139The parameters are: 140* read_expire 141 Controls how long until a read request becomes "expired". It also controls the 142 interval between which expired requests are served, so set to 50, a request 143 might take anywhere < 100ms to be serviced _if_ it is the next on the 144 expired list. Obviously request expiration strategies won't make the disk 145 go faster. The result basically equates to the timeslice a single reader 146 gets in the presence of other IO. 100*((seek time / read_expire) + 1) is 147 very roughly the % streaming read efficiency your disk should get with 148 multiple readers. 149 150* read_batch_expire 151 Controls how much time a batch of reads is given before pending writes are 152 served. A higher value is more efficient. This might be set below read_expire 153 if writes are to be given higher priority than reads, but reads are to be 154 as efficient as possible when there are no writes. Generally though, it 155 should be some multiple of read_expire. 156 157* write_expire, and 158* write_batch_expire are equivalent to the above, for writes. 159 160* antic_expire 161 Controls the maximum amount of time we can anticipate a good read (one 162 with a short seek distance from the most recently completed request) before 163 giving up. Many other factors may cause anticipation to be stopped early, 164 or some processes will not be "anticipated" at all. Should be a bit higher 165 for big seek time devices though not a linear correspondence - most 166 processes have only a few ms thinktime. 167