	--- GEOM BASED DISK SCHEDULERS FOR FREEBSD ---

This code contains a framework for GEOM-based disk schedulers and a
couple of sample scheduling algorithms that use the framework and
implement two forms of "anticipatory scheduling" (see below for more
details).

As a quick example of what this code can give you, try to run "dd",
"tar", or some other program with a highly SEQUENTIAL access pattern,
together with "cvs", "cvsup", "svn" or another program with a highly
RANDOM access pattern (this is not a made-up example: it is pretty
common for developers to have one or more apps doing random accesses,
and others doing sequential accesses, e.g., loading large binaries
from disk, checking the integrity of tarballs, watching media
streams and so on).

These are the results we get on a local machine (AMD BE2400 dual
core CPU, SATA 250GB disk):

    /mnt is a partition mounted on /dev/ad0s1f

    cvs:        cvs -d /mnt/home/ncvs-local update -Pd /mnt/ports
    dd-read:    dd bs=128k of=/dev/null if=/dev/ad0 (or ad0-sched-)
    dd-write:   dd bs=128k if=/dev/zero of=/mnt/largefile

                     NO SCHEDULER            RR SCHEDULER
                     dd        cvs           dd        cvs

    dd-read only     72 MB/s   ----          72 MB/s   ---
    dd-write only    55 MB/s   ---           55 MB/s   ---
    dd-read+cvs       6 MB/s   ok            30 MB/s   ok
    dd-write+cvs     55 MB/s   slooow        14 MB/s   ok

As you can see, when cvs runs concurrently with dd, performance
drops dramatically, and depending on whether dd is reading or
writing, one of the two programs is severely penalized. The RR
scheduler in this example makes the dd reader go much faster when
competing with cvs, and lets cvs make progress when competing with
a writer.

To try it out:

1. USERS OF FREEBSD 7, PLEASE READ THE FOLLOWING CAREFULLY:

   On loading, this module patches one kernel function
   (g_io_request()) so that I/O requests ("bio's") carry a
   classification tag, useful for scheduling purposes.

   ON FREEBSD 7, the tag is stored in an existing (though rarely
   used) field of "struct bio", a solution which makes this module
   incompatible with other modules that use the same field, such as
   ZFS and gjournal. Additionally, g_io_request() is patched
   in-memory to add a call to the function that initializes this
   field (i386/amd64 only; for other architectures you need to patch
   sys/geom/geom_io.c manually). See details in the file g_sched.c.

   On FreeBSD 8.0 and above, this trick is not necessary, as
   "struct bio" contains dedicated fields for the classifier, and
   hooks for request classifiers.

   If you don't like the above, don't run this code.

2. PLEASE MAKE SURE THAT THE DISK THAT YOU WILL BE USING FOR TESTS
   DOES NOT CONTAIN PRECIOUS DATA.
   This is experimental code, so we make no guarantees, though
   I am routinely using it on my desktop and laptop.

3. EXTRACT AND BUILD THE PROGRAMS
   A 'make install' in the directory should work (with root privs),
   or you can even try the binary modules.
   If you want to build the modules yourself, look at the Makefile.

4. LOAD THE MODULE, CREATE A GEOM NODE, RUN TESTS

   The scheduler's module must be loaded first:

	# kldload gsched_rr

   Substitute gsched_as for gsched_rr to test the AS scheduler.
   Then, supposing that you are using /dev/ad0 for testing, a
   scheduler can be attached to it with:

	# geom sched insert ad0

   The scheduler is inserted transparently into the geom chain, so
   mounted partitions and filesystems will keep working, but
   requests will now go through the scheduler.

   To change the scheduler on the fly, you can reconfigure the geom:

	# geom sched configure -a as ad0.sched.

   assuming that gsched_as was loaded previously.

5. SCHEDULER REMOVAL

   In principle it is possible to remove the scheduler module
   even on an active chain by doing

	# geom sched destroy ad0.sched.

   However, there is a race in the geom subsystem which makes
   removal unsafe if there are active requests on a chain.
   So, to reduce the risk of data loss, make sure you don't remove
   a scheduler from a chain with ongoing transactions.

--- NOTES ON THE SCHEDULERS ---

The important contribution of this code is the framework for
experimenting with different scheduling algorithms. 'Anticipatory
scheduling' is a very powerful technique based on the following
reasoning:

    Disk throughput is much better when the disk serves sequential
    requests. If we have a mix of sequential and random requests,
    and we see a non-sequential request, do not serve it
    immediately; instead, wait a little bit (2..5ms) to see if
    another request arrives that the disk can serve more
    efficiently.

There are many details that should be added to make sure that the
mechanism is effective with different workloads and systems, to
gain a few extra percent in performance, to improve fairness, and
to improve isolation among processes. A discussion of the vast
literature on the subject is beyond the purpose of this short note.

--------------------------------------------------------------------------

TRANSPARENT INSERT/DELETE

geom_sched is an ordinary geom module; however, it is convenient
to plug it transparently into the geom graph, so that one can
enable or disable scheduling on a mounted filesystem, and the
names in /etc/fstab do not depend on the presence of the scheduler.

To understand how this works in practice, remember that in GEOM
we have "provider" and "geom" objects.
Say that we want to hook a scheduler on provider "ad0",
accessible through pointer 'pp'. Originally, pp is attached to
geom "ad0" (same name, different object) accessible through
pointer old_gp:

    BEFORE ---> [ pp --> old_gp ...]

A normal "geom sched create ad0" call would create a new geom node
on top of provider ad0/pp, and export a newly created provider
("ad0.sched.", accessible through pointer newpp).

    AFTER create ---> [ newpp --> gp --> cp ] ---> [ pp --> old_gp ... ]

On top of newpp, a whole tree is created automatically, so we can
e.g. mount partitions on /dev/ad0.sched.s1d, and those requests
will go through the scheduler, whereas any partition mounted on
the pre-existing device entries will not go through the scheduler.

With the transparent insert mechanism, the original provider
"ad0"/pp is instead hooked to the newly created geom, as follows:

    AFTER insert ---> [ pp --> gp --> cp ] ---> [ newpp --> old_gp ... ]

so anything that was previously using provider pp will now have
its requests routed through the scheduler node.

A removal ("geom sched destroy ad0.sched.") restores the original
configuration.

# $FreeBSD$