
	--- GEOM BASED DISK SCHEDULERS FOR FREEBSD ---

This code contains a framework for GEOM-based disk schedulers and a
couple of sample scheduling algorithms that use the framework and
implement two forms of "anticipatory scheduling" (see below for more
details).

As a quick example of what this code can give you, try running "dd",
"tar", or some other program with a highly SEQUENTIAL access pattern,
together with "cvs", "cvsup", "svn" or some other program with a
highly RANDOM access pattern (this is not a made-up example: it is
pretty common for developers to have one or more apps doing random
accesses while others do sequential accesses, e.g., loading large
binaries from disk, checking the integrity of tarballs, watching
media streams and so on).

These are the results we get on a local machine (AMD BE2400 dual
core CPU, SATA 250GB disk):

    /mnt is a partition mounted on /dev/ad0s1f

    cvs:        cvs -d /mnt/home/ncvs-local update -Pd /mnt/ports
    dd-read:    dd bs=128k of=/dev/null if=/dev/ad0 (or ad0.sched.)
    dd-write:   dd bs=128k if=/dev/zero of=/mnt/largefile

                        NO SCHEDULER            RR SCHEDULER
                        dd      cvs             dd      cvs

    dd-read only        72 MB/s ---             72 MB/s ---
    dd-write only       55 MB/s ---             55 MB/s ---
    dd-read+cvs          6 MB/s ok              30 MB/s ok
    dd-write+cvs        55 MB/s slooow          14 MB/s ok

As you can see, when cvs runs concurrently with dd, performance drops
dramatically, and depending on whether dd is reading or writing, one
of the two is severely penalized.  The RR scheduler in this example
makes the dd reader go much faster when competing with cvs, and lets
cvs make progress when competing with a writer.
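
To reproduce a similar test yourself (device and path names are the
same placeholders used above; adjust them to your setup), run the two
workloads concurrently, one per shell:

    # dd bs=128k if=/dev/ad0 of=/dev/null
    # cvs -d /mnt/home/ncvs-local update -Pd /mnt/ports

"dd" reports its throughput when it terminates (or when it receives
SIGINFO, i.e. on ^T), so you can compare the numbers with and without
the scheduler inserted.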

To try it out:

1. USERS OF FREEBSD 7, PLEASE READ THE FOLLOWING CAREFULLY:

    On loading, this module patches one kernel function (g_io_request())
    so that I/O requests ("bio's") carry a classification tag, useful
    for scheduling purposes.

    ON FREEBSD 7, the tag is stored in an existing (though rarely used)
    field of the "struct bio", a solution which makes this module
    incompatible with other modules using that field, such as ZFS and
    gjournal.  Additionally, g_io_request() is patched in-memory to add
    a call to the function that initializes this field (i386/amd64 only;
    for other architectures you need to manually patch sys/geom/geom_io.c).
    See details in the file g_sched.c.

    On FreeBSD 8.0 and above, this trick is not necessary, as the
    struct bio contains dedicated fields for the classifier, and hooks
    for request classifiers.

    If you don't like the above, don't run this code.
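
    On FreeBSD 7, before loading the module it is also worth making
    sure that none of the conflicting modules mentioned above is
    loaded (zfs.ko and geom_journal.ko are their usual file names):

      # kldstat | grep -E 'zfs|geom_journal'

    Any output from the command above means a conflicting module is
    in the kernel, in which case you should not load this one.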

2. PLEASE MAKE SURE THAT THE DISK THAT YOU WILL BE USING FOR TESTS
   DOES NOT CONTAIN PRECIOUS DATA.
    This is experimental code, so we make no guarantees, though
    I routinely use it on my desktop and laptop.

3. EXTRACT AND BUILD THE PROGRAMS
    A 'make install' in the directory should work (with root privs),
    or you can even try the binary modules.
    If you want to build the modules yourself, look at the Makefile.
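
    For instance, assuming the Makefile uses the stock bsd.kmod.mk
    kernel-module infrastructure (which also provides 'load' and
    'unload' convenience targets), the following is enough:

      # make
      # make install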

4. LOAD THE MODULE, CREATE A GEOM NODE, RUN TESTS

    The scheduler module must be loaded first:

      # kldload gsched_rr

    (substitute gsched_as to test AS).  Then, assuming that you are
    using /dev/ad0 for testing, a scheduler can be attached to it with:

      # geom sched insert ad0

    The scheduler is inserted transparently into the geom chain, so
    mounted partitions and filesystems will keep working, but
    their requests will now go through the scheduler.

    To change the scheduling algorithm on the fly, you can reconfigure
    the geom:

      # geom sched configure -a as ad0.sched.

    assuming that gsched_as was loaded previously.
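
    Putting the whole sequence together, still with ad0 as the
    example device:

      # kldload gsched_rr
      # geom sched insert ad0
      # kldload gsched_as
      # geom sched configure -a as ad0.sched.

    The generic "geom sched list" command can be used at any point
    to inspect the resulting configuration.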

5. SCHEDULER REMOVAL

    In principle it is possible to remove the scheduler even from an
    active chain by doing

	# geom sched destroy ad0.sched.

    However, there is a race in the geom subsystem which makes the
    removal unsafe if there are active requests on a chain.  So, in
    order to reduce the risk of data loss, make sure you don't remove
    a scheduler from a chain with ongoing transactions.
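
    A conservative sequence is to quiesce the disk first, e.g. by
    stopping the test programs and unmounting any filesystem that
    uses it (the mountpoint below is just an example), and only then
    remove the node:

      # umount /mnt
      # geom sched destroy ad0.sched.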

--- NOTES ON THE SCHEDULERS ---

The main contribution of this code is the framework for experimenting
with different scheduling algorithms.  "Anticipatory scheduling"
is a very powerful technique based on the following reasoning:

    Disk throughput is much higher when serving sequential requests.
    So, given a mix of sequential and random requests, when a
    non-sequential request arrives, do not serve it immediately;
    instead, wait a short time (2..5 ms) to see whether another
    request arrives that the disk can serve more efficiently.

Many details must be added to make sure that the mechanism is
effective with different workloads and systems, to gain a few extra
percent in performance, and to improve fairness, isolation among
processes, etc.  A discussion of the vast literature on the subject
is beyond the scope of this short note.
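
The sample schedulers expose their tunables (such as the length of
the anticipation window) as sysctl variables.  The exact node names
are an assumption here, so list the tree to find them on your system:

    # sysctl kern.geom.sched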

--------------------------------------------------------------------------

TRANSPARENT INSERT/DELETE

geom_sched is an ordinary geom module; however, it is convenient
to plug it transparently into the geom graph, so that one can
enable or disable scheduling on a mounted filesystem, and the
names in /etc/fstab do not depend on the presence of the scheduler.

To understand how this works in practice, remember that in GEOM
we have "provider" and "geom" objects.
Say that we want to hook a scheduler on provider "ad0",
accessible through pointer 'pp'.  Originally, pp is attached to
geom "ad0" (same name, different object) accessible through
pointer old_gp:

  BEFORE	---> [ pp    --> old_gp ...]

A normal "geom sched create ad0" call would create a new geom node
on top of provider ad0/pp, and export a newly created provider
("ad0.sched.", accessible through pointer newpp).

  AFTER create  ---> [ newpp --> gp --> cp ] ---> [ pp    --> old_gp ... ]

On top of newpp, a whole tree is created automatically, so we can,
e.g., mount partitions on /dev/ad0.sched.s1d; those requests will go
through the scheduler, whereas partitions mounted on the pre-existing
device entries will not.
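
For example, with the names used above:

    # geom sched create ad0
    # mount /dev/ad0.sched.s1d /mnt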

With the transparent insert mechanism, the original provider "ad0"/pp
is hooked to the newly created geom, as follows:

  AFTER insert  ---> [ pp    --> gp --> cp ] ---> [ newpp --> old_gp ... ]

so anything that was previously using provider pp will now have
its requests routed through the scheduler node.

A removal ("geom sched destroy ad0.sched.") will restore the original
configuration.
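
Because of this, a mounted filesystem keeps its device name across a
whole insert/destroy cycle.  A hypothetical session (output abridged)
could look like:

    # mount | grep ad0
    /dev/ad0s1f on /mnt (ufs, local)
    # geom sched insert ad0
    # mount | grep ad0
    /dev/ad0s1f on /mnt (ufs, local)
    # geom sched destroy ad0.sched.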

# $FreeBSD$