$FreeBSD: head/sys/geom/notes 110710 2003-02-11 14:57:34Z phk $

For lack of a better place to put them, this file will contain
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------
Locking of bio_children and bio_inbed

bio_children is used by g_std_done() and g_clone_bio() to keep track
of children cloned off a request.  g_clone_bio() increments the
bio_children counter each time it is called, and g_std_done()
increments bio_inbed on every call; when the two counters are
equal, it calls g_io_deliver() on the parent bio.
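
In rough C, the handshake looks like this (a simplified sketch of
the pattern, not the verbatim code from geom_io.c; field copying
and some error handling are abbreviated):

	struct bio *
	g_clone_bio(struct bio *bp)
	{
		struct bio *bp2;

		bp2 = g_new_bio();
		if (bp2 != NULL) {
			bp2->bio_parent = bp;
			/* ... copy bio_cmd, bio_offset, bio_length ... */
			bp->bio_children++;	/* g_down thread only */
		}
		return (bp2);
	}

	void
	g_std_done(struct bio *bp)
	{
		struct bio *bp2;

		bp2 = bp->bio_parent;
		if (bp2->bio_error == 0)
			bp2->bio_error = bp->bio_error;
		bp2->bio_completed += bp->bio_completed;
		g_destroy_bio(bp);
		bp2->bio_inbed++;		/* g_up thread only */
		if (bp2->bio_inbed == bp2->bio_children)
			g_io_deliver(bp2, bp2->bio_error);
	}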

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread, and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but they should be
violated only with great care.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio into two
children.  It clones the bio and requests I/O on the first child.
This I/O operation completes before the second child is cloned,
and g_std_done() sees both counters equal to 1 and finishes off
the bio.

There is no race present in the common case, where the bio is split
into multiple parts in the class start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread, and
the class below will not get its start method run until we return
from our start method, and consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.
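
As a sketch (a hypothetical class start method; cp1 and cp2 stand
for the class's two consumers, and handling of failed clones is
omitted):

	static void
	g_example_start(struct bio *bp, struct g_consumer *cp1,
	    struct g_consumer *cp2)
	{
		struct bio *b1, *b2;

		b1 = g_clone_bio(bp);		/* bio_children becomes 1 */
		b2 = g_clone_bio(bp);		/* bio_children becomes 2 */
		/* ... set bio_offset/bio_length for each half ... */
		/*
		 * Only now is I/O requested, so no completion can see
		 * bio_children == bio_inbed before both are in bed.
		 */
		g_io_request(b1, cp1);
		g_io_request(b2, cp2);
	}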

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race, since the assumption is that
g_std_done() is only called in the g_up thread.

-----------------------------------------------------------------------
Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the numbers of transactions started and completed
are counted, and those only because GEOM internally uses the
difference between the two as a sanity check.

At level one we collect the full statistics.  Higher levels are
reserved for future use.  Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.
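
For example, full statistics collection can be selected from
userland with sysctl(8):

	sysctl kern.geom.collectstats=1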

The statistics collection falls into two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started, and g_io_deliver()
updating the consumer and provider statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of the
number of requests, number of bytes, number of ENOMEM errors, number
of other errors and duration of the request, for each of the three
major request types: BIO_READ, BIO_WRITE and BIO_DELETE.
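
In outline (the struct and field names here are made up for the
example; the real layout lives in the GEOM statistics code, and
bio_t0 is assumed to be the timestamp field in struct bio):

	struct ex_stat {
		uint64_t	nop[3];		/* requests per type */
		uint64_t	nbyte[3];	/* bytes per type */
		uint64_t	nmem[3];	/* ENOMEM errors per type */
		uint64_t	nerr[3];	/* other errors per type */
		struct bintime	duration[3];	/* accumulated duration */
	};

	static void
	ex_start(struct bio *bp)	/* the g_io_request() side */
	{

		binuptime(&bp->bio_t0);	/* stamp the request start */
	}

	static void
	ex_deliver(struct ex_stat *sp, struct bio *bp, int error)
	{
		struct bintime dt;
		int i;

		binuptime(&dt);
		bintime_sub(&dt, &bp->bio_t0);	/* request duration */
		i = bp->bio_cmd == BIO_READ ? 0 :
		    bp->bio_cmd == BIO_WRITE ? 1 : 2;	/* BIO_DELETE */
		sp->nop[i]++;
		sp->nbyte[i] += bp->bio_completed;
		if (error == ENOMEM)
			sp->nmem[i]++;
		else if (error != 0)
			sp->nerr[i]++;
		bintime_add(&sp->duration[i], &dt);
	}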

The second part tries to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.

In g_io_deliver() we calculate the delta-T from wentbusy and add this
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.
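
Schematically (again with illustrative names; nstart and nend stand
for the scheduled and completed counters):

	struct ex_busy {
		uint64_t	nstart;		/* requests scheduled */
		uint64_t	nend;		/* requests completed */
		struct bintime	wentbusy;	/* when we went busy */
		struct bintime	bt;		/* accumulated busy time */
	};

	static void
	ex_busy_request(struct ex_busy *sp)	/* g_down thread only */
	{

		if (sp->nstart == sp->nend)	/* we were idle until now */
			binuptime(&sp->wentbusy);
		sp->nstart++;
	}

	static void
	ex_busy_deliver(struct ex_busy *sp)	/* g_up thread only */
	{
		struct bintime now;

		binuptime(&now);
		bintime_sub(&now, &sp->wentbusy);	/* delta-T */
		bintime_add(&sp->bt, &now);
		binuptime(&sp->wentbusy);
		/*
		 * Increment nend last: only once nstart == nend again
		 * may g_io_request() write wentbusy.
		 */
		sp->nend++;
	}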

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which through a device
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.
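
A userland reader of the mmap(2)'ed pages can then snapshot a record
with a retry loop along these lines (hypothetical record layout; only
the "updating" flag matters here):

	struct ex_rec {
		volatile u_int	updating;	/* set while inconsistent */
		/* ... the statistics fields ... */
	};

	static void
	ex_snapshot(const struct ex_rec *p, struct ex_rec *snap)
	{

		do {
			while (p->updating)
				;		/* writer mid-update; wait */
			*snap = *p;
		} while (snap->updating);	/* flag captured mid-copy */
	}
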
-----------------------------------------------------------------------
maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set, there is no upper bound on the size of a request, and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the width of a stripe on a RAID-5 unit or
of one zone in GBDE.  The idea with this field is to hint to
clustering-type code not to trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
then it will have stripesize = 64k and stripeoffset = 0.

If we put an MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate the number of bytes until the
end of the stripe as:
	stripesize - (53 * sectorsize + stripeoffset) % stripesize

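For example, with 512-byte sectors the slice above has
stripeoffset = 63 * 512 = 32256, so for sector#53:

	65536 - (53 * 512 + 32256) % 65536
	= 65536 - 59392
	= 6144 bytes, i.e. 12 sectors to the stripe boundary.

(Check: sector#53 of the slice is absolute sector 63 + 53 = 116, and
116 + 12 = 128, whose byte offset 128 * 512 = 64k is exactly the first
stripe boundary.)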