I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).
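
For context, a minimal sketch of how an upper layer (a journaling
filesystem, say) might mark a write as a barrier.  It assumes a bio
that is already fully set up (device, sector, payload) and is an
illustration, not code lifted from any particular filesystem.

/* Ask for barrier semantics on this write. */
bio->bi_rw |= (1 << BIO_RW_BARRIER);

/* Submit it like any other write; the block layer does the rest. */
submit_bio(WRITE, bio);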

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, the block layer can just issue the barrier as an
ordered request, and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer; see the random notes section below.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, the block layer ensures that requests preceding
a barrier request finish before issuing the barrier request.  It also
defers requests following the barrier until the barrier request is
finished.  Older SCSI controllers/drives and SATA drives fall in this
category.

iii. Devices which have queue depth of 1.  This is a degenerate case
of ii.  Just keeping the issue order suffices.  Ancient SCSI
controllers/drives and IDE drives are in this category.

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when a power failure or some
other event abruptly stops the drive from operating and possibly makes
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases:

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  This kind of device can't
support I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to medium,
but the post-barrier flush can be avoided by using a FUA write on the
barrier itself.


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside the block layer proper.  All a low
level driver has to do is implement its prepare_flush_fn and call one
of the following two functions to indicate what barrier type it
supports and how to prepare flush requests.  Note that the term
'ordered' is used to indicate the whole sequence of performing barrier
requests, including draining and flushing.

typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);

int blk_queue_ordered(request_queue_t *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn,
		      unsigned gfp_mask);

int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
			     prepare_flush_fn *prepare_flush_fn,
			     unsigned gfp_mask);

The only difference between the two functions is whether or not the
caller is holding q->queue_lock on entry.  The latter expects the
caller to be holding the lock.

@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
			  flushes cache to physical medium when executed
@gfp_mask		: gfp_mask used when allocating data structures
			  for ordered processing

For example, the SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(request_queue_t *q, struct request *rq)
{
	/* Build a SCSI SYNCHRONIZE CACHE command in @rq. */
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->flags |= REQ_BLOCK_PC;	/* block-layer packet command */
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
}
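
As a second, hypothetical example, a driver for a device which has a
write-back cache and a flush command but no FUA might register its
ordered mode like the following sketch.  The mydrv_* names, opcode and
timeout constant are made up for illustration; only the
blk_queue_ordered() call and the mode name come from the interface
described above.

static void mydrv_prepare_flush(request_queue_t *q, struct request *rq)
{
	/* Turn @rq into this device's cache-flush command. */
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->flags |= REQ_BLOCK_PC;
	rq->timeout = MYDRV_FLUSH_TIMEOUT;	/* made-up constant */
	rq->cmd[0] = MYDRV_FLUSH_CACHE;		/* made-up opcode */
}

static int mydrv_init_queue(request_queue_t *q)
{
	/* Write-back cache + flush, no FUA: the block layer will flush
	 * before and after each barrier (QUEUE_ORDERED_DRAIN_FLUSH,
	 * described below). */
	return blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
				 mydrv_prepare_flush, GFP_KERNEL);
}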

The following seven ordered modes are supported.  The table below
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

	    write-back cache	ordered tag	flush		FUA
-----------------------------------------------------------------------
NONE		yes/no		N/A		no		N/A
DRAIN		no		no		N/A		N/A
DRAIN_FLUSH	yes		no		yes		no
DRAIN_FUA	yes		no		yes		yes
TAG		no		yes		N/A		N/A
TAG_FLUSH	yes		yes		yes		no
TAG_FUA		yes		yes		yes		yes


QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushes are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed.  By using FUA on the
	barrier request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushes are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed.  By using FUA on the barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier
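
To tie the table to code, here is a sketch of how a driver might pick
its ordered mode from the device's reported features.  The wce, flush
and fua flags and mydrv_prepare_flush() are illustrative stand-ins
(sd, for example, learns about the cache from the caching mode page);
the DRAIN variants assume the device isn't using ordered tags.

unsigned ordered;

if (!wce)			/* no write-back cache: order suffices */
	ordered = QUEUE_ORDERED_DRAIN;
else if (!flush)		/* cache but no flush: no barrier support */
	ordered = QUEUE_ORDERED_NONE;
else if (fua)			/* flush + FUA: skip the postflush */
	ordered = QUEUE_ORDERED_DRAIN_FUA;
else				/* flush only: flush both sides */
	ordered = QUEUE_ORDERED_DRAIN_FLUSH;

blk_queue_ordered(q, ordered, mydrv_prepare_flush, GFP_KERNEL);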


Random notes/caveats
--------------------

* The SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that the SCSI
midlayer request dispatch function is not atomic: it releases the
queue lock and switches to the SCSI host lock during issue, and it is
possible, and likely to happen from time to time, that requests change
their relative positions in the meantime.  Once this problem is
solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by the
block layer until the previous I/O barrier is complete.  This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to the low level *might* be helpful if they are very frequent.  Well,
this certainly is a non-issue.  I'm writing this just to make clear
that no two I/O barriers are ever passed to the low-level driver at
the same time.

* Completion order.  Requests in an ordered sequence are issued in
order but are not required to finish in order.  The barrier
implementation can handle out-of-order completion of an ordered
sequence.  IOW, the requests MUST be processed in order but the
hardware/software completion paths are allowed to reorder completion
notifications - e.g. the current SCSI midlayer doesn't preserve
completion order during error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after removing it from the request queue with blkdev_dequeue_request().
As a barrier sequence should be kept in order when requeued, the
generic elevator code takes care of putting requests in order around
the barrier.  See blk_ordered_req_seq() and ELEVATOR_INSERT_REQUEUE
handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence.  Currently, no
error checking is done against this.
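
For illustration, here is a minimal request_fn skeleton that dequeues
requests and requeues one on a temporary failure.  The mydrv_* names
are made up; blk_requeue_request() hands the request back to the
elevator, which re-inserts it with ELEVATOR_INSERT_REQUEUE and thereby
keeps an ordered sequence intact.

static void mydrv_request_fn(request_queue_t *q)
{
	struct request *rq;

	/* request_fn is entered with q->queue_lock held. */
	while ((rq = elv_next_request(q)) != NULL) {
		blkdev_dequeue_request(rq);
		if (!mydrv_issue(rq)) {		/* made-up issue helper */
			/* Device busy: put the request back in order. */
			blk_requeue_request(q, rq);
			break;
		}
	}
}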

* Error handling.  Currently, the block layer will report an error to
the upper layer if any of the requests in an ordered sequence fails.
Unfortunately, this doesn't seem to be enough.  Look at the following
request flow.  QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
					  still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid.  Consider the following sequence.


 i.	Requests [0] ~ [post] leave the request queue and enter the
	low-level driver.
 ii.	After a while, unfortunately, something goes wrong and the
	drive fails [2].  Note that any of [0], [1] and [3] could have
	completed by this time, but [pre] couldn't have been finished
	as the drive must process it in order and it failed before
	processing that command.
 iii.	Error handling kicks in, determines that the error is
	unrecoverable, fails [2], and resumes operation.
 iv.	[pre] [barrier] [post] get processed.
 v.	*BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also depend on the success of some of the preceding requests,
where only the upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request, and by
making lower level drivers resume operation after an error only when
the block layer tells them to do so.

As the probability of this happening is very low and the drive would
have to be faulty anyway, implementing the fix is probably overkill.
But, still, it's there.

* In previous drafts of the barrier implementation, there was a
fallback mechanism such that, if FUA or ordered TAG failed, a less
fancy ordered mode could be selected and the failed barrier request
retried automatically.  The rationale for this feature was that, as
FUA is pretty new in the ATA world and ordered tag was never widely
used, there could be devices which claim to support those features but
choke when actually given such requests.

This was removed for two reasons: 1. it's overkill, and 2. it's
impossible to implement properly when TAG ordering is used, as low
level drivers resume operation after an error automatically.  If it's
ever needed, adding it back and modifying low level drivers
accordingly shouldn't be difficult.