I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer.  See the random notes section below.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finish before issuing the barrier request.  Also,
it defers requests following the barrier until the barrier request is
finished.  Older SCSI controllers/drives and SATA drives fall in this
category.

iii. Devices which have queue depth of 1.  This is a degenerate case
of ii.  Just keeping issue order suffices.  Ancient SCSI
controllers/drives and IDE drives are in this category.

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other events abruptly stop the drive from operating and possibly make
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases.

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  This kind of device can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside block layer proper.  All a low
level driver has to do is implement its prepare_flush_fn and call the
following function to indicate what barrier type it supports and how
to prepare flush requests.  Note that the term 'ordered' is used to
indicate the whole sequence of performing barrier requests including
draining and flushing.

typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);

int blk_queue_ordered(struct request_queue *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn);

@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
			  flushes cache to physical medium when executed

For example, SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd_type = REQ_TYPE_BLOCK_PC;	/* raw SCSI command */
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;		/* flush the drive's cache */
	rq->cmd_len = 10;			/* 10-byte CDB */
}

The following seven ordered modes are supported.  The following table
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

	    write-back cache	ordered tag	flush		FUA
-----------------------------------------------------------------------
NONE		yes/no		N/A		no		N/A
DRAIN		no		no		N/A		N/A
DRAIN_FLUSH	yes		no		yes		no
DRAIN_FUA	yes		no		yes		yes
TAG		no		yes		N/A		N/A
TAG_FLUSH	yes		yes		yes		no
TAG_FUA		yes		yes		yes		yes


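The table above can also be read as a decision procedure.  The
following is a minimal user-space sketch of that reading, not kernel
code.  struct dev_caps, enum ordered_mode and pick_ordered_mode() are
hypothetical names made up for this illustration; a real driver passes
one of the QUEUE_ORDERED_* constants to blk_queue_ordered().  The
sketch assumes that flush and FUA are irrelevant without a write-back
cache, and that a write-back cache without a flush operation means
NONE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the QUEUE_ORDERED_* modes in the table. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Hypothetical capability description; not a kernel structure. */
struct dev_caps {
	bool write_back_cache;
	bool ordered_tag;
	bool flush;
	bool fua;
};

/* Map device features to an ordered mode, one table row per return. */
static enum ordered_mode pick_ordered_mode(const struct dev_caps *c)
{
	/* No write-back cache: ordering alone is enough. */
	if (!c->write_back_cache)
		return c->ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;

	/* Write-back cache without flush: barriers can't be done. */
	if (!c->flush)
		return ORDERED_NONE;

	/* Write-back cache with flush: FUA saves the post-barrier flush. */
	if (c->ordered_tag)
		return c->fua ? ORDERED_TAG_FUA : ORDERED_TAG_FLUSH;
	return c->fua ? ORDERED_DRAIN_FUA : ORDERED_DRAIN_FLUSH;
}
```

A driver that doesn't want barriers at all would simply use NONE
regardless of what the device can do; the sketch doesn't model that
choice.
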
QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushings are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed.  By using FUA on barrier
	request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushings are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed.  By using FUA on barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier


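The seven sequences above follow from a handful of independent
choices: drain or tag, preflush or not, postflush or FUA.  The
following user-space sketch rebuilds each "Sequence:" line from such
flags.  The ORD_* bits, the MODE_* combinations and
describe_sequence() are assumptions of this sketch and are not claimed
to match the kernel's actual constant values.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative flag bits; names and values are made up for this sketch. */
enum {
	ORD_DRAIN     = 1 << 0,	/* order by draining the queue */
	ORD_TAG       = 1 << 1,	/* order by ordered tag */
	ORD_PREFLUSH  = 1 << 2,	/* flush cache before the barrier */
	ORD_POSTFLUSH = 1 << 3,	/* flush cache after the barrier */
	ORD_FUA       = 1 << 4,	/* barrier itself is written with FUA */
};

/* The seven modes expressed as flag combinations (sketch only). */
enum {
	MODE_NONE        = 0,
	MODE_DRAIN       = ORD_DRAIN,
	MODE_DRAIN_FLUSH = ORD_DRAIN | ORD_PREFLUSH | ORD_POSTFLUSH,
	MODE_DRAIN_FUA   = ORD_DRAIN | ORD_PREFLUSH | ORD_FUA,
	MODE_TAG         = ORD_TAG,
	MODE_TAG_FLUSH   = ORD_TAG | ORD_PREFLUSH | ORD_POSTFLUSH,
	MODE_TAG_FUA     = ORD_TAG | ORD_PREFLUSH | ORD_FUA,
};

/* FUA modes skip the postflush because the barrier write is FUA. */
static int barrier_is_fua(unsigned mode)
{
	return (mode & ORD_FUA) != 0;
}

/* Write the step sequence for @mode into @buf (@len must be > 0). */
static void describe_sequence(unsigned mode, char *buf, size_t len)
{
	/* '->' for tag ordering, '=>' for drain ordering. */
	const char *sep = (mode & ORD_TAG) ? " -> " : " => ";
	const char *steps[4];
	size_t n = 0, i;

	buf[0] = '\0';
	if (!(mode & (ORD_DRAIN | ORD_TAG))) {
		snprintf(buf, len, "N/A");
		return;
	}
	if (mode & ORD_DRAIN)
		steps[n++] = "drain";
	if (mode & ORD_PREFLUSH)
		steps[n++] = "preflush";
	steps[n++] = "barrier";
	if (mode & ORD_POSTFLUSH)
		steps[n++] = "postflush";

	for (i = 0; i < n; i++) {
		if (i)
			strncat(buf, sep, len - strlen(buf) - 1);
		strncat(buf, steps[i], len - strlen(buf) - 1);
	}
}
```

For instance, describe_sequence(MODE_TAG_FUA, buf, sizeof(buf)) fills
buf with the same text as the QUEUE_ORDERED_TAG_FUA entry above.
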
Random notes/caveats
--------------------

* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that the SCSI
midlayer request dispatch function is not atomic.  It releases the
queue lock and switches to the SCSI host lock during issue, so
requests can, and over time likely will, change their relative
positions.  Once this problem is solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by
block layer until the previous I/O barrier is complete.  This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to low level *might* be helpful if they are very frequent.  Well, this
certainly is a non-issue.  I'm writing this just to make clear that no
two I/O barriers are ever passed to the low-level driver at once.

* Completion order.  Requests in an ordered sequence are issued in
order but are not required to finish in order.  Barrier implementation
can handle out-of-order completion of an ordered sequence.  IOW, the
requests MUST be processed in order but the hardware/software
completion paths are allowed to reorder completion notifications -
e.g. current SCSI midlayer doesn't preserve completion order during
error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request().  As barrier sequence should be kept in order
when requeued, generic elevator code takes care of putting requests in
order around barrier.  See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing latter requests in an ordered sequence.  Currently, no
error checking is done against this.

* Error handling.  Currently, block layer will report error to upper
layer if any of the requests in an ordered sequence fails.
Unfortunately, this doesn't seem to be enough.  Look at the following
request flow.  QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
					  still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid.  Consider the following sequence.

 i.	Requests [0] ~ [post] leave the request queue and enter the
	low-level driver.
 ii.	After a while, unfortunately, something goes wrong and the
	drive fails [2].  Note that any of [0], [1] and [3] could have
	completed by this time, but [pre] couldn't have been finished
	as the drive must process it in order and it failed before
	processing that command.
 iii.	Error handling kicks in and determines that the error is
	unrecoverable and fails [2], and resumes operation.
 iv.	[pre] [barrier] [post] get processed.
 v.	*BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after
block layer tells them to do so.

As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably overkill.  But, still,
it's there.

* In previous drafts of barrier implementation, there was a fallback
mechanism such that, if FUA or ordered TAG fails, a less fancy ordered
mode can be selected and the failed barrier request is retried
automatically.  The rationale for this feature was that, as FUA is
pretty new in the ATA world and ordered tag was never used widely,
there could be devices which report to support those features but
choke when actually given such requests.

 This was removed for two reasons: 1. it's overkill, and 2. it's
impossible to implement properly when TAG ordering is used, as low
level drivers resume after an error automatically.  If it's ever
needed, adding it back and modifying low level drivers accordingly
shouldn't be difficult.
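
As an illustration of the completion order note above, here is one way
out-of-order completion notifications can be tolerated: record each
finished step as a bit and treat the sequence as done only once every
step up to the end is marked.  This is a user-space sketch under that
assumption.  The SEQ_* names, struct ordered_state and both helpers
are hypothetical and are not presented as the block layer's actual
code.

```c
#include <assert.h>

/* Steps of an ordered sequence, one bit each, in issue order. */
enum {
	SEQ_DRAIN     = 1 << 0,
	SEQ_PREFLUSH  = 1 << 1,
	SEQ_BARRIER   = 1 << 2,
	SEQ_POSTFLUSH = 1 << 3,
	SEQ_DONE      = 1 << 4,	/* sentinel: nothing left to wait for */
};

/* Hypothetical per-queue state: bitmask of steps already completed. */
struct ordered_state {
	unsigned done;
};

/* Return the earliest step whose completion hasn't been seen yet. */
static unsigned current_step(const struct ordered_state *s)
{
	unsigned bit = SEQ_DRAIN;

	while (s->done & bit)
		bit <<= 1;
	return bit;
}

/*
 * Record completion of @steps, in whatever order notifications arrive.
 * Steps a mode doesn't use (e.g. postflush with FUA) can be passed in
 * up front.  Returns 1 once the whole sequence has completed.
 */
static int complete_step(struct ordered_state *s, unsigned steps)
{
	s->done |= steps;
	return current_step(s) == SEQ_DONE;
}
```

With this bookkeeping, a barrier notification that arrives before the
preflush notification is simply remembered, and the sequence still
finishes only after all four steps have been reported.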