History log of /linux-master/block/elevator.c
Revision Date Author Comments
# e5c0ca13 13-Aug-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: release scheduler resource when request completes

Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
submission path for post-flush requests").

We analysed the IO hang problem, found there are two postflush requests
waiting for each other.

The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
failed to blk_kick_flush() because of the second postflush request which
is inflight waiting in scheduler queue.

The second postflush waiting in scheduler queue can't be dispatched because
the first postflush hasn't released scheduler resource even though it has
completed by itself.

Fix it by releasing scheduler resource when the first postflush request
completed, so the second postflush can be dispatched and completed, then
make blk_kick_flush() succeed.

While at it, remove the check for e->ops.finish_request, as all
schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE()
at scheduler registration time, just like we do for insert_requests and
dispatch_request.

[1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/

Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com
Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev
[axboe: folded in incremental fix and added tags]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 20d09975 30-May-2023 Azeem Shaikh <azeemshaikh38@gmail.com>

block: Replace all non-returning strlcpy with strscpy

strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230530155608.272266-1-azeemshaikh38@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5f622417 07-Feb-2023 Thomas Weißschuh <linux@weissschuh.net>

block: make kobj_type structures constant

Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20230208-kobj_type-block-v1-1-0b3eafd7d983@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2bd85221 13-Nov-2022 Christoph Hellwig <hch@lst.de>

block: untangle request_queue refcounting from sysfs

The kobject embedded into the request_queue is used for the queue
directory in sysfs, but that is a child of the gendisks directory and is
intimately tied to it. Move this kobject to the gendisk and use a
refcount_t in the request_queue for the actual request_queue refcounting
that is completely unrelated to the device model.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221114042637.1009333-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8d283ee6 29-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: use bool as the return type of elv_iosched_allow_bio_merge

We have bool type now, update the old signature.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/0db0a0298758d60d0f4df8b7126ac6a381e5a5bb.1669736350.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6451ede 29-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: replace "len+name" with "name+len" in elv_iosched_show

The "pointer + offset" pattern is more resonable.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/d9beaee71b14f7b2a39ab0db6458dc0f7d961ceb.1669736350.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7a3b3660 29-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: always use 'e' when printing scheduler name

Printing e->elevator_name in all cases improves the readability, and
'e' and 'cur' are identical in this branch.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Link: https://lore.kernel.org/r/4bae180ffbac608ea0cf46ffa9739ce0973b60aa.1669736350.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5998249e 29-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: replace continue with else-if in elv_iosched_show

else-if is more readable than continue here.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Link: https://lore.kernel.org/r/77ac19ba556efd2c8639a6396eb4203c59bc13d6.1669736350.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7919d679 29-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: include 'none' for initial elv_iosched_show call

This makes the printing order of the io schedulers consistent, and removes
a redundant q->elevator check.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/bdd7083ed4f232e3285f39081e3c5f30b20b8da2.1669736350.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 42843547 22-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

elevator: remove an outdated comment in elevator_change

mq is no longer a special case.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/cbf47824fc726440371e74c867bf635ae1b671a3.1669126766.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f69b5e8f 22-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

elevator: update the document of elevator_match

elevator_match does not care about elevator_features any more. Remove
related descriptions from its document.

Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match")
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/a58424555202c07a9ccf7f60c3ad7e247da09e25.1669126766.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e0cca8bc 22-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

elevator: printk a warning if switching to a new io scheduler fails

printk a warning to indicate that the io scheduler has been set to none
if switching to a new io scheduler fails.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/d51ed0fb457db7a4f9cbb0dbce36d534e22be457.1669126766.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ac1171bd 22-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

elevator: update the document of elevator_switch

We no longer support falling back to the old io scheduler if switching to
the new one fails. Update the document to indicate that.

Fixes: a1ce35fa4985 ("block: remove dead elevator code")
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/94250961689ba7d2e67a7d9e7995a11166fedb31.1669126766.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b2560c4 06-Nov-2022 Yang Li <yang.lee@linux.alibaba.com>

block: Fix some kernel-doc comments

Remove the description of @required_features in elevator_match()
to clear the below warning:

block/elevator.c:103: warning: Excess function parameter 'required_features' description in 'elevator_match'

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2734
Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221107062255.2685-1-yang.lee@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 64b36075 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: split elevator_switch

Split an elevator_disable helper from elevator_switch for the case where
we want to switch to no scheduler at all. This includes removing the
pointless elevator_switch_mq helper and removing the switch to no
schedule logic from blk_mq_init_sched.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ffb86425 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: don't check for required features in elevator_match

Checking for the required features in the callers simplifies the code
quite a bit, so do that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-7-hch@lst.de
[axboe: adjust for dropping patch 1, use __elevator_find()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2eef17a2 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: simplify the check for the current elevator in elv_iosched_show

Just compare the pointers instead of using the string based
elevator_match.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 16095af2 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: cleanup the variable naming in elv_iosched_store

Use eq for the elevator_queue as done elsewhere. This frees e to be used
for the loop iterator instead of the odd __ prefix. In addition rename
elv to cur to make it more clear it is the currently selected elevator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aae2a643 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: exit elv_iosched_show early when I/O schedulers are not supported

If the tag_set has BLK_MQ_F_NO_SCHED flag set we will never show any
scheduler, so exit early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 81eaca44 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: cleanup elevator_get

Do the request_module and repeated lookup in the only caller that cares,
pick a saner name that explains where are actually doing a lookup and
use a sane calling conventions that passes the queue first.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ed40ee3 20-Oct-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: fix up elevator_type refcounting

The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
the calling of e->ops.init_sched. If it fails before the calling, it does
nothing to the reference to the io scheduler module. But if it fails after
the calling, it releases the reference by calling kobject_put(&eq->kobj).

As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.

The same problem also exists in io schedulers' init_sched implementations.

We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.

Instead, We make elevator_alloc get the io scheduler module references
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.

The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git

[hch: split out a few bits into separate patches, use a non-try
module_get in elevator_alloc]

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b54c2ad9 20-Oct-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: check for an unchanged elevator earlier in __elevator_change

No need to find the actual elevator_type struct for this comparism,
the name is all that is needed.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 58367c8a 20-Oct-2022 Christoph Hellwig <hch@lst.de>

block: sanitize the elevator name before passing it to __elevator_change

The stripped name should also be used for the none check. To do so
strip it in the caller and pass in the sanitized name. Drop the pointless
__ prefix in the function name while we're at it.

Based on a patch from Jinlong Chen <nickyc975@zju.edu.cn>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dd6f7f17 20-Oct-2022 Christoph Hellwig <hch@lst.de>

block: add proper helpers for elevator_type module refcount management

Make sure we have helpers for all relevant module refcount operations on
the elevator_type in elevator.h, and use them. Move the call to the get
helper in blk_mq_elv_switch_none a bit so that it is obvious with a less
verbose comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 181d0663 19-Oct-2022 Yu Kuai <yukuai3@huawei.com>

elevator: add new field flags in struct elevator_queue

There are only one flag to indicate that elevator is registered currently,
prepare to add a flag to disable wbt if default elevator is bfq.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6d9f4cf1 19-Oct-2022 Yu Kuai <yukuai3@huawei.com>

elevator: remove redundant code in elv_unregister_queue()

"elevator_queue *e" is already declared and initialized in the beginning
of elv_unregister_queue().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221019121518.3865235-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8237c01f 27-Sep-2022 Keith Busch <kbusch@kernel.org>

blk-mq: use quiesced elevator switch when reinitializing queues

The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.

nvme nvme0: resetting controller
nvme nvme0: 32/0/0 default/read/poll queues
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
Oops: 0000 [#1] SMP PTI
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:kyber_has_work+0x29/0x70

...

Call Trace:
__blk_mq_do_dispatch_sched+0x83/0x2b0
__blk_mq_sched_dispatch_requests+0x12e/0x170
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x2b/0x50
process_one_work+0x1ef/0x380
worker_thread+0x2d/0x3e0

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 28883074 07-Mar-2022 Christoph Hellwig <hch@lst.de>

block: do more work in elevator_exit

Move the calls to ioc_clear_queue and blk_mq_sched_free_rqs into
elevator_exit. Except for one call where we know we can't have io_cq
structures yet these always go together, and that extra call in an
error path is harmless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308055200.735835-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f5ec592d 24-Jan-2022 Eric Biggers <ebiggers@google.com>

block: simplify calling convention of elv_unregister_queue()

Make elv_unregister_queue() a no-op if q->elevator is NULL or is not
registered.

This simplifies the existing callers, as well as the future caller in
the error path of blk_register_queue().

Also don't bother checking whether q is NULL, since it never is.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220124215938.2769-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 672fdcf0 11-Feb-2022 Ming Lei <ming.lei@redhat.com>

block: partition include/linux/blk-cgroup.h

Partition include/linux/blk-cgroup.h into two parts: one is public part,
the other is block layer private part.

Suggested by Christoph Hellwig.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e92bc4cd 22-Jan-2022 Laibin Qiu <qiulaibin@huawei.com>

block/wbt: fix negative inflight counter when remove scsi device

Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.

The following is the scenario that triggered the problem.

T1 T2 T3
elevator_switch_mq
bfq_init_queue
wbt_disable_default <= Set
rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
scsi_remove_device
sd_remove
del_gendisk
blk_unregister_queue
elv_unregister_queue
wbt_enable_default
<= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.

Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.

Fixes: 76a8040817b4b ("blk-wbt: make sure throttle is enabled properly")

Remove oneline stale comment, and kill one oneshot local variable.

Signed-off-by: Ming Lei <ming.lei@rehdat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0c6cb3a2 23-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove the e argument to elevator_exit

All callers pass q->elevator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f46b81c5 23-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove elevator_exit

Open code elevator_exit in it's only caller, and rename __elevator_exit to
elevator_exit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 245a489e 17-Nov-2021 Ming Lei <ming.lei@redhat.com>

block: avoid to quiesce queue in elevator_init_mq

elevator_init_mq() is only called before adding disk, when there isn't
any FS I/O, only passthrough requests can be queued, so freezing queue
plus canceling dispatch work is enough to drain any dispatch activities,
then we can avoid synchronize_srcu() in blk_mq_quiesce_queue().

Long boot latency issue can be fixed in case of lots of disks added
during booting.

Fixes: 737eb78e82d5 ("block: Delay default elevator initialization")
Reported-by: yangerkun <yangerkun@huawei.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211117115502.1600950-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 079a2e3e 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Change shared sbitmap naming to shared tags

Now that shared sbitmap support really means shared tags, rename symbols
to match that.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2e9bc346 20-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move elevator.h to block/

Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 866663b7 28-Jul-2021 Ming Lei <ming.lei@redhat.com>

block: return ELEVATOR_DISCARD_MERGE if possible

When merging one bio to request, if they are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because both block core and related drivers(nvme, virtio-blk) doesn't
handle mixed discard io merge(traditional IO merge together with
discard merge) well.

Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb20947 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d1254a87 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove support for delayed queue registrations

Now that device mapper has been changed to register the disk once
it is fully ready all this code is unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 90b71980 05-Aug-2021 Bart Van Assche <bvanassche@acm.org>

blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flag

elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues across
multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.

This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fd2ef39c 23-Jun-2021 Jan Kara <jack@suse.cz>

blk: Fix lock inversion between ioc lock and bfqd lock

Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

bfqd -> ioc:
put_io_context+0x33/0x90 -> ioc->lock grabbed
blk_mq_free_request+0x51/0x140
blk_put_request+0xe/0x10
blk_attempt_req_merge+0x1d/0x30
elv_attempt_insert_merge+0x56/0xa0
blk_mq_sched_try_insert_merge+0x4b/0x60
bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
blk_mq_sched_insert_requests+0xd6/0x2b0
blk_mq_flush_plug_list+0x154/0x280
blk_finish_plug+0x40/0x60
ext4_writepages+0x696/0x1320
do_writepages+0x1c/0x80
__filemap_fdatawrite_range+0xd7/0x120
sync_file_range+0xac/0xf0

ioc->bfqd:
bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
exit_io_context+0x48/0x50
do_exit+0x7e9/0xdd0
do_group_exit+0x54/0xc0

To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
free the merged request but rather leave that upto the caller similarly
to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e42cfb1d 17-Jun-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: Remove unnecessary elevator operation checks

The insert_requests and dispatch_request elevator operations are
mandatory for the correct execution of an elevator, and all implemented
elevators (bfq, kyber and mq-deadline) implement them. As a result,
there is no need to check for these operations before calling them when
a queue has an elevator set. This simplifies the code in
__blk_mq_sched_dispatch_requests() and blk_mq_sched_insert_request().

To avoid out-of-tree elevators to crash the kernel in case of bad
implementation, add a check in elv_register() to verify that these
operations are implemented.

A small, probably not significant, IOPS improvement of 0.1% is observed
with this patch applied (4.117 MIOPS to 4.123 MIOPS, average of 20 fio
runs doing 4K random direct reads with psync and 32 jobs).

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210618015922.713999-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 26a9750a 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: improve the blk_mq_init_allocated_queue interface

Don't return the passed in request_queue but a normal error code, and
drop the elevator_init argument in favor of just calling elevator_init_mq
directly from dm-rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 580dca81 05-Apr-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: set default elevator as deadline in case of hctx shared tagset

Yanhui found that write performance is degraded a lot after applying
hctx shared tagset on one test machine with megaraid_sas. And turns out
it is caused by none scheduler which becomes default elevator caused by
hctx shared tagset patchset.

Given more scsi HBAs will apply hctx shared tagset, and the similar
performance exists for them too.

So keep previous behavior by still using default mq-deadline for queues
which apply hctx shared tagset, just like before.

Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Reported-by: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210406031933.767228-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f0c6ae09 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: fix comment and add lockdep assert

After commit b89f625e28d4 ("block: don't release queue's sysfs
lock during switching elevator"), whole elevator register and
unregister function are covered by sysfs_lock. So, remove wrong
comment and add lockdep assert.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 75e6c00f 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: use helper function to test queue register

We have defined common interface blk_queue_registered() to
test QUEUE_FLAG_REGISTERED. Just use it.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6251b754 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: remove redundant mq check

elv_support_iosched() will check queue_is_mq() for us. So, remove
the redundant check to clean code.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dd1c372d 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: invoke blk_mq_exit_sched no matter whether have .exit_sched

We will register debugfs for scheduler no matter whether it have
defined callback funciton .exit_sched. So, blk_mq_exit_sched()
is always needed to unregister debugfs. Also, q->elevator should
be set as NULL after exiting scheduler.

For now, since all register scheduler have defined .exit_sched,
it will not cause any actual problem. But It will be more reasonable
to do this change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b8f65e1 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: elevator: delete duplicated word and fix typos

Drop the repeated word "the".
Fix typos of "features" and "specified".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f8db3835 06-Nov-2019 Jan Kara <jack@suse.cz>

block: Warn if elevator= parameter is used

With transition to blk-mq, the elevator= kernel argument was removed as
it makes less and less sense with the current variety of devices. Since
this may surprise some users and there are advices on the Internet that
still suggest to use it, let's at least warn if the parameter is used.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7a7c5e71 08-Oct-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Fix elv_support_iosched()

A BIO based request queue does not have a tag_set, which prevent testing
for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not
require an elevator. This leads to an incorrect initialization of a
default elevator in some cases such as BIO based null_blk
(queue_mode == BIO) with zoned mode enabled as the default elevator in
this case is mq-deadline instead of "none".

Fix this by testing for a NULL queue mq_ops field which indicates that
the queue is BIO based and should not have an elevator.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b89f625e 23-Sep-2019 Ming Lei <ming.lei@redhat.com>

block: don't release queue's sysfs lock during switching elevator

cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.

Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.

The biggest issue is that commit cecf5d87ff20 assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.

Fixes the issue by not releasing queue's sysfs lock during switching
elevator.

Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a2614255 06-Sep-2019 Jens Axboe <axboe@kernel.dk>

block: fix elevator_get_by_features()

The lookup logic is broken - 'e' will never be NULL, even if the
list is empty. Maintain lookup hit in a separate variable instead.

Fixes: a0958ba7fcdc ("block: Improve default elevator selection")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 737eb78e 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Delay default elevator initialization

When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
case, a call to elevator_init_mq() is added to __device_add_disk(),
resulting in the delayed initialization of the queue elevator
after the device driver finished probing the device information. This
effectively allows elevator_init_mq() access to more information
about the device.
2) elevator_init = true preserves the current behavior of initializing
the elevator directly from blk_mq_init_allocated_queue(). This case
is used for the special request based DM devices where the device
gendisk is created before the queue initialization and device
information (e.g. queue limits) is already known when the queue
initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a0958ba7 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Improve default elevator selection

For block devices that do not specify required features, preserve the
current default elevator selection (mq-deadline for single queue
devices, none for multi-queue devices). However, for devices specifying
required features (e.g. zoned block devices ELEVATOR_F_ZBD_SEQ_WRITE
feature), select the first available elevator providing the required
features.

In all cases, default to "none" if no elevator is available or if the
initialization of the default elevator fails.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 68c43f13 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Introduce elevator features

Introduce the definition of elevator features through the
elevator_features flags in the elevator_type structure. Each flag can
represent a feature supported by an elevator. The first feature defined
by this patch is support for zoned block device sequential write
constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
by the mq-deadline elevator using zone write locking.

Other possible features are IO priorities, write hints, latency targets
or single-LUN dual-actuator disks (for which the elevator could maintain
one LBA ordered list per actuator).

The required_elevator_features field is also added to the request_queue
structure to allow a device driver to specify elevator feature flags
that an elevator must support for the correct operation of the device
(e.g. device drivers for zoned block devices can have the
ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
The helper function blk_queue_required_elevator_features() is
defined for setting this new field.

With these two new fields in place, the elevator functions
elevator_match() and elevator_find() are modified to allow a user to set
only an elevator with a set of features that satisfies the device
required features. Elevators not matching the device requirements are
not shown in the device sysfs queue/scheduler file to prevent their use.

The "none" elevator can always be selected as before.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 954b4a5c 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Change elevator_init_mq() to always succeed

If the default elevator chosen is mq-deadline, elevator_init_mq() may
return an error if mq-deadline initialization fails, leading to
blk_mq_init_allocated_queue() returning an error, which in turn will
cause the block device initialization to fail and the device not being
exposed.

Instead of taking such extreme measure, handle mq-deadline
initialization failures in the same manner as when mq-deadline is not
available (no module to load), that is, default to the "none" scheduler.
With this change, elevator_init_mq() return type can be changed to void.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 61db437d 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Cleanup elevator_init_mq() use

Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
calling elevator_init_mq() to make sure that the queue supports IO
scheduling, use the elevator.c function elv_support_iosched() in
elevator_init_mq(). This does not introduce any functional change but
ensure that elevator_init_mq() does the right thing based on the queue
settings.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 85c0a037 27-Aug-2019 Marcos Paulo de Souza <marcos.souza.org@gmail.com>

block: elevator.c: Remove now unused elevator= argument

Since the inclusion of blk-mq, elevator argument was not being
considered anymore, and it's utility died long with the legacy IO path,
now removed too.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>

Fold with doc removal patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cecf5d87 27-Aug-2019 Ming Lei <ming.lei@redhat.com>

block: split .sysfs_lock into two locks

The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.

However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This way causes AB-BA lock because the kernfs built-in lock of
'kn-count' is required inside kobject_del() too, see the lockdep warning[1].

On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing REGISTERED flag prevents storing to 'queue/scheduler'
from being happened. Also sysfs write(store) is exclusive, so no
necessary to hold the lock for elv_unregister_queue() when it is
called in switching elevator path.

So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.

sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid to hold .sysfs_lock during removing/adding kobjects.

[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);

*** DEADLOCK ***

2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 58c898ba 27-Aug-2019 Ming Lei <ming.lei@redhat.com>

block: add helper for checking if queue is registered

There are 4 users which check if queue is registered, so add one helper
to check it.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c48dac13 27-Aug-2019 Ming Lei <ming.lei@redhat.com>

block: don't hold q->sysfs_lock in elevator_init_mq

The original comment says:

q->sysfs_lock must be held to provide mutual exclusion between
elevator_switch() and here.

Which is simply wrong. elevator_init_mq() is only called from
blk_mq_init_allocated_queue, which is always called before the request
queue is registered via blk_register_queue(), for dm-rq or normal rq
based driver. However, queue's kobject is only exposed and added to sysfs
in blk_register_queue(). So there isn't such race between elevator_switch()
and elevator_init_mq().

So avoid to hold q->sysfs_lock in elevator_init_mq().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c3e22192 04-Jun-2019 Ming Lei <ming.lei@redhat.com>

block: free sched's request pool in blk_cleanup_queue

In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.

However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.

Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().

Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().

Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3dcf60bc 30-Apr-2019 Christoph Hellwig <hch@lst.de>

block: add SPDX tags to block layer files missing licensing information

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d0b0a81a 07-Apr-2019 Hisao Tanabe <xtanabe@gmail.com>

block: remove unused variable 'def'

The 'def' local variable became unused after commit f382fb0bcef4 ("block: remove
legacy IO schedulers"), let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hisao Tanabe <xtanabe@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fbd72127 11-Feb-2019 Aleksei Zakharov <zakharov.a.g@yandex.ru>

block: avoid setting none scheduler if it's already none

There's no reason to freeze queue and remove scheduler
if there's no scheduler already.

Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 344e9ffc 15-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: add queue_is_mq() helper

Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 98c98cb7 14-Nov-2018 Colin Ian King <colin.king@canonical.com>

block: clean up dead code that is now redundant

The boolean next_sorted is set to false and is never changed, hence
the code that checks if it is true is dead code and can now be
removed. This dead code occurred from a previous commit that cleaned
up the elevator and removed the setting of next_sorted to true.

Detected by CoverityScan, CID#1475401 ("'Constant' variable guards
dead code")

Fixes: a1ce35fa4985 ("block: remove dead elevator code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f9cd4bfe 01-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: get rid of MQ scheduler ops union

This is a remnant of when we had ops for both SQ and MQ
schedulers. Now it's just MQ, so get rid of the union.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a1ce35fa 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

block: remove dead elevator code

This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.

Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.

Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.

Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f382fb0b 12-Oct-2018 Jens Axboe <axboe@kernel.dk>

block: remove legacy IO schedulers

Retain the deadline documentation, as that carries over to mq-deadline
as well.

Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 854f31cc 26-Sep-2018 Damien Le Moal <damien.lemoal@wdc.com>

block: fix deadline elevator drain for zoned block devices

When the deadline scheduler is used with a zoned block device, writes
to a zone will be dispatched one at a time. This causes the warning
message:

deadline: forced dispatching is broken (nr_sorted=X), please report this

to be displayed when switching to another elevator with the legacy I/O
path while write requests to a zone are being retained in the scheduler
queue.

Prevent this message from being displayed when executing
elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
loop until all writes are dispatched and completed, resulting in the
desired elevator queue drain without extensive modifications to the
deadline code itself to handle forced-dispatch calls.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: 8dc8146f9c92 ("deadline-iosched: Introduce zone locking support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0d25bd07 26-Sep-2018 Bart Van Assche <bvanassche@acm.org>

block: Schedule runtime resume earlier

Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 154b00d5 26-Sep-2018 Bart Van Assche <bvanassche@acm.org>

block: Split blk_pm_add_request() and blk_pm_put_request()

Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bca6b067 26-Sep-2018 Bart Van Assche <bvanassche@acm.org>

block: Move power management code into a new source file

Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 62d2a194 27-Aug-2018 Chengguang Xu <cgxu519@gmx.com>

block: remove unnecessary condition check

kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d48ece20 21-Aug-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: init hctx sched after update ctx and hctx mapping

Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 131d08e1 31-May-2018 Christoph Hellwig <hch@lst.de>

block: split the blk-mq case from elevator_init

There is almost no shared logic, which leads to a very confusing code
flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# acddf3b3 31-May-2018 Christoph Hellwig <hch@lst.de>

block: move sysfs_lock into elevator_init

Both callers take just around so function call, so move it in.
Also remove the now pointless blk_mq_sched_init wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ddb72532 31-May-2018 Christoph Hellwig <hch@lst.de>

block: remove the always unused name argument to elevator_init

Reported-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a8a275c9 31-May-2018 Christoph Hellwig <hch@lst.de>

block: unexport elevator_init/exit

These are only used by the block core. Also move the declarations to
block/blk.h.

Reported-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cbf62af3 31-May-2018 Christoph Hellwig <hch@lst.de>

block: move initialization of elevator-related fields to blk_alloc_queue_node

No point in doing this in elevator_init.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 14a23498 17-Jan-2018 Bart Van Assche <bvanassche@acm.org>

block: Document scheduler modification locking requirements

This patch does not change any functionality.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 83d016ac 17-Jan-2018 Bart Van Assche <bvanassche@acm.org>

block: Unexport elv_register_queue() and elv_unregister_queue()

These two functions are only called from inside the block layer so
unexport them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 24f5a90f 06-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: quiesce queue during switching io sched and updating nr_requests

Dispatch may still be in-progress after queue is frozen, so we have to
quiesce queue before switching IO scheduler and updating nr_requests.

Also when switching io schedulers, blk_mq_run_hw_queue() may still be
called somewhere(such as from nvme_reset_work()), and io scheduler's
per-hctx data may not be setup yet, so cause oops even inside
blk_mq_hctx_has_pending(), such as it can be run just between:

ret = e->ops.mq.init_sched(q, e);
AND
ret = e->ops.mq.init_hctx(hctx, i)

inside blk_mq_init_sched().

This reverts commit 7a148c2fcff8330(block: don't call blk_mq_quiesce_queue()
after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending
won't be called if queue is quiesced.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 7a148c2fcff83309(block: don't call blk_mq_quiesce_queue() after queue is frozen)
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ac0d9a8 25-Oct-2017 Jens Axboe <axboe@kernel.dk>

elevator: allow name aliases

Since we now lookup elevator types with the appropriate multiqueue
capability, allow schedulers to register with an alias alongside
the real name. This is in preparation for allowing 'mq-deadline'
to register an alias of 'deadline' as well.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2527d997 25-Oct-2017 Jens Axboe <axboe@kernel.dk>

elevator: lookup mq vs non-mq elevators

If an IO scheduler is selected via elevator= and it doesn't match
the driver in question wrt blk-mq support, then we fail to boot.

The elevator= parameter is deprecated and only supported for
non-mq devices. Augment the elevator lookup API so that we
pass in if we're looking for an mq capable scheduler or not,
so that we only ever return a valid type for the queue in
question.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5fdee212 05-Oct-2017 Christoph Hellwig <hch@lst.de>

block: remove QUEUE_FLAG_STACKABLE

We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e9a823fb 28-Aug-2017 David Jeffery <djeffery@redhat.com>

block: fix warning when I/O elevator is changed as request_queue is being removed

There is a race between changing I/O elevator and request_queue removal
which can trigger the warning in kobject_add_internal. A program can
use sysfs to request a change of elevator at the same time another task
is unregistering the request_queue the elevator would be attached to.
The elevator's kobject will then attempt to be connected to the
request_queue in the object tree when the request_queue has just been
removed from sysfs. This triggers the warning in kobject_add_internal
as the request_queue no longer has a sysfs directory:

kobject_add_internal failed for iosched (error: -2 parent: queue)
------------[ cut here ]------------
WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0

To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
changing the elevator and use the request_queue's sysfs_lock to
serialize between clearing the flag and the elevator testing the flag.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e29387eb 21-Jun-2017 Bart Van Assche <bvanassche@acm.org>

block: Add fallthrough markers to switch statements

This patch suppresses gcc 7 warnings about falling through in switch
statements when building with W=1. From the gcc documentation: The
-Wimplicit-fallthrough=3 warning is enabled by -Wextra. See also
https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Warning-Options.html.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 340ff321 10-May-2017 Jens Axboe <axboe@fb.com>

elevator: remove redundant warnings on IO scheduler switch

We warn twice for switching to a scheduler, if that switch fails.
As we also report the failure in the return value to the
sysfs write, remove the dmesg induced failures.

Keep the failure print for warning to switch to the kconfig
selected IO scheduler, as we can't report errors for that in
any other way.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 7a148c2f 01-May-2017 Ming Lei <ming.lei@redhat.com>

block: don't call blk_mq_quiesce_queue() after queue is frozen

After queue is frozen, no request in this queue can be in use at all, so
there can't be any .queue_rq() running on this queue. It isn't
necessary to call blk_mq_quiesce_queue() any more, so remove it in both
elevator_switch_mq() and blk_mq_update_nr_requests().

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Fixed up the description a bit.

Signed-off-by: Jens Axboe <axboe@fb.com>


# c0332694 01-May-2017 Bart Van Assche <bvanassche@acm.org>

block: Remove elevator_change()

Since commit 84253394927c ("remove the mg_disk driver") removed the
only caller of elevator_change(), also remove the elevator_change()
function itself.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3a5088c8 15-Apr-2017 Ming Lei <ming.lei@redhat.com>

block: respect BLK_MQ_F_NO_SCHED

If one driver claims that it doesn't support io scheduler via
BLK_MQ_F_NO_SCHED, we should not allow to change and show the
availabe io schedulers.

This patch adds check to enhance this behaviour.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8330cdb0 19-Apr-2017 Jan Kara <jack@suse.cz>

block: Make writeback throttling defaults consistent for SQ devices

When CFQ is used as an elevator, it disables writeback throttling
because they don't play well together. Later when a different elevator
is chosen for the device, writeback throttling doesn't get enabled
again as it should. Make sure CFQ enables writeback throttling (if it
should be enabled by default) when we switch from it to another IO
scheduler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 54d5329d 07-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq-sched: fix crash in switch error path

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6917ff0b 05-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq-sched: refactor scheduler initialization

Preparation cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7b36a718 02-Mar-2017 Jens Axboe <axboe@fb.com>

block: don't call ioc_exit_icq() with the queue lock held for blk-mq

For legacy scheduling, we always call ioc_exit_icq() with both the
ioc and queue lock held. This poses a problem for blk-mq with
scheduling, since the queue lock isn't what we use in the scheduler.
And since we don't need the queue lock held for ioc exit there,
don't grab it and leave any extra locking up to the blk-mq scheduler.

Reported-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b86dd815 22-Feb-2017 Jens Axboe <axboe@fb.com>

block: get rid of blk-mq default scheduler choice Kconfig entries

The wording in the entries were poor and not understandable
by even deities. Kill the selection for default block scheduler,
and impose a policy with sane defaults.

Architected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d1a987f3 14-Feb-2017 Jens Axboe <axboe@fb.com>

elevator: fix loading wrong elevator type for blk-mq devices

The old elevator= boot parameter blindly attempts to load the
same scheduler for mq and !mq devices, leading to a crash if
we specify the wrong one.

Ensure that we only apply this boot parameter to old !mq devices.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 34fe7c05 08-Feb-2017 Christoph Hellwig <hch@lst.de>

block: enumify ELEVATOR_*_MERGE

Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a1ae0f74 01-Feb-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq-debug: Avoid that sparse complains about req_flags_t usage

Avoid that sparse reports the following complaints:

block/elevator.c:541:29: warning: incorrect type in assignment (different base types)
block/elevator.c:541:29: expected bool [unsigned] [usertype] next_sorted
block/elevator.c:541:29: got restricted req_flags_t

block/blk-mq-debugfs.c:92:54: warning: cast from restricted req_flags_t

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 57292b58 31-Jan-2017 Christoph Hellwig <hch@lst.de>

block: introduce blk_rq_is_passthrough

This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 610d886c 19-Jan-2017 Jens Axboe <axboe@fb.com>

elevator: fix unnecessary put of elevator in failure case

We already checked that e is NULL, so no point in calling
elevator_put() to free it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: dc877dbd088f ("blk-mq-sched: add framework for MQ capable IO schedulers")
Signed-off-by: Jens Axboe <axboe@fb.com>


# d3484991 13-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: allow setting of default IO scheduler

Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# bd166ef1 17-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: add framework for MQ capable IO schedulers

This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# c51ca6cf 10-Dec-2016 Jens Axboe <axboe@fb.com>

block: move existing elevator ops to union

Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 70b3ea05 07-Dec-2016 Jens Axboe <axboe@fb.com>

elevator: make the rqhash helpers exported

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>


# ef295ecf 28-Oct-2016 Christoph Hellwig <hch@lst.de>

block: better op and flags encoding

Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e8064021 20-Oct-2016 Christoph Hellwig <hch@lst.de>

block: split out request-only flags into a new namespace

A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7afafc8a 16-Aug-2016 Adrian Hunter <adrian.hunter@intel.com>

block: Fix secure erase

Commit 288dab8a35a0 ("block: add a separate operation type for secure
erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
all the places REQ_OP_DISCARD was being used to mean either. Fix those.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 72ef799b 07-Jul-2016 Tahsin Erdogan <tahsin@google.com>

block: do not merge requests without consulting with io scheduler

Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.

In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weigt cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.

Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.

cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -e

IO_ROOT=/mnt-cgroup/io

mkdir -p $IO_ROOT

if ! mount | grep -qw $IO_ROOT; then
mount -t cgroup none -oblkio $IO_ROOT
fi

cd $IO_ROOT

for i in g1 g2 g3; do
if [ -d $i ]; then
rmdir $i
fi
done

mkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weight

RUNTIME=10

(echo $BASHPID > g1/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=0k &> /dev/null)&

(echo $BASHPID > g2/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=64k &> /dev/null)&

(echo $BASHPID > g3/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=256k &> /dev/null)&

sleep $((RUNTIME+1))

for i in g1 g2 g3; do
echo ---- $i ----
cat $i/blkio.time
done

EOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686

After applying the patch:

# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ba568ea0 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block: prepare elevator to use REQ_OPs.

This patch converts the elevator code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7460d389 20-Oct-2015 Ming Lei <ming.lei@canonical.com>

block: check bio_mergeable() early before merging

After bio splitting is introduced, one bio can be splitted
and it is marked as NOMERGE because it is too fat to be merged,
so check bio_mergeable() earlier to avoid to try to merge it
unnecessarily.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0bb97947 10-Jun-2015 Jens Axboe <axboe@fb.com>

cfq-iosched: fix the setting of IOPS mode on SSDs

A previous commit wanted to make CFQ default to IOPS mode on
non-rotational storage, however it did so when the queue was
initialized and the non-rotational flag is only set later on
in the probe.

Add an elevator hook that gets called off the add_disk() path,
at that point we know that feature probing has finished, and
we can reliably check for the various flags that drivers can
set.

Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
Tested-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# eea8f41c 22-May-2015 Tejun Heo <tj@kernel.org>

blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h

cgroup aware writeback support will require exposing some of blkcg
details. In preprataion, move block/blk-cgroup.h to
include/linux/blk-cgroup.h. This patch is pure file move.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8406a4d5 23-Apr-2015 Chao Yu <chao@kernel.org>

elevator: fix double release of elevator module

Our issue is descripted in below call path:
->elevator_init
->elevator_init_fn
->{cfq,deadline,noop}_init_queue
->elevator_alloc
->kzalloc_node
fail to call kzalloc_node and then put module in elevator_alloc;
fail to call elevator_init_fn and then put module again in elevator_init.

Remove elevator_put invoking in error path of elevator_alloc to avoid
double release issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 47fafbc7 03-Dec-2014 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM

After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks
depending on CONFIG_PM_RUNTIME may now be changed to depend on
CONFIG_PM.

Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core.

Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d32f6b57 23-Oct-2014 Sudip Mukherjee <sudipm.mukherjee@gmail.com>

block: fix wrong error return in elevator_init()

while compiling integer err was showing as a set but unused variable.
elevator_init_fn can be either cfq_init_queue or deadline_init_queue
or noop_init_queue.
all three of these functions are returning -ENOMEM if they fail to
allocate the queue.
so we should actually be returning the error code rather than
returning 0 always.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e567bf71 22-Jun-2014 Jens Axboe <axboe@fb.com>

Revert "block: add __init to elv_register"

This reverts commit b5097e956a4d2919ee248d6481e4204c5568ed5c.

The original commit is buggy, we do use the registration functions
at runtime, for instance when loading IO schedulers through sysfs.

Reported-by: Damien Wyart <damien.wyart@gmail.com>


# 2940474a 11-Jun-2014 Christoph Hellwig <hch@lst.de>

block: remove elv_abort_queue and blk_abort_flushes

elv_abort_queue has no callers, and blk_abort_flushes is only called by
elv_abort_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b5097e95 10-Jun-2014 Fabian Frederick <fabf@skynet.be>

block: add __init to elv_register

elv_register is only called by elevator init functions:

__init cfq_init
__init deadline_init
__init noop_init

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 360f92c2 09-Apr-2014 Jens Axboe <axboe@fb.com>

block: fix regression with block enabled tagging

Martin reported that his test system would not boot with
current git, it oopsed with this:

BUG: unable to handle kernel paging request at ffff88046c6c9e80
IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
Workqueue: events_unbound async_run_entry_fn
task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
RIP: 0010:[<ffffffff812971e0>] [<ffffffff812971e0>]
blk_queue_start_tag+0x90/0x150
RSP: 0018:ffff880273d03a58 EFLAGS: 00010092
RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
FS: 0000000000000000(0000) GS:ffff880277b00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
Stack:
ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
Call Trace:
[<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
[<ffffffff81293167>] __blk_run_queue+0x37/0x50
[<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
[<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
[<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
[<ffffffff813bba35>] scsi_execute+0xe5/0x180
[<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
[<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
[<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
[<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
[<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
[<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
[<ffffffff8106a1c5>] process_one_work+0x175/0x410
[<ffffffff8106b373>] worker_thread+0x123/0x400
[<ffffffff8106b250>] ? manage_workers+0x160/0x160
[<ffffffff8107104e>] kthread+0xce/0xf0
[<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
[<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
RIP [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
RSP <ffff880273d03a58>
CR2: ffff88046c6c9e80

Martin bisected and found this to be the problem patch;

commit 6d113398dcf4dfcd9787a4ead738b186f7b7ff0f
Author: Jan Kara <jack@suse.cz>
Date: Mon Feb 24 16:39:54 2014 +0100

block: Stop abusing rq->csd.list in blk-softirq

and the problem was immediately apparent. The patch states that
it is safe to reuse queuelist at completion time, since it is
no longer used. However, that is not true if a device is using
block enabled tagging. If that is the case, then the queuelist
is reused to keep track of busy tags. If a device also ended
up using softirq completions, we'd reuse ->queuelist for the
IPI handling while block tagging was still using it. Boom.

Fix this by adding a new ipi_list list head, and share the
memory used with the request hash table. The hash table is
never used after the request is moved to the dispatch list,
which happens long before any potential completion of the
request. Add a new request bit for this, so we don't have
cases that check rq->hash while it could potentially have
been reused for the IPI completion.

Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4f024f37 11-Oct-2013 Kent Overstreet <kmo@daterainc.com>

block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6


# 7c8a3679 15-Oct-2013 Tomoki Sekiyama <tomoki.sekiyama@hds.com>

elevator: acquire q->sysfs_lock in elevator_change()

Add locking of q->sysfs_lock into elevator_change() (an exported function)
to ensure it is held to protect q->elevator from elevator_init(), even if
elevator_change() is called from non-sysfs paths.
sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
version, as the lock is already taken by elv_iosched_store().

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# eb1c160b 15-Oct-2013 Tomoki Sekiyama <tomoki.sekiyama@hds.com>

elevator: Fix a race in elevator switching and md device initialization

The soft lockup below happens at the boot time of the system using dm
multipath and the udev rules to switch scheduler.

[ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
[ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
...
[ 356.127001] Call Trace:
[ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
[ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
[ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
[ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
[ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
[ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
[ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
[ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
[ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
[ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
[ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
[ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b

This is caused by a race between md device initialization by multipathd and
shell script to switch the scheduler using sysfs.

- multipathd:
SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
-> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
q->elevator = elevator_alloc(q, e); // not yet initialized

- sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
elevator_switch (in the call trace above)
struct elevator_queue *old = q->elevator;
q->elevator = elevator_alloc(q, new_e);
elevator_exit(old); // lockup! (*)

- multipathd: (cont.)
err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified

(*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
while timer->base == NULL. In this case, as timer will never initialized,
it results in lockup.

This patch introduces acquisition of q->sysfs_lock around elevator_init()
into blk_init_allocated_queue(), to provide mutual exclusion between
initialization of the q->scheduler and switching of the scheduler.

This should fix this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=902012

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c1b511eb 29-Aug-2013 Joe Perches <joe@perches.com>

block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)

Use the helper function instead of __GFP_ZERO.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d50235b7 03-Jul-2013 Jianpeng Ma <majianpeng@gmail.com>

elevator: Fix a race in elevator switching

There's a race between elevator switching and normal io operation.
Because the allocation of struct elevator_queue and struct elevator_data
don't in a atomic operation.So there are have chance to use NULL
->elevator_data.
For example:
Thread A: Thread B
blk_queu_bio elevator_switch
spin_lock_irq(q->queue_block) elevator_alloc
elv_merge elevator_init_fn

Because call elevator_alloc, it can't hold queue_lock and the
->elevator_data is NULL.So at the same time, threadA call elv_merge and
nedd some info of elevator_data.So the crash happened.

Move the elevator_alloc into func elevator_init_fn, it make the
operations in a atomic operation.

Using the follow method can easy reproduce this bug
1:dd if=/dev/sdb of=/dev/null
2:while true;do echo noop > scheduler;echo deadline > scheduler;done

The test method also use this method.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c8158819 22-Mar-2013 Lin Ming <ming.m.lin@intel.com>

block: implement runtime pm strategy

When a request is added:
If device is suspended or is suspending and the request is not a
PM request, resume the device.

When the last request finishes:
Call pm_runtime_mark_last_busy().

When pick a request:
If device is resuming/suspending, then only PM request is allowed
to go.

The idea and API is designed by Alan Stern and described here:
http://marc.info/?l=linux-scsi&m=133727953625963&w=2

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b67bfe0d 27-Feb-2013 Sasha Levin <sasha.levin@oracle.com>

hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 21c3c5d2 22-Jan-2013 Tejun Heo <tj@kernel.org>

block: don't request module during elevator init

Block layer allows selecting an elevator which is built as a module to
be selected as system default via kernel param "elevator=". This is
achieved by automatically invoking request_module() whenever a new
block device is initialized and the elevator is not available.

This led to an interesting deadlock problem involving async and module
init. Block device probing running off an async job invokes
request_module(). While the module is being loaded, it performs
async_synchronize_full() which ends up waiting for the async job which
is already waiting for request_module() to finish, leading to
deadlock.

Invoking request_module() from deep in block device init path is
already nasty in itself. It seems best to avoid these situations from
the beginning by moving on-demand module loading out of block init
path.

The previous patch made sure that the default elevator module is
loaded early during boot if available. This patch removes on-demand
loading of the default elevator from elevator init path. As the
module would have been loaded during boot, userland-visible behavior
difference should be minimal.

For more details, please refer to the following thread.

http://thread.gmane.org/gmane.linux.kernel/1420814

v2: The bool parameter was named @request_module which conflicted with
request_module(). This built okay w/ CONFIG_MODULES because
request_module() was defined as a macro. W/o CONFIG_MODULES, it
causes build breakage. Rename the parameter to @try_loading.
Reported by Fengguang.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>


# bb813f4c 18-Jan-2013 Tejun Heo <tj@kernel.org>

init, block: try to load default elevator module early during boot

This patch adds default module loading and uses it to load the default
block elevator. During boot, it's called right after initramfs or
initrd is made available and right before control is passed to
userland. This ensures that as long as the modules are available in
the usual places in initramfs, initrd or the root filesystem, the
default modules are loaded as soon as possible.

This will replace the on-demand elevator module loading from elevator
init path.

v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang We <fengguang.wu@intel.com>


# 242d98f0 17-Dec-2012 Sasha Levin <sasha.levin@oracle.com>

block,elevator: use new hashtable implementation

Switch elevator to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the elevator.

This also removes the dymanic allocation of the hash table. The size of the table is
constant so there's no point in paying the price of an extra dereference when accessing
it.

This patch depends on d9b482c ("hashtable: introduce a small and naive
hashtable") which was merged in v3.6.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bee0393c 09-Nov-2012 Shaohua Li <shli@fusionio.com>

block: recursive merge requests

In a workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1, a+3,....
When the requests are flushed to queue, a and a+1 are merged to (a, a+1), a+2
and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged.

If we do recursive merge for such interleave access, some workloads throughput
get improvement. A recent worload I'm checking on is swap, below change
boostes the throughput around 5% ~ 10%.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e2a60da7 17-Sep-2012 Martin K. Petersen <martin.petersen@oracle.com>

block: Clean up special command handling logic

Remove special-casing of non-rw fs style requests (discard). The nomerge
flags are consolidated in blk_types.h, and rq_mergeable() and
bio_mergeable() have been modified to use them.

bio_is_rw() is used in place of bio_has_data() a few places. This is
done to to distinguish true reads and writes from other fs type requests
that carry a payload (e.g. write same).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a2b1693b 13-Apr-2012 Tejun Heo <tj@kernel.org>

blkcg: implement per-queue policy activation

All blkcg policies were assumed to be enabled on all request_queues.
Due to various implementation obstacles, during the recent blkcg core
updates, this was temporarily implemented as shooting down all !root
blkgs on elevator switch and policy [de]registration combined with
half-broken in-place root blkg updates. In addition to being buggy
and racy, this meant losing all blkcg configurations across those
events.

Now that blkcg is cleaned up enough, this patch replaces the temporary
implementation with proper per-queue policy activation. Each blkcg
policy should call the new blkcg_[de]activate_policy() to enable and
disable the policy on a specific queue. blkcg_activate_policy()
allocates and installs policy data for the policy for all existing
blkgs. blkcg_deactivate_policy() does the reverse. If a policy is
not enabled for a given queue, blkg printing / config functions skip
the respective blkg for the queue.

blkcg_activate_policy() also takes care of root blkg creation, and
cfq_init_queue() and blk_throtl_init() are updated accordingly.

This replaces blkcg_bypass_{start|end}() and update_root_blkg_pd()
unnecessary. Dropped.

v2: cfq_init_queue() was returning uninitialized @ret on root_group
alloc failure if !CONFIG_CFQ_GROUP_IOSCHED. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 852c788f 05-Mar-2012 Tejun Heo <tj@kernel.org>

block: implement bio_associate_current()

IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current. Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.

For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup. This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.

Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs). As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.

This patch implements bio_associate_current() which associates the
specified bio with %current. The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio. bio
release puts the associated ioc and blkcg.

It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.

elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.

This doesn't update block cgroup policies yet. Further patches will
implement the support.

-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
rq_ioc() to fix build breakage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e8989fae 05-Mar-2012 Tejun Heo <tj@kernel.org>

blkcg: unify blkg's for blkcg policies

Currently, blkg is per cgroup-queue-policy combination. This is
unnatural and leads to various convolutions in partially used
duplicate fields in blkg, config / stat access, and general management
of blkgs.

This patch make blkg's per cgroup-queue and let them serve all
policies. blkgs are now created and destroyed by blkcg core proper.
This will allow further consolidation of common management logic into
blkcg core and API with better defined semantics and layering.

As a transitional step to untangle blkg management, elvswitch and
policy [de]registration, all blkgs except the root blkg are being shot
down during elvswitch and bypass. This patch adds blkg_root_update()
to update root blkg in place on policy change. This is hacky and racy
but should be good enough as interim step until we get locking
simplified and switch over to proper in-place update for all blkgs.

-v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
comment wasn't updated according to the function change. Fixed.
Both pointed out by Vivek.

-v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
all policies. This freed root pd during elvswitch before the
last queue finished exiting and led to oops. Directly invoke
update_root_blkg_pd() only on BLKIO_POLICY_PROP from
cfq_exit_queue(). This also is closer to what will be done with
proper in-place blkg update. Reported by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 03aa264a 05-Mar-2012 Tejun Heo <tj@kernel.org>

blkcg: let blkcg core manage per-queue blkg list and counter

With the previous patch to move blkg list heads and counters to
request_queue and blkg, logic to manage them in both policies are
almost identical and can be moved to blkcg core.

This patch moves blkg link logic into blkg_lookup_create(), implements
common blkg unlink code in blkg_destroy(), and updates
blkg_destory_all() so that it's policy specific and can skip root
group. The updated blkg_destroy_all() is now used to both clear queue
for bypassing and elv switching, and release all blkgs on q exit.

This patch introduces a race window where policy [de]registration may
race against queue blkg clearing. This can only be a problem on cfq
unload and shouldn't be a real problem in practice (and we have many
other places where this race already exists). Future patches will
remove these unlikely races.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 72e06c25 05-Mar-2012 Tejun Heo <tj@kernel.org>

blkcg: shoot down blkio_groups on elevator switch

Elevator switch may involve changes to blkcg policies. Implement
shoot down of blkio_groups.

Combined with the previous bypass updates, the end goal is updating
blkcg core such that it can ensure that blkcg's being affected become
quiescent and don't have any per-blkg data hanging around before
commencing any policy updates. Until queues are made aware of the
policies that applies to them, as an interim step, all per-policy blkg
data will be shot down.

* blk-throtl doesn't need this change as it can't be disabled for a
live queue; however, update it anyway as the scheduled blkg
unification requires this behavior change. This means that
blk-throtl configuration will be unnecessarily lost over elevator
switch. This oddity will be removed after blkcg learns to associate
individual policies with request_queues.

* blk-throtl dosen't shoot down root_tg. This is to ease transition.
Unified blkg will always have persistent root group and not shooting
down root_tg for now eases transition to that point by avoiding
having to update td->root_tg and is safe as blk-throtl can never be
disabled

-v2: Vivek pointed out that group list is not guaranteed to be empty
on return from clear function if it raced cgroup removal and
lost. Fix it by waiting a bit and retrying. This kludge will
soon be removed once locking is updated such that blkg is never
in limbo state between blkcg and request_queue locks.

blk-throtl no longer shoots down root_tg to avoid breaking
td->root_tg.

Also, Nest queue_lock inside blkio_list_lock not the other way
around to avoid introduce possible deadlock via blkcg lock.

-v3: blkcg_clear_queue() repositioned and renamed to
blkg_destroy_all() to increase consistency with later changes.
cfq_clear_queue() updated to check q->elevator before
dereferencing it to avoid NULL dereference on not fully
initialized queues (used by later change).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d732580b 05-Mar-2012 Tejun Heo <tj@kernel.org>

block: implement blk_queue_bypass_start/end()

Rename and extend elv_queisce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
state.

This will be further extended and used for blkio_group management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b2fab5ac 05-Mar-2012 Tejun Heo <tj@kernel.org>

elevator: make elevator_init_fn() return 0/-errno

elevator_ops->elevator_init_fn() has a weird return value. It returns
a void * which the caller should assign to q->elevator->elevator_data
and %NULL return denotes init failure.

Update such that it returns integer 0/-errno and sets elevator_data
directly as necessary.

This makes the interface more conventional and eases further cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a5bafdc 05-Mar-2012 Tejun Heo <tj@kernel.org>

elevator: clear auxiliary data earlier during elevator switch

Elevator switch tries hard to keep as much as context until new
elevator is ready so that it can revert to the original state if
initializing the new elevator fails for some reason. Unfortunately,
with more auxiliary contexts to manage, this makes elevator init and
exit paths too complex and fragile.

This patch makes elevator_switch() unregister the current elevator and
flush icq's before start initializing the new one. As we still keep
the old elevator itself, the only difference is that we lose icq's on
rare occassions of switching failure, which isn't critical at all.

Note that this makes explicit elevator parameter to
elevator_init_queue() and __elv_register_queue() unnecessary as they
always can use the current elevator.

This patch enables block cgroup cleanups.

-v2: blk_add_trace_msg() prints elevator name from @new_e instead of
@e->type as the local variable no longer exists. This caused
build failure on CONFIG_BLK_DEV_IO_TRACE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 050c8ea8 08-Feb-2012 Tejun Heo <tj@kernel.org>

block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions

blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5d381efb 15-Jan-2012 Jens Axboe <axboe@kernel.dk>

Revert "block: recursive merge requests"

This reverts commit 274193224cdabd687d804a26e0150bb20f2dd52c.

We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27419322 16-Dec-2011 Shaohua Li <shaohua.li@intel.com>

block: recursive merge requests

In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
a+3,.... When the requests are flushed to queue, a and a+1 are merged
to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
aren't merged.
With recursive merge below, the workload throughput gets improved 20%
and context switch drops 60%.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7e5a8794 13-Dec-2011 Tejun Heo <tj@kernel.org>

block, cfq: move io_cq exit/release to blk-ioc.c

With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc. The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.

Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.

Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released. New field
io_cq->__rcu_icq_cache is added for this purpose. As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3d3c2379 13-Dec-2011 Tejun Heo <tj@kernel.org>

block, cfq: move icq cache management to block core

Let elevators set ->icq_size and ->icq_align in elevator_type and
elv_register() and elv_unregister() respectively create and destroy
kmem_cache for icq.

* elv_register() now can return failure. All callers updated.

* icq caches are automatically named "ELVNAME_io_cq".

* cfq_slab_setup/kill() are collapsed into cfq_init/exit().

* While at it, minor indentation change for iosched_cfq.elevator_name
for consistency.

This will help moving icq management to block core. This doesn't
introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a612fddf 13-Dec-2011 Tejun Heo <tj@kernel.org>

block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq

Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.

* Move cfqd->icq_list to request_queue->icq_list

* Make request explicitly point to icq instead of through elevator
private data. ->elevator_private[3] is replaced with sub struct elv
which contains icq pointer and priv[2]. cfq is updated accordingly.

* Meaningless clearing of ->elevator_private[0] removed from
elv_set_request(). At that point in code, the field was guaranteed
to be %NULL anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 22f746e2 13-Dec-2011 Tejun Heo <tj@kernel.org>

block: remove elevator_queue->ops

elevator_queue->ops points to the same ops struct ->elevator_type.ops
is pointing to. The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.

Relocate elevator_type->list which used only during module init/exit
to the end of the structure, rename elevator_queue->elevator_type to
->type, and replace elevator_queue->ops with elevator_queue->type.ops.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f8fc877d 13-Dec-2011 Tejun Heo <tj@kernel.org>

block: reorder elevator switch sequence

Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator. However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence. It can just do "register,
detach, attach".

* elevator_init_queue() is updated to set ->elevator_data directly and
return 0 / -errno. This allows elevator_exit() on an unattached
elevator.

* __elv_unregister_queue() which was necessary to unregister
unattached q is removed in favor of __elv_register_queue() which can
register unattached q.

* elevator_attach() becomes a single assignment and obscures more then
it helps. Dropped.

This will help cleaning up io_context handling across elevator switch.

This patch doesn't introduce visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b9a19208 13-Dec-2011 Tejun Heo <tj@kernel.org>

block, cfq: remove delayed unlink

Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary. Kill them and remove now unused elevator_ops->trim().

This also leaves call_for_each_cic() without any user. Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c9a929dd 19-Oct-2011 Tejun Heo <tj@kernel.org>

block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown

request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.

This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.

With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.

sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:

Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b

This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.

Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.

The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immmediately failed. The rest of teardown
happens on release.

This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.

* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.

* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.

* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.

* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e3c78ca5 19-Oct-2011 Tejun Heo <tj@kernel.org>

block: reorganize queue draining

Reorganize queue draining related code in preparation of queue exit
changes.

* Factor out actual draining from elv_quiesce_start() to
blk_drain_queue().

* Make elv_quiesce_start/end() responsible for their own locking.

* Replace open-coded ELVSWITCH clearing in elevator_switch() with
elv_quiesce_end().

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 484fc254 07-Sep-2011 Wang Sheng-Hui <shhuiw@gmail.com>

elevator: use ELV_NAME_MAX instead of magic number 16 for chosen_elevator

We have ELV_NAME_MAX defined to 16, and hence we should use it
instead of the magic nubmer 16 for elevator's name string.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 796d5116 02-Jun-2011 Jeff Moyer <jmoyer@redhat.com>

iosched: prevent aliased requests from starving other I/O

Hi, Jens,

If you recall, I posted an RFC patch for this back in July of last year:
http://lkml.org/lkml/2010/7/13/279

The basic problem is that a process can issue a never-ending stream of
async direct I/Os to the same sector on a device, thus starving out
other I/O in the system (due to the way the alias handling works in both
cfq and deadline). The solution I proposed back then was to start
dispatching from the fifo after a certain number of aliases had been
dispatched. Vivek asked why we had to treat aliases differently at all,
and I never had a good answer. So, I put together a simple patch which
allows aliases to be added to the rb tree (it adds them to the right,
though that doesn't matter as the order isn't guaranteed anyway). I
think this is the preferred solution, as it doesn't break up time slices
in CFQ or batches in deadline. I've tested it, and it does solve the
starvation issue. Let me know what you think.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 771949d0 20-May-2011 Jens Axboe <jaxboe@fusionio.com>

block: get rid of on-stack plugging debug checks

We don't need them anymore, so kill:

- REQ_ON_PLUG checks in various places
- !rq_mergeable() check in plug merging

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 490b94be 05-May-2011 Kees Cook <keescook@chromium.org>

iosched: remove redundant sprintf

After the anticipatory scheduler was dropped, there was no need to
special-case the request_module string. As such, drop the redundant
sprintf and stack variable.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 3aa72873 21-Apr-2011 Jens Axboe <jaxboe@fusionio.com>

elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case too

The sort insert is the one that goes to the IO scheduler. With
the SORT_MERGE addition, we could bypass IO scheduler setup
but still ask the IO scheduler to insert the request. This would
cause an oops on switching IO schedulers through the sysfs
interface, unless the disk just happened to be idle while it
occured.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 24ecfbe2 18-Apr-2011 Christoph Hellwig <hch@infradead.org>

block: add blk_run_queue_async

Instead of overloading __blk_run_queue to force an offload to kblockd
add a new blk_run_queue_async helper to do it explicitly. I've kept
the blk_queue_stopped check for now, but I suspect it's not needed
as the check we do when the workqueue items runs should be enough.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# b710a480 30-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: get rid of elv_insert() interface

Merge it with __elv_add_request(), it's pretty pointless to
have a function with only two callers. The main interface
is elv_add_request()/__elv_add_request().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 5e84ea3a 21-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: attempt to merge with existing requests on plug flush

One of the disadvantages of on-stack plugging is that we potentially
lose out on merging since all pending IO isn't always visible to
everybody. When we flush the on-stack plugs, right now we don't do
any checks to see if potential merge candidates could be utilized.

Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
It works just ELEVATOR_INSERT_SORT, but first checks whether we can
merge with an existing request before doing the insertion (if we fail
merging).

This fixes a regression with multiple processes issuing IO that
can be merged.

Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
an accounting bug.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7eaceacc 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 73c10101 08-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: initial patch for on-stack per-task plugging

This patch adds support for creating a queuing context outside
of the queue itself. This enables us to batch up pieces of IO
before grabbing the block device queue lock and submitting them to
the IO scheduler.

The context is created on the stack of the process and assigned in
the task structure, so that we can auto-unplug it if we hit a schedule
event.

The current queue plugging happens implicitly if IO is submitted to
an empty device, yet callers have to remember to unplug that IO when
they are going to wait for it. This is an ugly API and has caused bugs
in the past. Additionally, it requires hacks in the vm (->sync_page()
callback) to handle that logic. By switching to an explicit plugging
scheme we make the API a lot nicer and can get rid of the ->sync_page()
hack in the vm.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 1654e741 02-Mar-2011 Tejun Heo <tj@kernel.org>

block: add @force_kblockd to __blk_run_queue()

__blk_run_queue() automatically either calls q->request_fn() directly
or schedules kblockd depending on whether the function is recursed.
blk-flush implementation needs to be able to explicitly choose
kblockd. Add @force_kblockd.

All the current users are converted to specify %false for the
parameter and this patch doesn't introduce any behavior change.

stable: This is prerequisite for fixing ide oops caused by the new
blk-flush implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# c186794d 11-Feb-2011 Mike Snitzer <snitzer@redhat.com>

block: share request flush fields with elevator_private

Flush requests are never put on the IO scheduler. Convert request
structure's elevator_private* into an array and have the flush fields
share a union with it.

Reclaim the space lost in 'struct request' by moving 'completion_data'
back in the union with 'rb_node'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# ae1b1539 24-Jan-2011 Tejun Heo <tj@kernel.org>

block: reimplement FLUSH/FUA to support merge

The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining. As such, sequencing is done
queue-wide one flush request after another. However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.

This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually. The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue. Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.

This allows arbitrary merging of different type of flushes. How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.

This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.

* As flush requests are never put on the IO scheduler, request fields
used for flush share space with rq->rb_node. rq->completion_data is
moved out of the union. This increases the request size by one
pointer.

As rq->elevator_private* are used only by the iosched too, it is
possible to reduce the request size further. However, to do that,
we need to modify request allocation path such that iosched data is
not allocated for flush requests.

* FLUSH/FUA processing happens on insertion now instead of dispatch.

- Comments updated as per Vivek and Mike.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 02e031cb 10-Nov-2010 Christoph Hellwig <hch@lst.de>

block: remove REQ_HARDBARRIER

REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
at this point is:

- various checks inside the block layer.
- sanity checks in bio based drivers.
- now unused bio_empty_barrier helper.
- Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
but Xen really needs to sort out it's barrier situaton.
- setting of ordered tags in uas - dead code copied from old scsi
drivers.
- scsi different retry for barriers - it's dead and should have been
removed when flushes were converted to FS requests.
- blktrace handling of barriers - removed. Someone who knows blktrace
better should add support for REQ_FLUSH and REQ_FUA, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 430c62fb 07-Oct-2010 Jens Axboe <jaxboe@fusionio.com>

elevator: fix oops on early call to elevator_change()

2.6.36 introduces an API for drivers to switch the IO scheduler
instead of manually calling the elevator exit and init functions.
This API was added since q->elevator must be cleared in between
those two calls. And since we already have this functionality
directly from use by the sysfs interface to switch schedulers
online, it was prudent to reuse it internally too.

But this API needs the queue to be in a fully initialized state
before it is called, or it will attempt to unregister elevator
kobjects before they have been added. This results in an oops
like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
PGD 47ddfc067 PUD 47c6a1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
CPU 2
Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb

Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
RIP: 0010:[<ffffffff8116f15e>] [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
RSP: 0018:ffff88027da25d08 EFLAGS: 00010246
RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
Stack:
ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
<0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
<0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
Call Trace:
[<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
[<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
[<ffffffff8123feb9>] kobject_add+0x69/0x90
[<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
[<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
[<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
[<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
[<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
[<ffffffff81224204>] elv_register_queue+0x34/0xa0
[<ffffffff81224aad>] elevator_change+0xfd/0x250
[<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
[<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
[<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
[<ffffffff810001de>] do_one_initcall+0x3e/0x170
[<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
[<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
RIP [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
RSP <ffff88027da25d08>
CR2: 0000000000000051
---[ end trace a6541d3bf07945df ]---

Fix this by adding a registered bit to the elevator queue, which is
set when the sysfs kobjects have been registered.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 28e7d184 03-Sep-2010 Tejun Heo <tj@kernel.org>

block: drop barrier ordering by queue draining

Filesystems will take all the responsibilities for ordering requests
around commit writes and will only indicate how the commit writes
themselves should be handled by block layers. This patch drops
barrier ordering by queue draining from block layer. Ordering by
draining implementation was somewhat invasive to request handling.
List of notable changes follow.

* Each queue has 1 bit color which is flipped on each barrier issue.
This is used to track whether a given request is issued before the
current barrier or not. REQ_ORDERED_COLOR flag and coloring
implementation in __elv_add_request() are removed.

* Requests which shouldn't be processed yet for draining were stalled
by returning -EAGAIN from blk_do_ordered() according to the test
result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
This logic is removed.

* Draining completion logic in elv_completed_request() removed.

* All barrier sequence requests were queued to request queue and then
trckled to lower layer according to progress and thus maintaining
request orders during requeue was necessary. This is replaced by
queueing the next request in the barrier sequence only after the
current one is complete from blk_ordered_complete_seq(), which
removes the need for multiple proxy requests in struct request_queue
and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
elv_insert().

* As barriers no longer have ordering constraints, there's no need to
dump the whole elevator onto the dispatch queue on each barrier.
Insert barriers at the front instead.

* If other barrier requests come to the front of the dispatch queue
while one is already in progress, they are stored in
q->pending_barriers and restored to dispatch queue one-by-one after
each barrier completion from blk_ordered_complete_seq().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 5dd531a0 23-Aug-2010 Jens Axboe <jaxboe@fusionio.com>

block: add function call to switch the IO scheduler from a driver

Currently drivers must do an elevator_exit() + elevator_init()
to switch IO schedulers. There are a few problems with this:

- Since commit 1abec4fdbb142e3ccb6ce99832fae42129134a96,
elevator_init() requires a zeroed out q->elevator
pointer. The two existing in-kernel users don't do that.

- It will only work at initialization time, since using the
above two-staged construct does not properly quisce the queue.

So add elevator_change() which takes care of this, and convert
the elv_iosched_store() sysfs interface to use this helper as well.

Reported-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reported-by: Kevin Vigor <kevin@vigor.nu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 8d57a98c 11-Aug-2010 Adrian Hunter <adrian.hunter@nokia.com>

block: add secure discard

Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b6d91da 07-Aug-2010 Christoph Hellwig <hch@lst.de>

block: unify flags for struct bio and struct request

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 33659ebb 07-Aug-2010 Christoph Hellwig <hch@lst.de>

block: remove wrappers for request type/flags

Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests. This allows much easier grepping for different request
types instead of unwinding through macros.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 1abec4fd 25-May-2010 Mike Snitzer <snitzer@redhat.com>

block: make blk_init_free_list and elevator_init idempotent

blk_init_allocated_queue_node may fail and the caller _could_ retry.
Accommodate the unlikely event that blk_init_allocated_queue_node is
called on an already initialized (possibly partially) request_queue.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# e36f724b 24-May-2010 Mike Snitzer <snitzer@redhat.com>

block: Adjust elv_iosched_show to return "none" for bio-based DM

Bio-based DM doesn't use an elevator (queue is !blk_queue_stackable()).

Longer-term DM will not allocate an elevator for bio-based DM. But even
then there will be small potential for an elevator to be allocated for
a request-based DM table only to have a bio-based table be loaded in the
end.

Displaying "none" for bio-based DM will help avoid user confusion.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 01effb0d 11-May-2010 Mike Snitzer <snitzer@redhat.com>

block: allow initialization of previously allocated request_queue

blk_init_queue() allocates the request_queue structure and then
initializes it as needed (request_fn, elevator, etc).

Split initialization out to blk_init_allocated_queue_node.
Introduce blk_init_allocated_queue wrapper function to model existing
blk_init_queue and blk_init_queue_node interfaces.

Export elv_register_queue to allow a newly added elevator to be
registered with sysfs. Export elv_unregister_queue for symmetry.

These changes allow DM to initialize a device's request_queue with more
precision. In particular, DM no longer unconditionally initializes a
full request_queue (elevator et al). It only does so for a
request-based DM device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 812d4026 08-Apr-2010 Divyesh Shah <dpshah@google.com>

blkio: Add io_merged stat

This includes both the number of bios merged into requests belonging to this
cgroup as well as the number of requests merged together.
In the past, we've observed different merging behavior across upstream kernels,
some by design some actual bugs. This stat helps a lot in debugging such
problems when applications report decreased throughput with a new kernel
version.

This needed adding an extra elevator function to capture bios being merged as I
did not want to pollute elevator code with blkiocg knowledge and hence needed
the accounting invocation to come from CFQ.

Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# a506aedc 02-Apr-2010 wzt.wzt@gmail.com <wzt.wzt@gmail.com>

Block: Fix block/elevator.c elevator_get() off-by-one error

elevator_get() not check the name length, if the name length > sizeof(elv),
elv will miss the '\0'. And elv buffer will be replace "-iosched" as something
like aaaaaaaaa, then call request_module() can load an not trust module.

Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 52cf25d0 18-Jan-2010 Emese Revfy <re.emese@gmail.com>

Driver core: Constify struct sysfs_ops in struct kobj_type

Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime

* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime

* potentially better optimized code as the compiler
can assume that the const data cannot be changed

* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 488991e2 29-Jan-2010 Alan D. Brunelle <Alan.Brunelle@hp.com>

block: Added in stricter no merge semantics for block I/O

Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
merges at all are to be attempted (not even the simple one-hit cache).

The following table illustrates the additional benefit - 5 minute runs of
a random I/O load were applied to a dozen devices on a 16-way x86_64 system.

nomerges Throughput %System Improvement (tput / %sys)
-------- ------------ ----------- -------------------------
0 12.45 MB/sec 0.669365609
1 12.50 MB/sec 0.641519199 0.40% / 2.71%
2 12.52 MB/sec 0.639849750 0.56% / 2.96%

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 8c279598 09-Oct-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

elv_iosched_store(): fix strstrip() misuse

elv_iosched_store() ignore the return value of strstrip(). It makes small
inconsistent behavior.

This patch fixes it.

<before>
====================================
# cd /sys/block/{blockdev}/queue

case1:
# echo "anticipatory" > scheduler
# cat scheduler
noop [anticipatory] deadline cfq

case2:
# echo "anticipatory " > scheduler
# cat scheduler
noop [anticipatory] deadline cfq

case3:
# echo " anticipatory" > scheduler
bash: echo: write error: Invalid argument

<after>
====================================
# cd /sys/block/{blockdev}/queue

case1:
# echo "anticipatory" > scheduler
# cat scheduler
noop [anticipatory] deadline cfq

case2:
# echo "anticipatory " > scheduler
# cat scheduler
noop [anticipatory] deadline cfq

case3:
# echo " anticipatory" > scheduler
noop [anticipatory] deadline cfq

Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 492af635 03-Oct-2009 Jens Axboe <jens.axboe@oracle.com>

block: remove the anticipatory IO scheduler

AS is mostly a subset of CFQ, so there's little point in still
providing this separate IO scheduler. Hopefully at some point we
can get down to one single IO scheduler again, at least this brings
us closer by having only one intelligent IO scheduler.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 1f98a13f 11-Sep-2009 Jens Axboe <jens.axboe@oracle.com>

bio: first step in sanitizing the bio->bi_rw flag testing

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# da6c5c72 11-Sep-2009 Tejun Heo <tj@kernel.org>

scsi,block: update SCSI to handle mixed merge failures

Update scsi_io_completion() such that it only fails requests till the
next error boundary and retry the leftover. This enables block layer
to merge requests with different failfast settings and still behave
correctly on errors. Allow merge of requests of different failfast
settings.

As SCSI is currently the only subsystem which follows failfast status,
there's no need to worry about other block drivers for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0a09f431 16-Jul-2009 Tejun Heo <tj@kernel.org>

block: fix failfast merge testing in elv_rq_merge_ok()

Commit ab0fd1debe730ec9998678a0c53caefbd121ed10 tries to prevent merge
of requests with different failfast settings. In elv_rq_merge_ok(),
it compares new bio's failfast flags against the merge target
request's. However, the flag testing accessors for bio and blk don't
return boolean but the tested bit value directly and FAILFAST on bio
and blk don't match, so directly comparing them with == results in
false negative unnecessary preventing merge of readahead requests.

This patch convert the results to boolean by negating them before
comparison.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jeff Garzik <jeff@garzik.org>


# ab0fd1de 02-Jul-2009 Tejun Heo <tj@kernel.org>

block: don't merge requests of different failfast settings

Block layer used to merge requests and bios with different failfast
settings. This caused regular IOs to fail prematurely when they were
merged into failfast requests for readahead.

Niel Lambrechts could trigger the problem semi-reliably on ext4 when
resuming from STR. ext4 uses readahead when reading inodes and
combined with the deterministic extra SATA PHY exception cycle during
resume on the specific configuration, non-readahead inode read would
fail causing ext4 errors. Please read the following thread for
details.

http://lkml.org/lkml/2009/5/23/21

This patch makes block layer reject merging if the failfast settings
don't match. This is correct but likely to lower IO performance by
preventing regular IOs from mingling into surrounding readahead
requests. Changes to allow such mixed merges and handle errors
correctly will be added later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@carl.(none)>


# 55782138 08-Jun-2009 Li Zefan <lizf@cn.fujitsu.com>

tracing/events: convert block trace points to TRACE_EVENT()

TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...

Cons:

- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

This is mainly because we can't get the deivce from a request queue.
But this may change in the future.

- A packet command is converted to a string in TP_assign, not TP_print.
While blktrace do the convertion just before output.

Since pc requests should be rather rare, this is not a big issue.

- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.

The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]

getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]

bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]

ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 53c663ce 02-Jun-2009 Kiyoshi Ueda <k-ueda@ct.jp.nec.com>

block: fix a possible oops on elv_abort_queue()

I found one more mis-conversion to the 'request is always dequeued
when completing' model in elv_abort_queue() during code inspection.
Although I haven't hit any problem caused by this mis-conversion yet
and just done compile/boot test, please apply if you have no problem.

Request must be dequeued when it completes.
However, elv_abort_queue() completes requests without dequeueing.
This will cause oops in the __blk_end_request_all().
This patch fixes the oops.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# cd43e26f 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com>

block: Expose stacked device queues in sysfs

Currently stacking devices do not have a queue directory in sysfs.
However, many of the I/O characteristics like sector size, maximum
request size, etc. are queue properties.

This patch enables the queue directory for MD/DM devices. The elevator
code has been modified to deal with queues that do not have an I/O
scheduler.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0a7ae2ff 20-May-2009 Jens Axboe <jens.axboe@oracle.com>

block: change the tag sync vs async restriction logic

Make them fully share the tag space, but disallow async requests using
the last any two slots.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 83096ebf 07-May-2009 Tejun Heo <tj@kernel.org>

block: convert to pos and nr_sectors accessors

With recent cleanups, there is no place where low level driver
directly manipulates request fields. This means that the 'hard'
request fields always equal the !hard fields. Convert all
rq->sectors, nr_sectors and current_nr_sectors references to
accessors.

While at it, drop superflous blk_rq_pos() < 0 test in swim.c.

[ Impact: use pos and nr_sectors accessors ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Mike Miller <mike.miller@hp.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Dario Ballabio <ballabio_dario@emc.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: unsik Kim <donari75@gmail.com>
Cc: Laurent Vivier <Laurent@lvivier.info>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 40cbbb78 22-Apr-2009 Tejun Heo <tj@kernel.org>

block: implement and use [__]blk_end_request_all()

There are many [__]blk_end_request() call sites which call it with
full request length and expect full completion. Many of them ensure
that the request actually completes by doing BUG_ON() the return
value, which is awkward and error-prone.

This patch adds [__]blk_end_request_all() which takes @rq and @error
and fully completes the request. BUG_ON() is added to to ensure that
this actually happens.

Most conversions are simple but there are a few noteworthy ones.

* cdrom/viocd: viocd_end_request() replaced with direct calls to
__blk_end_request_all().

* s390/block/dasd: dasd_end_request() replaced with direct calls to
__blk_end_request_all().

* s390/char/tape_block: tapeblock_end_request() replaced with direct
calls to blk_end_request_all().

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>


# 158dbda0 22-Apr-2009 Tejun Heo <tj@kernel.org>

block: reorganize request fetching functions

Impact: code reorganization

elv_next_request() and elv_dequeue_request() are public block layer
interface than actual elevator implementation. They mostly deal with
how requests interact with block layer and low level drivers at the
beginning of rqeuest processing whereas __elv_next_request() is the
actual eleveator request fetching interface.

Move the two functions to blk-core.c. This prepares for further
interface cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>


# a7f55792 22-Apr-2009 Tejun Heo <tj@kernel.org>

block: kill blk_start_queueing()

blk_start_queueing() is identical to __blk_run_queue() except that it
doesn't check for recursion. None of the current users depends on
blk_start_queueing() running request_fn directly. Replace usages of
blk_start_queueing() with [__]blk_run_queue() and kill it.

[ Impact: removal of mostly duplicate interface function ]

Signed-off-by: Tejun Heo <tj@kernel.org>


# f600abe2 08-Apr-2009 Jens Axboe <jens.axboe@oracle.com>

block: fix bad spelling of quiesce

Credit goes to Andrew Morton for spotting this one.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 26308eab 27-Mar-2009 Jerome Marchand <jmarchan@redhat.com>

block: fix inconsistency in I/O stat accounting code

This forces in_flight to be zero when turning off or on the I/O stat
accounting and stops updating I/O stats in attempt_merge() when
accounting is turned off.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 6c7e8cee 27-Mar-2009 Jens Axboe <jens.axboe@oracle.com>

block: elevator quiescing helpers

Simple helper functions to quiesce the request queue. These are
currently only used for switching IO schedulers on-the-fly, but
we can use them to properly switch IO accounting on and off as well.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 1faa16d2 06-Apr-2009 Jens Axboe <jens.axboe@oracle.com>

block: change the request allocation/congestion logic to be sync/async based

This makes sure that we never wait on async IO for sync requests, instead
of doing the split on writes vs reads.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b374d18a 31-Oct-2008 Jens Axboe <jens.axboe@oracle.com>

block: get rid of elevator_t typedef

Just use struct elevator_queue everywhere instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 58eea927 27-Nov-2008 Tejun Heo <tj@kernel.org>

block: simplify empty barrier implementation

Empty barrier required special handling in __elv_next_request() to
complete it without letting the low level driver see it.

With previous changes, barrier code is now flexible enough to skip the
BAR step using the same barrier sequence selection mechanism. Drop
the special handling and mask off q->ordered from start_ordered().

Remove blk_empty_barrier() test which now has no user.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 8f11b3e9 27-Nov-2008 Tejun Heo <tj@kernel.org>

block: make barrier completion more robust

Barrier completion had the following assumptions.

* start_ordered() couldn't finish the whole sequence properly. If all
actions are to be skipped, q->ordseq is set correctly but the actual
completion was never triggered thus hanging the barrier request.

* Drain completion in elv_complete_request() assumed that there's
always at least one request in the queue when drain completes.

Both assumptions are true but these assumptions need to be removed to
improve empty barrier implementation. This patch makes the following
changes.

* Make start_ordered() use blk_ordered_complete_seq() to mark skipped
steps complete and notify __elv_next_request() that it should fetch
the next request if the whole barrier has completed inside
start_ordered().

* Make drain completion path in elv_complete_request() check whether
the queue is empty. Empty queue also indicates drain completion.

* While at it, convert 0/1 return from blk_do_ordered() to false/true.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 53a08807 02-Dec-2008 Tejun Heo <tj@kernel.org>

block: internal dequeue shouldn't start timer

blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
both start the timeout timer. Barrier code dequeues the original
barrier request but doesn't passes the request itself to lower level
driver, only broken down proxy requests; however, as the original
barrier code goes through the same dequeue path and timeout timer is
started on it. If barrier sequence takes long enough, this timer
expires but the low level driver has no idea about this request and
oops follows.

Timeout timer shouldn't have been started on the original barrier
request as it never goes through actual IO. This patch unexports
elv_dequeue_request(), which has no external user anyway, and makes it
operate on elevator proper w/o adding the timer and make
blkdev_dequeue_request() call elv_dequeue_request() and add timer.
Internal users which don't pass the request to driver - barrier code
and end_that_request_last() - are converted to use
elv_dequeue_request().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0bfc2455 26-Nov-2008 Ingo Molnar <mingo@elte.hu>

blktrace: port to tracepoints, update

Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
sites. Spread them out to the usage sites, as suggested by
Mathieu Desnoyers.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>


# 5f3ea37c 30-Oct-2008 Arnaldo Carvalho de Melo <acme@redhat.com>

blktrace: port to tracepoints

This was a forward port of work done by Mathieu Desnoyers, I changed it to
encode the 'what' parameter on the tracepoint name, so that one can register
interest in specific events and not on classes of events to then check the
'what' parameter.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 2920ebbd 30-Oct-2008 Tejun Heo <tj@kernel.org>

block: add timer on blkdev_dequeue_request() not elv_next_request()

Block queue supports two usage models - one where block driver peeks
at the front of queue using elv_next_request(), processes it and
finishes it and the other where block driver peeks at the front of
queue, dequeue the request using blkdev_dequeue_request() and finishes
it. The latter is more flexible as it allows the driver to process
multiple commands concurrently.

These two inconsistent usage models affect the block layer
implementation confusing. For some, elv_next_request() is considered
the issue point while others consider blkdev_dequeue_request() the
issue point.

Till now the inconsistency mostly affect only accounting, so it didn't
really break anything seriously; however, with block layer timeout,
this inconsistency hits hard. Block layer considers
elv_next_request() the issue point and adds timer but SCSI layer
thinks it was just peeking and when the request can't process the
command right away, it's just left there without further processing.
This makes the request dangling on the timer list and, when the timer
goes off, the request which the SCSI layer and below think is still on
the block queue ends up in the EH queue, causing various problems - EH
hang (failed count goes over busy count and EH never wakes up),
WARN_ON() and oopses as low level driver trying to handle the unknown
command, etc. depending on the timing.

As SCSI midlayer is the only user of block layer timer at the moment,
moving blk_add_timer() to elv_dequeue_request() fixes the problem;
however, this two usage models definitely need to be cleaned up in the
future.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 80a4b58e 14-Oct-2008 Jens Axboe <jens.axboe@oracle.com>

block: only call ->request_fn when the queue is not stopped

Callers should use either blk_run_queue/__blk_run_queue, or
blk_start_queueing() to invoke request handling instead of calling
->request_fn() directly as that does not take the queue stopped
flag into account.

Also add appropriate comments on the above functions to detail
their usage.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# ee2e992c 14-Oct-2008 Li Zefan <lizf@cn.fujitsu.com>

block: simplify string handling in elv_iosched_store()

strlcpy() guarantees the dest buffer is NULL teminated.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 99cd3386 01-Oct-2008 Kiyoshi Ueda <k-ueda@ct.jp.nec.com>

block: change elevator to use __blk_end_request()

This patch converts elevator to use __blk_end_request() directly
so that end_{queued|dequeued}_request() can be removed.
Related 'uptodate' arguments is converted to 'error'.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 11914a53 13-Sep-2008 Mike Anderson <andmike@linux.vnet.ibm.com>

block: Add interface to abort queued requests

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 242f9dcb 14-Sep-2008 Jens Axboe <jens.axboe@oracle.com>

block: unify request timeout handling

Right now SCSI and others do their own command timeout handling.
Move those bits to the block layer.

Instead of having a timer per command, we try to be a bit more clever
and simply have one per-queue. This avoids the overhead of having to
tear down and setup a timer for each command, so it will result in a lot
less timer fiddling.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0835da67 26-Aug-2008 Jens Axboe <jens.axboe@oracle.com>

block: use linux/uaccess.h in elevator.c instead of asm variant

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 5df97b91 15-Aug-2008 Mikulas Patocka <mpatocka@redhat.com>

drop vmerge accounting

Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# e17fc0a1 09-Aug-2008 David Woodhouse <David.Woodhouse@intel.com>

Allow elevators to sort/merge discard requests

But blkdev_issue_discard() still emits requests which are interpreted as
soft barriers, because naïve callers might otherwise issue subsequent
writes to those same sectors, which might cross on the queue (if they're
reallocated quickly enough).

Callers still _can_ issue non-barrier discard requests, but they have to
take care of queue ordering for themselves.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# e180f594 01-Jul-2008 maximilian attems <max@stro.a>

block: request_module(): use format string

Avoid bad things happening if the module has a printk control string in
its name.

Signed-off-by: maximilian attems <max@stro.at>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 7ba1ba12 30-Jun-2008 Martin K. Petersen <martin.petersen@oracle.com>

block: Block layer data integrity support

Some block devices support verifying the integrity of requests by way
of checksums or other protection information that is submitted along
with the I/O.

This patch implements support for generating and verifying integrity
metadata, as well as correctly merging, splitting and cloning bios and
requests that have this extra information attached.

See Documentation/block/data-integrity.txt for more information.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 4722dc52 27-May-2008 Alan D. Brunelle <Alan.Brunelle@hp.com>

Added in elevator switch message to blktrace stream

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 24c03d47 01-May-2008 Harvey Harrison <harvey.harrison@gmail.com>

block: remove remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ac9fafa1 29-Apr-2008 Alan D. Brunelle <Alan.Brunelle@hp.com>

block: Skip I/O merges when disabled

The block I/O + elevator + I/O scheduler code spend a lot of time trying
to merge I/Os -- rightfully so under "normal" circumstances. However,
if one were to know that the incoming I/O stream was /very/ random in
nature, the cycles are wasted.

This patch adds a per-request_queue tunable that (when set) disables
merge attempts (beyond the simple one-hit cache check), thus freeing up
a non-trivial amount of CPU cycles.

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 72ed0bf6 29-Apr-2008 Adrian Bunk <bunk@kernel.org>

block/elevator.c:elv_rq_merge_ok() mustn't be inline

This patch fixes the following build error with UML and gcc 4.3:

<-- snip -->

...
CC block/elevator.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c: In function ‘elv_merge’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:103: sorry, unimplemented: called from here
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:495: sorry, unimplemented: called from here
make[2]: *** [block/elevator.o] Error 1
make[1]: *** [block] Error 2

<-- snip -->

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 75ad23bc 29-Apr-2008 Nick Piggin <npiggin@suse.de>

block: make queue flags non-atomic

We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# e1640949 19-Feb-2008 Jens Axboe <jens.axboe@oracle.com>

elevator: make elevator_get() attempt to load the appropriate module

Currently we fail if someone requests a valid io scheduler, but it's
modular and not currently loaded. That can happen from a driver init
asking for a different scheduler, or online switching through sysfs
as requested by a user.

This patch makes elevator_get() request_module() to attempt to load
the appropriate module, instead of requiring that done manually.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 4eb166d9 31-Jan-2008 Jens Axboe <jens.axboe@oracle.com>

block: make elevator lib checkpatch compliant

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# fa0ccd83 10-Jan-2008 James Bottomley <James.Bottomley@HansenPartnership.com>

block: implement drain buffers

These DMA drain buffer implementations in drivers are pretty horrible
to do in terms of manipulating the scatterlist. Plus they're being
done at least in drivers/ide and drivers/ata, so we now have code
duplication.

The one use case for this, as I understand it is AHCI controllers doing
PIO mode to mmc devices but translating this to DMA at the controller
level.

So, what about adding a callback to the block layer that permits the
adding of the drain buffer for the problem devices. The idea is that
you'd do this in slave_configure after you find one of these devices.

The beauty of doing it in the block layer is that it quietly adds the
drain buffer to the end of the sg list, so it automatically gets mapped
(and unmapped) without anything unusual having to be done to the
scatterlist in driver/scsi or drivers/ata and without any alteration to
the transfer length.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# f9cb074b 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: rename kobject_init_ng() to kobject_init()

Now that the old kobject_init() function is gone, rename
kobject_init_ng() to kobject_init() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# b2d6db58 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: rename kobject_add_ng() to kobject_add()

Now that the old kobject_add() function is gone, rename kobject_add_ng()
to kobject_add() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 29e3dd0d 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: convert block/elevator.c to use kobject_init/add_ng()

This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 2fdd82bd 12-Dec-2007 Adrian Bunk <bunk@kernel.org>

block: let elv_register() return void

elv_register() always returns 0, and there isn't anything it does where
it should return an error (the only error condition is so grave that
it's handled with a BUG_ON).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# dbe7f76d 19-Oct-2007 Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>

fix typo "insted" -> "instead"

Signed-off-by: Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>


# bf2de6f5 27-Sep-2007 Jens Axboe <jens.axboe@oracle.com>

block: Initial support for data-less (or empty) barrier support

This implements functionality to pass down or insert a barrier
in a queue, without having data attached to it. The ->prepare_flush_fn()
infrastructure from data barriers are reused to provide this
functionality.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# a0cd1285 21-Sep-2007 Jens Axboe <jens.axboe@oracle.com>

block: add end_queued_request() and end_dequeued_request() helpers

We can use this helper in the elevator core for BLKPREP_KILL, and it'll
also be useful for the empty barrier patch.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 19c38de8 12-Sep-2007 Greg Kroah-Hartman <gregkh@suse.de>

kobjects: fix up improper use of the kobject name field

A number of different drivers incorrect access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.


# 165125e1 24-Jul-2007 Jens Axboe <jens.axboe@oracle.com>

[BLOCK] Get rid of request_queue_t typedef

Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 94f6030c 17-Jul-2007 Christoph Lameter <clameter@sgi.com>

Slab allocators: Replace explicit zeroing with __GFP_ZERO

kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
variant in the past. But with __GFP_ZERO it is possible now to do zeroing
while allocating.

Use __GFP_ZERO to remove the explicit clearing of memory via memset whereever
we can.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 70cee26e 09-Jul-2007 Matthias Kaehlcke <matthias.kaehlcke@gmail.com>

Use list_for_each_entry() instead of list_for_each() in the block device

elevator

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 2a12dcd7 26-Apr-2007 Jens Axboe <jens.axboe@oracle.com>

[PATCH] elevator: elv_list_lock does not need irq disabling

It's never grabbed from irq context, so just make it plain spin_lock().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 1ffb96c5 14-Mar-2007 Thibaut VARENE <T-Bone@parisc-linux.org>

make elv_register() output atomic

Booting 2.6.21-rc3-g45592145 I noticed the following on one of my
machines in the bootlog:

io scheduler noop registered<6>Time: jiffies clocksource has been installed.

io scheduler deadline registered (default)

Looking at block/elevator.c, it appears that elv_register() uses two
consecutive printks in a non-atomic way, leading to the above glitch. The
attached trivial patch fixes this issue, by using a single printk.

Signed-off-by: Thibaut VARENE <varenet@parisc-linux.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 783660b2 18-Jan-2007 Jens Axboe <jens.axboe@oracle.com>

elevator: don't sort reads between writes

Don't allow elv_dispatch_sort() to mix reads and writes together,
it's rarely a good idea.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# cad97516 14-Jan-2007 Jens Axboe <jens.axboe@oracle.com>

elevator: abstract out the activate and deactivate functions

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 95543179 23-Jan-2007 Linas Vepstas <linas@austin.ibm.com>

[PATCH] elevator: move clearing of unplug flag earlier

A flag was recently added to the elevator code to avoid
performing an unplug when reuests are being re-queued.
The goal of this flag was to avoid a deep recursion that
can occur when re-queueing requests after a SCSI device/host
reset. See http://lkml.org/lkml/2006/5/17/254

However, that fix added the flag near the bottom of a case
statement, where an earlier break (in an if statement) could
transport one out of the case, without setting the flag.
This patch sets the flag earlier in the case statement.

I re-discovered the deep recursion recently during testing;
I was told that it was a known problem, and the fix to it was
in the kernel I was testing. Indeed it was ... but it didn't
fix the bug. With the patch below, I no longer see the bug.

Signed-off by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb4067e3 21-Dec-2006 Jens Axboe <jens.axboe@oracle.com>

[PATCH] elevator: fixup typo in merge logic

The recent io scheduler allow_merge commit left the block layer with
no merging, oops. This patch fixes that up.

That means the CFQ change needs to be verified again, it might not fix
the original bug now. But that's a seperate thing, I'll double check
that tomorrow.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# da775265 20-Dec-2006 Jens Axboe <jens.axboe@oracle.com>

[PATCH] cfq-iosched: don't allow sync merges across queues

Currently we allow any merge, even if the io originates from different
processes. This can cause really bad starvation and unfairness, if those
ios happen to be synchronous (reads or direct writes).

So add a allow_merge hook to the io scheduler ops, so an io scheduler can
help decide whether a bio/process combination may be merged with an
existing request.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# bb37b94c 01-Dec-2006 Jens Axboe <jens.axboe@oracle.com>

[BLOCK] Cleanup unused variable passing

- ->init_queue() does not need the elevator passed in
- ->put_request() is a hot path and need not have the queue passed in
- cfq_update_io_seektime() does not need cfqd passed in

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# c5841642 11-Oct-2006 Vasily Tarasov <vtaras@openvz.org>

[PATCH] block layer: elv_iosched_show should get elv_list_lock

elv_iosched_show function iterates other elv_list, hence
elv_list_lock should be got.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Vasily Tarasov <jens.axboe@oracle.com>


# a22b169d 11-Oct-2006 Vasily Tarasov <vtaras@openvz.org>

[PATCH] block layer: elevator_find function cleanup

We can easily produce search through the elevator list
without introducing additional elevator_type variable.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0fe23479 04-Sep-2006 Jens Axboe <axboe@kernel.dk>

[PATCH] Update axboe@suse.de email address

As people often look for the copyright in files to see who to mail,
update the link to a neutral one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b5deef90 19-Jul-2006 Jens Axboe <axboe@suse.de>

[PATCH] Make sure all block/io scheduler setups are node aware

Some were kmalloc_node(), some were still kmalloc(). Change them all to
kmalloc_node().

Signed-off-by: Jens Axboe <axboe@suse.de>


# cb78b285 28-Jul-2006 Jens Axboe <axboe@suse.de>

[PATCH] Drop useless bio passing in may_queue/set_request API

It's not needed for anything, so kill the bio passing.

Signed-off-by: Jens Axboe <axboe@suse.de>


# c00895ab 30-Sep-2006 Jens Axboe <axboe@kernel.dk>

[PATCH] Remove ->waiting member from struct request

As the comments indicates in blkdev.h, we can fold it into ->end_io_data
usage as that is really what ->waiting is. Fixup the users of
blk_end_sync_rq().

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2e662b65 13-Jul-2006 Jens Axboe <axboe@suse.de>

[PATCH] elevator: abstract out the rbtree sort handling

The rbtree sort/lookup/reposition logic is mostly duplicated in
cfq/deadline/as, so move it to the elevator core. The io schedulers
still provide the actual rb root, as we don't want to impose any sort
of specific handling on the schedulers.

Introduce the helpers and rb_node in struct request to help migrate the
IO schedulers.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 9817064b 28-Jul-2006 Jens Axboe <axboe@suse.de>

[PATCH] elevator: move the backmerging logic into the elevator core

Right now, every IO scheduler implements its own backmerging (except for
noop, which does no merging). That results in duplicated code for
essentially the same operation, which is never a good thing. This patch
moves the backmerging out of the io schedulers and into the elevator
core. We save 1.6kb of text and as a bonus get backmerging for noop as
well. Win-win!

Signed-off-by: Jens Axboe <axboe@suse.de>


# 4aff5e23 10-Aug-2006 Jens Axboe <axboe@suse.de>

[PATCH] Split struct request ->flags into two parts

Right now ->flags is a bit of a mess: some are request types, and
others are just modifiers. Clean this up by splitting it into
->cmd_type and ->cmd_flags. This allows introduction of generic
Linux block message types, useful for sending generic Linux commands
to block devices.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 2d8f6131 22-Aug-2006 Oleg Nesterov <oleg@tv-sign.ru>

elv_unregister: fix possible crash on module unload

An exiting task or process which didn't do I/O yet have no io context,
elv_unregister() should check it is not NULL.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# acf42175 12-Jun-2006 Dave Jones <davej@redhat.com>

[PATCH] remove dead code from elevator switching

We already drop the refcount in elevator_exit(), and as
we're setting 'e' to NULL, we'll never take that branch anyway.
Finally, as 'e' is a local var that isn't referenced afterwards,
setting it to NULL is pointless.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# bc1c1169 08-Jun-2006 Jens Axboe <axboe@suse.de>

[PATCH] elevator switching race

There's a race between shutting down one io scheduler and firing up the
next, in which a new io could enter and cause the io scheduler to be
invoked with bad or NULL data.

To fix this, we need to maintain the queue lock for a bit longer.
Unfortunately we cannot do that, since the elevator init requires to be
run without the lock held. This isn't easily fixable, without also
changing the mempool API. So split the initialization into two parts,
and alloc-init operation and an attach operation. Then we can
preallocate the io scheduler and related structures, and run the attach
inside the lock after we detach the old one.

This patch has survived 30 minutes of 1 second io scheduler switching
with a very busy io load.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# dac07ec1 11-May-2006 Jens Axboe <axboe@suse.de>

[BLOCK] limit request_fn recursion

Don't recurse back into the driver even if the unplug threshold is met,
when the driver asks for a requeue. This is both silly from a logical
point of view (requeues typically happen due to driver/hardware
shortage), and also dangerous since we could hit an endless request_fn
-> requeue -> unplug -> request_fn loop and crash on stack overrun.

Also limit blk_run_queue() to one level of recursion, similar to how
blk_start_queue() works.

This patch fixed a real problem with SLES10 and lpfc, and it could hit
any SCSI lld that returns non-zero from it's ->queuecommand() handler.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4f73247f 20-Apr-2006 Adrian Bunk <bunk@stusta.de>

[PATCH] block/elevator.c: remove unused exports

This patch removes the following unused EXPORT_SYMBOL's:
- elv_requeue_request
- elv_completed_request

They are only used by the block core, hence they need not be exported.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 9b41046c 31-Mar-2006 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

[PATCH] Don't pass boot parameters to argv_init[]

The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).

And __setup() is used in obsolete_checksetup().

start_kernel()
-> parse_args()
-> unknown_bootoption()
-> obsolete_checksetup()

If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.

If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func(). If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].

Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.

This patch fixes a wrong usage of it, however fixes obvious one only.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ce524497 24-Mar-2006 Eric Sesterhenn <snakebyte@gmx.de>

BUG_ON() Conversion in block/elevator.c

this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 2056a782 23-Mar-2006 Jens Axboe <axboe@suse.de>

[PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23

Signed-off-by: Jens Axboe <axboe@suse.de>


# e572ec7e 18-Mar-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] fix rmmod problems with elevator attributes, clean them up


# 3d1ab40f 18-Mar-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] elevator_t lifetime rules and sysfs fixes


# e17a9489 18-Mar-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] stop elv_unregister() from rogering other iosched's data, fix locking

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 30e9656c 08-Feb-2006 Tejun Heo <htejun@gmail.com>

[PATCH] block: implement elv_insert and use it (fix ordcolor flipping bug)

q->ordcolor must only be flipped on initial queueing of a hardbarrier
request.

Constructing ordered sequence and requeueing used to pass through
__elv_add_request() which flips q->ordcolor when it sees a barrier
request.

This patch separates out elv_insert() from __elv_add_request() and uses
elv_insert() when constructing ordered sequence and requeueing.
elv_insert() inserts the given request at the specified position and
does nothing else.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 238e7db9 05-Feb-2006 Tejun Heo <htejun@gmail.com>

[PATCH] block: request_queue->ordcolor must not be flipped on SOFTBARRIER

q->ordcolor must not be flipped on SOFTBARRIER.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 248d5ca5 24-Jan-2006 Nate Diller <nate.diller@gmail.com>

[BLOCK] elevator: allow default scheduler to potentially be modular

Jens has decided that allowing the default scheduler to be a module is
a bug, and should not be allowed under kconfig. However, I find that
scenario useful for debugging, and wish for the kernel to be able to
handle this situation without OOPSing, if I enable such an option in
the .config directly. This patch dynamically checks for the presence
of the compiled-in default, and falls back to no-op, emitting a
suitable error message, when the default is not available

Tested for a range of boot options on 2.6.16-rc1-mm2.

Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 5f003976 24-Jan-2006 Nate Diller <nate.diller@gmail.com>

[BLOCK] elevator: default choice selection

My previous default iosched patch did a poor job dealing with the
'elevator=' boot-time option. The old behavior falls back to the
compiled-in default if the requested one is not registered at boot
time. This patch dynamically evaluates which default
to use, and emits a suitable error message when the requested scheduler
is not available. It also does the 'as' -> 'anticipatory' conversion
before elevator registration, which along with a modified registration
function, allows it to correctly indicate which default scheduler is
in use.

Tested for a range of boot options on 2.6.16-rc1-mm2.

Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 4c264583 16-Jan-2006 Chuck Ebbert <76306.1226@compuserve.com>

[PATCH] elevator=as back-compatibility

As of 2.6.15 you need to use "anticipatory" instead of "as". Fix that up
so that `elevator=as' still works.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <axboe@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b7bfcf7c 16-Jan-2006 Jens Axboe <axboe@suse.de>

[BLOCK] elevator: if specified scheduler is not found, fall back to default

Signed-off-by: Jens Axboe <axboe@suse.de>


# 752a3b79 16-Jan-2006 Chuck Ebbert <76306.1226@compuserve.com>

[BLOCK] elevator: Make elevator=as work again for anticipatory

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 1bc691d3 12-Jan-2006 Tejun Heo <htejun@gmail.com>

[PATCH] fix queue stalling while barrier sequencing

If ordered tag isn't supported, request ordering for barrier
sequencing is performed by queue draining, which basically hangs the
request queue until elv_completed_request() reports completion of all
previous fs requests.

The condition check in elv_completed_request() was only performed for
fs requests. If a special request is queued between the last
to-be-drained request and the barrier sequence, draining is never
completed and the queue is stalled forever.

This patch moves the end-of-draining condition check such that it's
performed for all requests.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 769db45b 28-Dec-2005 Coywolf Qi Hunt <qiyong@fc-cn.com>

make elv_try_merge() static, kill the dead declaration of

elv_try_last_merge().

Signed-off-by: Coywolf Qi Hunt <qiyong@fc-cn.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 797e7dbb 06-Jan-2006 Tejun Heo <htejun@gmail.com>

[BLOCK] reimplement handling of barrier request

Reimplement handling of barrier requests.

* Flexible handling to deal with various capabilities of
target devices.
* Retry support for falling back.
* Tagged queues which don't support ordered tag can do ordered.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 8ffdc655 06-Jan-2006 Tejun Heo <htejun@gmail.com>

[BLOCK] add @uptodate to end_that_request_last() and @error to rq_end_io_fn()

add @uptodate argument to end_that_request_last() and @error
to rq_end_io_fn(). there's no generic way to pass error code
to request completion function, making generic error handling
of non-fs request difficult (rq->errors is driver-specific and
each driver uses it differently). this patch adds @uptodate
to end_that_request_last() and @error to rq_end_io_fn().

for fs requests, this doesn't really matter, so just using the
same uptodate argument used in the last call to
end_that_request_first() should suffice. imho, this can also
help the generic command-carrying request jens is working on.

Signed-off-by: tejun heo <htejun@gmail.com>
Signed-Off-By: Jens Axboe <axboe@suse.de>


# eb97b73d 16-Nov-2005 Coywolf Qi Hunt <qiyong@fc-cn.com>

[BLOCK] new block/ directory comment tidy

Some leftover comments referring to drivers/block that are now block/.
They don't add any information we don't already have, so kill them.

Signed-off-by: Coywolf Qi Hunt <qiyong@fc-cn.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 3beb2077 10-Nov-2005 Tejun Heo <htejun@gmail.com>

[BLOCK] elevator: elv_latter/former_request update

With generic dispatch queue update, implicit former/latter request
handling using rq->queuelist.prev/next doesn't work as expected
anymore. Also, the only iosched dependent on this feature was
noop-iosched and it has been reimplemented to have its own
latter/former methods. This patch removes implicit former/latter
handling.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# be561235 10-Nov-2005 Tejun Heo <htejun@gmail.com>

[BLOCK] fix string handling in elv_iosched_store

elv_iosched_store doesn't terminate string passed from userspace if
it's too long. Also, if the written length is zero (probably not
possible), it accesses elevator_name[-1]. This patch fixes both bugs.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 15853af9 10-Nov-2005 Tejun Heo <htejun@gmail.com>

[BLOCK] Implement elv_drain_elevator for improved switch error detection

This patch adds request_queue->nr_sorted which keeps the number of
requests in the iosched and implement elv_drain_elevator which
performs forced dispatching. elv_drain_elevator checks whether
iosched actually dispatches all requests it has and prints error
message if it doesn't. As buggy forced dispatching can result in
wrong barrier operations, I think this extra check is worthwhile.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 407df2aa 10-Nov-2005 Tejun Heo <htejun@gmail.com>

[BLOCK] elevator: run queue in elevator_switch

elevator_dispatch needs to run queue after forced dispatching;
otherwise, the queue might stall.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# cff3ba22 09-Nov-2005 Zachary Amsden <zach@vmware.com>

[BLOCK] elevator init fixes #2

In addition to the first patch, which is probably goodness, I found the
cause of my panic - applying this patch fixes it and now I am booting.
If the chosen_elevator[] is not found, fall back to noop.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jens Axboe <axboe@suse.de>


# b8ea2cb5 09-Nov-2005 Zachary Amsden <zach@vmware.com>

[BLOCK] elevator init fixes

I got a panic in the elevator code, backtrace :

Unable to handle kernel NULL pointer dereference at virtual address 00000060
..
EIP is at elevator_put+0x0/0x30 (null elevator_type passed)
..
elevator_init+0x38
blk_init_queu_node+0xc9
floppy_init+0xdb
do_initcalls+0x23
init+0x10a
init+0x0

Clearly if the kmalloc here fails, e->elevator_type is not yet set; this
appears to be the correct fix, but I think I probably hit the second case
due to a race condition. Someone more familiar with the elevator code
should look at this more closely until I can determine if I can reproduce.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 3a65dfe8 04-Nov-2005 Jens Axboe <axboe@suse.de>

[BLOCK] Move all core block layer code to new block/ directory

drivers/block/ is right now a mix of core and driver parts. Lets move
the core parts to a new top level directory. Al will move the fs/
related block parts to block/ next.

Signed-off-by: Jens Axboe <axboe@suse.de>