History log of /linux-master/block/blk-mq.c
Revision Date Author Comments
# 55251fbd 27-Mar-2024 Damien Le Moal <dlemoal@kernel.org>

block: Do not force full zone append completion in req_bio_endio()

This reverts commit 748dc0b65ec2b4b7b3dbd7befcc4a54fdcac7988.

Partial zone append completions cannot be supported as there is no
guarantees that the fragmented data will be written sequentially in the
same manner as with a full command. Commit 748dc0b65ec2 ("block: fix
partial zone append completion handling in req_bio_endio()") changed
req_bio_endio() to always advance a partially failed BIO by its full
length, but this can lead to incorrect accounting. So revert this
change and let low level device drivers handle this case by always
failing completely zone append operations. With this revert, users will
still see an IO error for a partially completed zone append BIO.

Fixes: 748dc0b65ec2 ("block: fix partial zone append completion handling in req_bio_endio()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240328004409.594888-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ec30b461 27-Feb-2024 Ming Lei <ming.lei@redhat.com>

blk-mq: don't change nr_hw_queues and nr_maps for kdump kernel

For most of ARCHs, 'nr_cpus=1' is passed for kdump kernel, so
nr_hw_queues for each mapping is supposed to be 1 already.

More importantly, this way may cause trouble for driver, because blk-mq and
driver see different queue mapping since driver should setup hardware
queue setting before calling into allocating blk-mq tagset.

So not overriding nr_hw_queues and nr_maps for kdump kernel.

Cc: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228040857.306483-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# af550e4c 23-Feb-2024 Qais Yousef <qyousef@layalina.io>

block/blk-mq: Don't complete locally if capacities are different

The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
CPUs have equal compute capacities and only LLC cache can make
a different on perceived performance. But this assumption falls apart on
HMP systems where LLC is shared, but the CPUs have different capacities.
Staying local then can have a big performance impact if the IO request
was done from a CPU with higher capacity but the interrupt is serviced
on a lower capacity CPU.

Use the new cpus_equal_capacity() function to check if we need to send
an IPI.

Without the patch I see the BLOCK softirq always running on little cores
(where the hardirq is serviced). With it I can see it running on all
cores.

This was noticed after the topology change [1] where now on a big.LITTLE
we truly get that the LLC is shared between all cores where as in the
past it was being misrepresented for historical reasons. The logic
exposed a missing dependency on capacities for such systems where there
can be a big performance difference between the CPUs.

This of course introduced a noticeable change in behavior depending on
how the topology is presented. Leading to regressions in some workloads
as the performance of the BLOCK softirq on littles can be noticeably
worse on some platforms.

Worth noting that we could have checked for capacities being greater
than or equal instead for equality. This will lead to favouring higher
performance always. But opted for equality instead to match the
performance of the requester without making an assumption that can lead
to power trade-offs which these systems tend to be sensitive about. If
the requester would like to run faster, it's better to rely on the
scheduler to give the IO requester via some facility to run on a faster
core; and then if the interrupt triggered on a CPU with different
capacity we'll make sure to match the performance the requester is
supposed to run at.

[1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240223155749.2958009-3-qyousef@layalina.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0eb4db47 23-Feb-2024 Keith Busch <kbusch@kernel.org>

block: io wait hang check helper

This is the same in two places, and another will be added soon. Create a
helper for it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240223155910.3622666-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27e32cd2 13-Feb-2024 Christoph Hellwig <hch@lst.de>

block: pass a queue_limits argument to blk_mq_alloc_disk

Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9ac4dd8c 13-Feb-2024 Christoph Hellwig <hch@lst.de>

block: pass a queue_limits argument to blk_mq_init_queue

Pass a queue_limits to blk_mq_init_queue and apply it if non-NULL. This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Also rename the function to blk_mq_alloc_queue as that is a much better
name for a function that allocates a queue and always pass the queuedata
argument instead of having a separate version for the extra argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ad751ba1 13-Feb-2024 Christoph Hellwig <hch@lst.de>

block: pass a queue_limits argument to blk_alloc_queue

Pass a queue_limits to blk_alloc_queue and apply it after validating and
capping the values using blk_validate_limits. This will allow allocating
queues with valid queue limits instead of setting the values one at a
time later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 44981351 02-Feb-2024 Bart Van Assche <bvanassche@acm.org>

block, fs: Restore the per-bio/request data lifetime fields

Restore support for passing data lifetime information from filesystems to
block drivers. This patch reverts commit b179c98f7697 ("block: Remove
request.write_hint") and commit c75e707fe1aa ("block: remove the
per-bio/request write hint").

This patch does not modify the size of struct bio because the new
bi_write_hint member fills a hole in struct bio. pahole reports the
following for struct bio on an x86_64 system with this patch applied:

/* size: 112, cachelines: 2, members: 20 */
/* sum members: 110, holes: 1, sum holes: 2 */
/* last cacheline: 48 bytes */

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 08420cf7 15-Jan-2024 Jens Axboe <axboe@kernel.dk>

block: add blk_time_get_ns() and blk_time_get() helpers

Convert any user of ktime_get_ns() to use blk_time_get_ns(), and
ktime_get() to blk_time_get(), so we have a unified API for querying the
current time in nanoseconds or as ktime.

No functional changes intended, this patch just wraps ktime_get_ns()
and ktime_get() with a block helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 72e84e90 24-Jan-2024 Christoph Hellwig <hch@lst.de>

blk-mq: special case cached requests less

Share the main merge / split / integrity preparation code between the
cached request vs newly allocated request cases, and add comments
explaining the cached request handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240124092658.2258309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 337e89fe 24-Jan-2024 Christoph Hellwig <hch@lst.de>

blk-mq: introduce a blk_mq_peek_cached_request helper

Add a new helper to check if there is suitable cached request in
blk_mq_submit_bio. This removes open coded logic in blk_mq_submit_bio
and moves some checks that so far are in blk_mq_use_cached_rq to
be performed earlier. This avoids the case where we first do check
with the cached request but then later end up allocating a new one
anyway and need to grab a queue reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240124092658.2258309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0f299da5 24-Jan-2024 Christoph Hellwig <hch@lst.de>

blk-mq: move blk_mq_attempt_bio_merge out blk_mq_get_new_requests

blk_mq_attempt_bio_merge has nothing to do with allocating a new
request, it avoids allocating a new request. Move the call out of
blk_mq_get_new_requests and into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240124092658.2258309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f3c89983 30-Jan-2024 Hongyu Jin <hongyu.jin@unisoc.com>

block: Fix where bio IO priority gets set

Commit 82b74cac2849 ("blk-ioprio: Convert from rqos policy to direct
call") pushed setting bio I/O priority down into blk_mq_submit_bio()
-- which is too low within block core's submit_bio() because it
skips setting I/O priority for block drivers that implement
fops->submit_bio() (e.g. DM, MD, etc).

Fix this by moving bio_set_ioprio() up from blk-mq.c to blk-core.c and
call it from submit_bio(). This ensures all block drivers call
bio_set_ioprio() during initial bio submission.

Fixes: a78418e6a04c ("block: Always initialize bio IO priority on submit")
Co-developed-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
[snitzer: revised commit header]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240130202638.62600-2-snitzer@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b4f36cd 12-Jan-2024 Jens Axboe <axboe@kernel.dk>

block: ensure we hold a queue reference when using queue limits

q_usage_counter is the only thing preventing us from the limits changing
under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold
it while calling into it.

Move the splitting inside the region where we know we've got a queue
reference. Ideally this could still remain a shared section of code, but
let's keep the fix simple and defer any refactoring here to later.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 900e08075202 ("block: move queue enter logic into blk_mq_submit_bio()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 309ce674 11-Jan-2024 Christoph Hellwig <hch@lst.de>

blk-mq: rename blk_mq_can_use_cached_rq

blk_mq_can_use_cached_rq doesn't just check if we can use the request,
but also performs the work to actually use it. Remove the _can in the
naming, and improve the comment describing the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240111135705.2155518-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5266caaf 12-Jan-2024 Ming Lei <ming.lei@redhat.com>

blk-mq: fix IO hang from sbitmap wakeup race

In blk_mq_mark_tag_wait(), __add_wait_queue() may be re-ordered
with the following blk_mq_get_driver_tag() in case of getting driver
tag failure.

Then in __sbitmap_queue_wake_up(), waitqueue_active() may not observe
the added waiter in blk_mq_mark_tag_wait() and wake up nothing, meantime
blk_mq_mark_tag_wait() can't get driver tag successfully.

This issue can be reproduced by running the following test in loop, and
fio hang can be observed in < 30min when running it on my test VM
in laptop.

modprobe -r scsi_debug
modprobe scsi_debug delay=0 dev_size_mb=4096 max_queue=1 host_max_queue=1 submit_queues=4
dev=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
fio --filename=/dev/"$dev" --direct=1 --rw=randrw --bs=4k --iodepth=1 \
--runtime=100 --numjobs=40 --time_based --name=test \
--ioengine=libaio

Fix the issue by adding one explicit barrier in blk_mq_mark_tag_wait(), which
is just fine in case of running out of tag.

Cc: Jan Kara <jack@suse.cz>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Reported-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240112122626.4181044-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 748dc0b6 10-Jan-2024 Damien Le Moal <dlemoal@kernel.org>

block: fix partial zone append completion handling in req_bio_endio()

Partial completions of zone append request is not allowed but if a zone
append completion indicates a number of completed bytes different from
the original BIO size, only the BIO status is set to error. This leads
to bio_advance() not setting the BIO size to 0 and thus to not call
bio_endio() at the end of req_bio_endio().

Make sure a partially completed zone append is failed and completed
immediately by forcing the completed number of bytes (nbytes) to be
equal to the BIO size, thus ensuring that bio_endio() is called.

Fixes: 297db731847e ("block: fix req_bio_endio append error handling")
Cc: stable@kernel.vger.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240110092942.442334-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 847c5bcd 23-Nov-2023 Kundan Kumar <kundan.kumar@samsung.com>

block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io

Write-back throttling (WBT) enables QUEUE_FLAG_STATS on the request
queue. But WBT does not make sense for passthrough io, so skip
QUEUE_FLAG_STATS processing.

Also skip rq_qos_issue/done for passthrough io.

Overall, the change gives ~11% hike in peak performance.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20231123190331.7934-1-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0e4237ae 01-Dec-2023 Ming Lei <ming.lei@redhat.com>

blk-mq: don't count completed flush data request as inflight in case of quiesce

Request queue quiesce may interrupt flush sequence, and the original request
may have been marked as COMPLETE, but can't get finished because of
queue quiesce.

This way is fine from driver viewpoint, because flush sequence is block
layer concept, and it isn't related with driver.

However, driver(such as dm-rq) can call blk_mq_queue_inflight() to count &
drain inflight requests, then the wait & drain never gets done because
the completed & not-finished flush request is counted as inflight.

Fix this issue by not counting completed flush data request as inflight in
case of quiesce.

Cc: Mike Snitzer <snitzer@kernel.org>
Cc: David Jeffery <djeffery@redhat.com>
Cc: John Pittman <jpittman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b0077e26 12-Nov-2023 Christoph Hellwig <hch@infradead.org>

blk-mq: make sure active queue usage is held for bio_integrity_prep()

blk_integrity_unregister() can come if queue usage counter isn't held
for one bio with integrity prepared, so this request may be completed with
calling profile->complete_fn, then kernel panic.

Another constraint is that bio_integrity_prep() needs to be called
before bio merge.

Fix the issue by:

- call bio_integrity_prep() with one queue usage counter grabbed reliably

- call bio_integrity_prep() before bio merge

Fixes: 900e080752025f00 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/r/20231113035231.2708053-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 217b613a 13-Sep-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: update driver tags request table when start request

Now we update driver tags request table in blk_mq_get_driver_tag(),
so the driver that support queue_rqs() have to update that inflight
table by itself.

Move it to blk_mq_start_request(), which is a better place where
we setup the deadline for request timeout check. And it's just
where the request becomes inflight.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 434097ee 13-Sep-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: support batched queue_rqs() on shared tags queue

Since active requests have been accounted when allocate driver tags,
we can remove this limit now.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-4-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 48554df6 13-Sep-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: remove RQF_MQ_INFLIGHT

Since the previous patch change to only account active requests when
we really allocate the driver tag, the RQF_MQ_INFLIGHT can be removed
and no double account problem.

1. none elevator: flush request will use the first pending request's
driver tag, won't double account.

2. other elevator: flush request will be accounted when allocate driver
tag when issue, and will be unaccounted when it put the driver tag.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b8643d68 13-Sep-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: account active requests when get driver tag

There is a limit that batched queue_rqs() can't work on shared tags
queue, since the account of active requests can't be done there.

Now we account the active requests only in blk_mq_get_driver_tag(),
which is not the time we get driver tag actually (with none elevator).

To support batched queue_rqs() on shared tags queue, we move the
account of active requests to where we get the driver tag:

1. none elevator: blk_mq_get_tags() and blk_mq_get_tag()
2. other elevator: __blk_mq_alloc_driver_tag()

This is clearer and match with the unaccount side, which just happen
when we put the driver tag.

The other good point is that we don't need RQF_MQ_INFLIGHT trick
anymore, which used to avoid double account of flush request.
Now we only account when actually get the driver tag, so all is good.
We will remove RQF_MQ_INFLIGHT in the next patch.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6be6d112 07-Sep-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: fix tags UAF when shrinking q->nr_hw_queues

When nr_hw_queues shrink, we free the excess tags before realloc'ing
hw_ctxs for each queue. During that resize, we may need to access those
tags, like blk_mq_tag_idle(hctx) will access queue shared tags.

This can cause a slab use-after-free, as reported by KASAN. Fix it by
moving the releasing of excess tags to the end.

Fixes: e1dd7bc93029 ("blk-mq: fix tags leak when shrink nr_hw_queues")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs_CK63uoDpGBGZ6DN4OCTpzkR3UaVgK=LX8Owr8ej2ieQ@mail.gmail.com/
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230908005702.2183908-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7222657e 21-Aug-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: prealloc tags when increase tagset nr_hw_queues

Just like blk_mq_alloc_tag_set(), it's better to prepare all tags before
using to map to queue ctxs in blk_mq_map_swqueue(), which now have to
consider empty set->tags[].

The good point is that we can fallback easily if increasing nr_hw_queues
fail, instead of just mapping to hctx[0] when fail in blk_mq_map_swqueue().

And the fallback path already has tags free & clean handling, so all
is good.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230821095602.70742-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2bc4d7a3 21-Aug-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: delete redundant tagset map update when fallback

When we increase nr_hw_queues fail, the fallback path will use
blk_mq_update_queue_map() to clear and update all maps.
Obviously, this line of update of HCTX_TYPE_DEFAULT only is not
needed, so delete it.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230821095602.70742-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e1dd7bc9 21-Aug-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: fix tags leak when shrink nr_hw_queues

Although we don't need to realloc set->tags[] when shrink nr_hw_queues,
we need to free them. Or these tags will be leaked.

How to reproduce:
1. mount -t configfs configfs /mnt
2. modprobe null_blk nr_devices=0 submit_queues=8
3. mkdir /mnt/nullb/nullb0
4. echo 1 > /mnt/nullb/nullb0/power
5. echo 4 > /mnt/nullb/nullb0/submit_queues
6. rmdir /mnt/nullb/nullb0

In step 4, will alloc 9 tags (8 submit queues and 1 poll queue), then
in step 5, new_nr_hw_queues = 5 (4 submit queues and 1 poll queue).
At last in step 6, only these 5 tags are freed, the other 4 tags leaked.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230821095602.70742-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 65a558f6 21-Jul-2023 Bart Van Assche <bvanassche@acm.org>

block: Improve performance for BLK_MQ_F_BLOCKING drivers

blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.

The following caller chains have been reviewed:

blk_mq_run_hw_queue(hctx, false)
blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
blk_execute_rq() /* may sleep */
blk_mq_run_hw_queues(q, async=false)
blk_freeze_queue_start() /* may sleep */
blk_mq_requeue_work() /* may sleep */
scsi_kick_queue()
scsi_requeue_run_queue() /* may sleep */
scsi_run_host_queues()
scsi_ioctl_reset() /* may sleep */
blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list(plug, from_schedule=false)
__blk_flush_plug(plug, from_schedule=false)
blk_add_rq_to_plug()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_plug_issue_direct()
blk_mq_flush_plug_list() /* see above */
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list() /* see above */
blk_mq_try_issue_directly()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_try_issue_list_directly(hctx, list)
blk_mq_insert_requests() /* see above */

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 660e802c 16-Jul-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: use percpu csd to remote complete instead of per-rq csd

If request need to be completed remotely, we insert it into percpu llist,
and smp_call_function_single_async() if llist is empty previously.

We don't need to use per-rq csd, percpu csd is enough. And the size of
struct request is decreased by 24 bytes.

This way is cleaner, and looks correct, given block softirq is guaranteed
to be scheduled to consume the list if one new request is added to this
percpu list, either smp_call_function_single_async() returns -EBUSY or 0.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e5c0ca13 13-Aug-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: release scheduler resource when request completes

Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
submission path for post-flush requests").

We analysed the IO hang problem, found there are two postflush requests
waiting for each other.

The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
failed to blk_kick_flush() because of the second postflush request which
is inflight waiting in scheduler queue.

The second postflush waiting in scheduler queue can't be dispatched because
the first postflush hasn't released scheduler resource even though it has
completed by itself.

Fix it by releasing scheduler resource when the first postflush request
completed, so the second postflush can be dispatched and completed, then
make blk_kick_flush() succeed.

While at it, remove the check for e->ops.finish_request, as all
schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE()
at scheduler registration time, just like we do for insert_requests and
dispatch_request.

[1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/

Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com
Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev
[axboe: folded in incremental fix and added tags]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 70904263 14-Jul-2023 Ross Lagerwall <ross.lagerwall@citrix.com>

blk-mq: Fix stall due to recursive flush plug

We have seen rare IO stalls as follows:

* blk_mq_plug_issue_direct() is entered with an mq_list containing two
requests.
* For the first request, it sets last == false and enters the driver's
queue_rq callback.
* The driver queue_rq callback indirectly calls schedule() which calls
blk_flush_plug(). This may happen if the driver has the
BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq.
* blk_flush_plug() handles the remaining request in the mq_list. mq_list
is now empty.
* The original call to queue_rq resumes (with last == false).
* The loop in blk_mq_plug_issue_direct() terminates because there are no
remaining requests in mq_list.

The IO is now stalled because the last request submitted to the driver
had last == false and there was no subsequent call to commit_rqs().

Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0
which it will be in the recursive case, rather than checking if the
mq_list is empty. At the same time, adjust one of the callers to skip
the mq_list empty check as it is not necessary.

Fixes: dc5fc361d891 ("block: attempt direct issue of plug list")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5c17f45e 10-Jul-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq

The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
state of the block device. Most of the time request is allocated after
rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.

But for plug batched allocation introduced by the commit 47c122e35d7e
("block: pre-allocate requests if plug is started and is a batch"), we can
rq_qos_throttle() after the allocation of the request. This is what the
blk_mq_get_cached_request() does.

In this case, the cached request alloc_time_ns or start_time_ns is much
ahead if blocked in any qos ->throttle().

Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
request is actually used.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f6c80cff 12-Jun-2023 Keith Busch <kbusch@kernel.org>

block: add request polling helper

Provide a direct request polling will for drivers. The interface does
not require a bio, and can skip the overhead associated with polling
those. The biggest gain from skipping the relatively expensive xarray
lookup unnecessary when you already have the request.

With this, the simple rq/qc conversion functions have only one caller
each, so open code this and remove the helpers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230612190343.2087040-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6b7a3a2 24-Jun-2023 Ming Lei <ming.lei@redhat.com>

blk-mq: fix two misuses on RQF_USE_SCHED

Request allocated from sched tags can't be issued via ->queue_rqs()
directly, since driver tag isn't allocated yet. This is the 1st misuse
of RQF_USE_SCHED for figuring out plug->has_elevator.

Request allocated from sched tags can't be ended by
blk_mq_end_request_batch() too, fix the 2nd RQF_USE_SCHED misuse
in blk_mq_add_to_batch().

Without this patch, NVMe uring cmd passthrough IO workload can run into
hang easily with real io scheduler.

Fixes: dd6216bb16e8 ("blk-mq: make sure elevator callbacks aren't called for passthrough request")
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGS2=YrBjpLPOKa-gzcKuuOG60AGth5794PNCDwatdnnscB9ug@mail.gmail.com/
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230624130105.1443879-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2293cae7 21-Jun-2023 Ming Lei <ming.lei@redhat.com>

blk-mq: don't insert passthrough request into sw queue

In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue()
may just check if scheduler queue has request to dispatch, see
__blk_mq_sched_dispatch_requests(). Then IO hang may be caused because
all passthorugh requests may stay in sw queue.

And any passthrough request should have been inserted to hctx->dispatch
always.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: d97217e7f024 ("blk-mq: don't queue plugged passthrough requests into scheduler")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 24516565 16-Jun-2023 Ming Lei <ming.lei@redhat.com>

blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none

After grabbing q->sysfs_lock, q->elevator may become NULL because of
elevator switch.

Fix the NULL dereference on q->elevator by checking it with lock.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 30654614 13-Jun-2023 Ed Tsai <ed.tsai@mediatek.com>

blk-mq: check on cpu id when there is only one ctx mapping

commit f168420c62e7 ("blk-mq: don't redirect completion for hctx withs
only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx,
there will be no remote request.

But for ufs, the submission and completion queues could be asymmetric.
(e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and
ctx won't complete request on the submission cpu. In this situation,
this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result,
check on cpu id when there is only one ctx mapping.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com
[axboe: fixed up indentation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ddad5933 13-May-2023 Tian Lan <tian.lan@twosigma.com>

blk-mq: fix blk_mq_hw_ctx active request accounting

The nr_active counter continues to increase over time which causes the
blk_mq_get_tag to hang until the thread is rescheduled to a different
core despite there are still tags available.

kernel-stack

INFO: task inboundIOReacto:3014879 blocked for more than 2 seconds
Not tainted 6.1.15-amd64 #1 Debian 6.1.15~debian11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:inboundIOReacto state:D stack:0 pid:3014879 ppid:4557 flags:0x00000000
Call Trace:
<TASK>
__schedule+0x351/0xa20
scheduler+0x5d/0xe0
io_schedule+0x42/0x70
blk_mq_get_tag+0x11a/0x2a0
? dequeue_task_stop+0x70/0x70
__blk_mq_alloc_requests+0x191/0x2e0

kprobe output showing RQF_MQ_INFLIGHT bit is not cleared before
__blk_mq_free_request being called.

320 320 kworker/29:1H __blk_mq_free_request rq_flags 0x220c0 in-flight 1
b'__blk_mq_free_request+0x1 [kernel]'
b'bt_iter+0x50 [kernel]'
b'blk_mq_queue_tag_busy_iter+0x318 [kernel]'
b'blk_mq_timeout_work+0x7c [kernel]'
b'process_one_work+0x1c4 [kernel]'
b'worker_thread+0x4d [kernel]'
b'kthread+0xe6 [kernel]'
b'ret_from_fork+0x1f [kernel]'

Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230513221227.497327-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a80bd07 20-May-2023 Hengqi Chen <hengqi.chen@gmail.com>

block: introduce block_io_start/block_io_done tracepoints

Currently, several BCC ([0]) tools (biosnoop/biostacks/biotop) use
kprobes to blk_account_io_start/blk_account_io_done to implement
their functionalities. This is fragile because the target kernel
functions may be renamed ([1]) or inlined ([2]). So introduce two
new tracepoints for such use cases.

[0]: https://github.com/iovisor/bcc
[1]: https://github.com/iovisor/bcc/issues/3954
[2]: https://github.com/iovisor/bcc/issues/4261

Tested-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Tested-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230520084057.1467003-1-hengqi.chen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9a67aa52 18-May-2023 Christoph Hellwig <hch@lst.de>

blk-mq: don't use the requeue list to queue flush commands

Currently both requeues of commands that were already sent to the driver
and flush commands submitted from the flush state machine share the same
requeue_list struct request_queue, despite requeues doing head
insertions and flushes not. Switch to using two separate lists instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# be4c4278 18-May-2023 Bart Van Assche <bvanassche@acm.org>

blk-mq: use the I/O scheduler for writes from the flush state machine

Send write requests issued by the flush state machine through the normal
I/O submission path including the I/O scheduler (if present) so that I/O
scheduler policies are applied to writes with the FUA flag set.

Separate the I/O scheduler members from the flush members in struct
request since now a request may pass through both an I/O scheduler
and the flush machinery.

Note that the actual flush requests, which have no bio attached to the
request still bypass the I/O schedulers.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[hch: rebased]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 360f2648 18-May-2023 Christoph Hellwig <hch@lst.de>

blk-mq: defer to the normal submission path for non-flush flush commands

If blk_insert_flush decides that a command does not need to use the
flush state machine, return false and let blk_mq_submit_bio handle
it the normal way (including using an I/O scheduler) instead of doing
a bypass insert.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dd6216bb 17-May-2023 Christoph Hellwig <hch@lst.de>

blk-mq: make sure elevator callbacks aren't called for passthrough request

In case of q->elevator, passthrough request can still be marked as
RQF_ELV, so some elevator callbacks will be called for them.

Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that
are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for
non-flush, non-passthrough requests on such a queue.

Roughly based on two different patches from
Ming Lei <ming.lei@redhat.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fdcab6cd 17-May-2023 Christoph Hellwig <hch@lst.de>

blk-mq: remove RQF_ELVPRIV

RQF_ELVPRIV is set for all non-flush requests that have RQF_ELV set.
Expand this condition in the two users of the flag and remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230518053101.760632-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d97217e7 17-May-2023 Ming Lei <ming.lei@redhat.com>

blk-mq: don't queue plugged passthrough requests into scheduler

Passthrough requests should never be queued to the I/O scheduler,
as scheduling these opaque requests doesn't make sense, and I/O
schedulers might require req->bio to be always valid.

We never let passthrough requests insert into the scheduler before
commit 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging"),
restore this behavior even for passthrough requests issued under a plug.

[hch: use blk_mq_insert_requests for passthrough requests,
fix up the commit message and comments]

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGS2=YosaYaUTEMU3uaf+y=8MqSrhL7sYsJn8EwbaM=76p_4Qg@mail.gmail.com/
Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230518053101.760632-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d5bba5b 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: remove __blk_mq_run_hw_queue

__blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from
interrupt context and a blk_mq_run_dispatch_ops-protected call to
blk_mq_sched_dispatch_requests. Open code the call to
blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE
to blk_mq_run_hw_queue where it can be extended to all !async calls,
while the other call is from workqueue context and thus obviously does
not need the assert.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1aa8d875 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: move the !async handling out of __blk_mq_delay_run_hw_queue

Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with
async=false, so move the handling there.

With this __blk_mq_delay_run_hw_queue can be merged into
blk_mq_delay_run_hw_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cd735e11 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: move the blk_mq_hctx_stopped check in __blk_mq_delay_run_hw_queue

For the in-context dispatch, blk_mq_hctx_stopped is alredy checked in
blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection.
For the async dispatch case having a check before scheduling the work
still makes sense to avoid needless workqueue scheduling, so just keep it
for that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c20a1a2c 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: remove the blk_mq_hctx_stopped check in blk_mq_run_work_fn

blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests
under blk_mq_run_dispatch_ops() protection, so remove the duplicate check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b12e5c6c 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: pass a flags argument to blk_mq_add_to_requeue_list

Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 93fffe16 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: pass a flags argument to elevator_type->insert_requests

Instead of passing a bool at_head, pass down the full flags from the
blk_mq_insert_request interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2b597613 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: pass a flags argument to blk_mq_request_bypass_insert

Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 710fa378 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: pass a flags argument to blk_mq_insert_request

Replace the at_head bool with a flags argument that so far only contains
a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep
for head insertions into the blk-mq dispatch queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 214a4418 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_list

blk_mq_add_to_requeue_list takes a bool parameter to control how to kick
the requeue list at the end of the function. Move the call to
blk_mq_kick_requeue_list to the callers that want it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2394395c 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: don't run the hw_queue from blk_mq_request_bypass_insert

blk_mq_request_bypass_insert takes a bool parameter to control how to run
the queue at the end of the function. Move the blk_mq_run_hw_queue call
to the callers that want it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f0dbe6e8 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: don't run the hw_queue from blk_mq_insert_request

blk_mq_insert_request takes two bool parameters to control how to run
the queue at the end of the function. Move the blk_mq_run_hw_queue call
to the callers that want it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e1f44ac0 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: fold __blk_mq_try_issue_directly into its two callers

Due to the wildly different behavior based on the bypass_insert argument,
not a whole lot of code in __blk_mq_try_issue_directly is actually shared
between blk_mq_try_issue_directly and blk_mq_request_issue_directly.

Remove __blk_mq_try_issue_directly and fold the code into the two callers
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2b71b877 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a blk_mq_get_budget_and_tag helper

Factor out a helper from __blk_mq_try_issue_directly in preparation
of folding that function into its two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a1e948b8 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: refactor the DONTPREP/SOFTBARRIER andling in blk_mq_requeue_work

Split the RQF_DONTPREP and RQF_SOFTBARRIER in separate branches to make
the code more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 53548d2a 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: refactor passthrough vs flush handling in blk_mq_insert_request

While both passthrough and flush requests call directly into
blk_mq_request_bypass_insert, the parameters aren't the same.
Split the handling into two separate conditionals and turn the whole
function into an if/elif/elif/else flow instead of the gotos.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ec5c055 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: fold __blk_mq_insert_req_list into blk_mq_insert_request

Remove this very small helper and fold it into the only caller.

Note that this moves the trace_block_rq_insert out of ctx->lock, matching
the other calls to this tracepoint.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a88db1e0 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: fold __blk_mq_insert_request into blk_mq_insert_request

There is no good point in keeping the __blk_mq_insert_request around
for two function calls and a singler caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2bd215df 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: move blk_mq_sched_insert_request to blk-mq.c

blk_mq_sched_insert_request is the main request insert helper and not
directly I/O scheduler related. Move blk_mq_sched_insert_request to
blk-mq.c, rename it to blk_mq_insert_request and mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 05a93117 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_list

blk_mq_dispatch_plug_list is the only caller of
blk_mq_sched_insert_requests, and it makes sense to just fold it there
as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite
the name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 94aa228c 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: move more logic into blk_mq_insert_requests

Move all logic related to the direct insert (including the call to
blk_mq_run_hw_queue) into blk_mq_insert_requests to streamline the code
flow up a bit, and to allow marking blk_mq_try_issue_list_directly
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 90110e04 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: include <linux/blk-mq.h> in block/blk-mq.h

block/blk-mq.h needs various definitions from <linux/blk-mq.h>,
include it there instead of relying on the source files to include
both.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bebe84eb 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk-mq-tag.h

blk-mq-tag.h is always included by blk-mq.h, and causes recursive
inclusion hell with further changes. Just merge it into blk-mq.h
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 50947d7f 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: don't plug for head insertions in blk_execute_rq_nowait

Plugs never insert at head, so don't plug for head insertions.

Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 06965037 27-Mar-2023 Chaitanya Kulkarni <kch@nvidia.com>

block: open code __blk_account_io_done()

There is only one caller for __blk_account_io_done(), the function
is small enough to fit in its caller blk_account_io_done().

Remove the function and opencode in the its caller
blk_account_io_done().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e165fb4d 27-Mar-2023 Chaitanya Kulkarni <kch@nvidia.com>

block: open code __blk_account_io_start()

There is only one caller for __blk_account_io_start(), the function
is small enough to fit in its caller blk_account_io_start().

Remove the function and opencode in the its caller
blk_account_io_start().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 54bdd67d 20-Mar-2023 Keith Busch <kbusch@kernel.org>

blk-mq: remove hybrid polling

io_uring provides the only way user space can poll completions, and that
always sets BLK_POLL_NOSLEEP. This effectively makes hybrid polling dead
code, so remove it and everything supporting it.

Hybrid polling was effectively killed off with 9650b453a3d4b1, "block:
ignore RWF_HIPRI hint for sync dio", but still potentially reachable
through io_uring until d729cf9acb93119, "io_uring: don't sleep when
polling for I/O", but hybrid polling probably should not have been
reachable through that async interface from the beginning.

Fixes: 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio")
Fixes: d729cf9acb93 ("io_uring: don't sleep when polling for I/O")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230320194926.3353144-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b8562f0 15-Mar-2023 Eric Biggers <ebiggers@google.com>

blk-mq: return actual keyslot error in blk_insert_cloned_request()

To avoid hiding information, pass on the error code from
blk_crypto_rq_get_keyslot() instead of always using BLK_STS_IOERR.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 435c0e99 15-Mar-2023 Eric Biggers <ebiggers@google.com>

blk-crypto: remove blk_crypto_insert_cloned_request()

blk_crypto_insert_cloned_request() is the same as
blk_crypto_rq_get_keyslot(), so just use that directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9cd1e566 15-Mar-2023 Eric Biggers <ebiggers@google.com>

blk-mq: release crypto keyslot before reporting I/O complete

Once all I/O using a blk_crypto_key has completed, filesystems can call
blk_crypto_evict_key(). However, the block layer currently doesn't call
blk_crypto_put_keyslot() until the request is being freed, which happens
after upper layers have been told (via bio_endio()) the I/O has
completed. This causes a race condition where blk_crypto_evict_key()
can see 'slot_refs != 0' without there being an actual bug.

This makes __blk_crypto_evict_key() hit the
'WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)' and return without
doing anything, eventually causing a use-after-free in
blk_crypto_reprogram_all_keys(). (This is a very rare bug and has only
been seen when per-file keys are being used with fscrypt.)

There are two options to fix this: either release the keyslot before
bio_endio() is called on the request's last bio, or make
__blk_crypto_evict_key() ignore slot_refs. Let's go with the first
solution, since it preserves the ability to report bugs (via
WARN_ON_ONCE) where a key is evicted while still in-use.

Fixes: a892c8d52c02 ("block: Inline encryption support for blk-mq")
Cc: stable@vger.kernel.org
Reviewed-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 81ea1222 20-Apr-2023 Ming Lei <ming.lei@redhat.com>

Revert "block: Merge bio before checking ->cached_rq"

This reverts commit 23f3e3272e7a4d9fb870485cd6df1e4f9539282c.

blk-mq sched bio merge still needs request to grab queue usage counter,
so we can't simply call blk_mq_attempt_bio_merge() when queue usage
counter isn't held.

Fixes: 23f3e3272e7a ("block: Merge bio before checking ->cached_rq")
Cc: Xiao Ni <xni@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230420112018.1108058-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 38a8c4d1 31-Mar-2023 Keith Busch <kbusch@kernel.org>

blk-mq: directly poll requests

Polling needs a bio with a valid bi_bdev, but neither of those are
guaranteed for polled driver requests. Make request based polling
directly use blk-mq's polling function instead.

When executing a request from a polled hctx, we know the request's
cookie, and that it's from a live blk-mq queue that supports polling, so
we can safely skip everything that bio_poll provides.

Cc: stable@kernel.org
Reported-by: Martin Belanger <Martin.Belanger@dell.com>
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: Daniel Wagner <dwagner@suse.de>
Revieded-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20230331180056.1155862-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 34e0a279 13-Mar-2023 Jan Kara <jack@suse.cz>

block: do not reverse request order when flushing plug list

Commit 26fed4ac4eab ("block: flush plug based on hardware and software
queue order") changed flushing of plug list to submit requests one
device at a time. However while doing that it also started using
list_add_tail() instead of list_add() used previously thus effectively
submitting requests in reverse order. Also when forming a rq_list with
remaining requests (in case two or more devices are used), we
effectively reverse the ordering of the plug list for each device we
process. Submitting requests in reverse order has negative impact on
performance for rotational disks (when BFQ is not in use). We observe
10-25% regression in random 4k write throughput, as well as ~20%
regression in MariaDB OLTP benchmark on rotational storage on btrfs
filesystem.

Fix the problem by preserving ordering of the plug list when inserting
requests into the queuelist as well as by appending to requeue_list
instead of prepending to it.

Fixes: 26fed4ac4eab ("block: flush plug based on hardware and software queue order")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 49d24398 28-Feb-2023 Uday Shankar <ushankar@purestorage.com>

blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

The block layer might merge together discard requests up until the
max_discard_segments limit is hit, but blk_insert_cloned_request checks
the segment count against max_segments regardless of the req op. This
can result in errors like the following when discards are issued through
a DM device and max_discard_segments exceeds max_segments for the queue
of the chosen underlying device.

blk_insert_cloned_request: over max segments limit. (256 > 129)

Fix this by looking at the req_op and enforcing the appropriate segment
limit - max_discard_segments for REQ_OP_DISCARDs and max_segments for
everything else.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230301000655.48112-1-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 23f3e327 08-Feb-2023 Xiao Ni <xni@redhat.com>

block: Merge bio before checking ->cached_rq

It checks if plug->cached_rq is empty before merging bio. But the merge action
doesn't have relationship with plug->cached_rq, it trys to merge bio with
requests within plug->mq_list. Now it checks if ->cached_rq is empty before
merging bio. If it's empty, it will miss the merge chances. So move the merge
function before checking ->cached_rq.

Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230209031930.27354-1-xni@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27e8b2bb 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: use switch/case to improve readability in blk_mq_try_issue_list_directly

Use switch/case handle error as other function do to improve
readability in blk_mq_try_issue_list_directly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f1ce99f7 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: remove set of bd->last when get driver tag for next request fails

Commit 113285b473824 ("blk-mq: ensure that bd->last is always set
correctly") will set last if we failed to get driver tag for next
request to avoid flush miss as we break the list walk and will not
send the last request in the list which will be sent with last set
normally.
This code seems stale now becase the flush introduced is always
redundant as:
For case tag is really out, we will send a extra flush if we find
list is not empty after list walk.
For case some tag is freed before retry in blk_mq_prep_dispatch_rq for
next, then we can get a tag for next request in retry and flush notified
already is not necessary.

Just remove these stale codes.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ea58fe4 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: remove unnecessary error count and check in blk_mq_dispatch_rq_list

blk_mq_dispatch_rq_list will notify if hctx is busy in return bool. It will
return true if we are not busy and can handle more and return false on the
opposite. Inside blk_mq_dispatch_rq_list, errors is only used if list is
empty and we will return true if list is empty and (errors + queued) != 0.

There are three types of status returned from request:
-busy error BLK_STS*_RESOURCE: the failed request will be added back
to list and list will not be empty.
-BLK_STS_OK: We count queued for BLK_STS_OK
-rest error: We count errors for rest error

If list is empty, there is no request gets busy error then (errors +
queued) will be total requests in the list which is checked not empty at
beginning of blk_mq_dispatch_rq_list. So (errors + queued) != 0 is always
met if list is empty. Then the (errors + queued) != 0 check and errors
number count is not needed.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e4ef2e05 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: simplify flush check in blk_mq_dispatch_rq_list

1. Remove check of needs_resource and ret == BLK_STS_DEV_RESOURCE.
For busy error BLK_STS*_RESOURCE, request will always be added
back to list, so need_resource will not be true and ret will
not be == BLK_STS_DEV_RESOURCE if list is empty. We could remove
these dead check.

2. Check ret of last request instead of errors
If list is empty, we only need to explicitly commit_rqs
if error happens at last request which is stored in ret. So check
ret of last request instead of errors to remove unnecessary
commit_rqs triggered by errors returned from previous request.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 984ce0a7 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: use blk_mq_commit_rqs helper in blk_mq_try_issue_list_directly

Call blk_mq_commit_rqs instead of access ->commit_rqs directly. As you
can see in comment of blk_mq_commit_rqs, we only need explicitly call
this in two cases:
-did not queue everything initially scheduled to queue
-the last attempt to queue a request failed
Both cases can be checked with ret of last request which breaks list
walk. Then we can remove unnecessary error count and unnecessary
commit triggered by error besides cases described above.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0d617a83 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: remove unncessary error count and commit in blk_mq_plug_issue_direct

We need only to explicitly commit in two error cases:
-did not queue everything initially scheduled to queue
-the last attempt to queue a request failed
(see comment of blk_mq_commit_rqs for more details).
Both cases can be checked with ret of last request which breaks list walk.
Remove unnecessary error count and unnecessary commit triggered by error
which is not covered by cases described above.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 34c9f547 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: make blk_mq_commit_rqs a general function for all commits

1. move blk_mq_commit_rqs forward before functions need commits.
2. add queued check and only commits request if any request was queued
in blk_mq_commit_rqs to keep commit behavior consistent and remove
unnecessary commit.
3. split the queued clearing from blk_mq_plug_commit_rqs as it is
not wanted general.
4. sync current caller of blk_mq_commit_rqs with new general
blk_mq_commit_rqs.
5. document rule for unusual cases which need explicit commit_rqs.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3e368fb0 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: remove unncessary from_schedule parameter in blk_mq_plug_issue_direct

Function blk_mq_plug_issue_direct tries to issue batch requests in plug
list to driver directly. We will only issue plug request to driver if we
are not from scheduler, so from_scheduler parameter of
blk_mq_plug_issue_direct is always false.
Remove unncessary from_scheduler of blk_mq_plug_issue_direct.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 08e3599e 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: remove unnecessary list_empty check in blk_mq_try_issue_list_directly

We only break the list walk if we get 'BLK_STS_*RESOURCE'. We also
count errors for 'BLK_STS_*RESOURCE' error. If list is not empty,
errors will always be non-zero. So we can remove unnecessary list_empty
check. This will remove redundant list_empty check for case that
error happened at sending last request in list.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 47df9ce9 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: Fix potential io hung for shared sbitmap per tagset

Commit f906a6a0f4268 ("blk-mq: improve tag waiting setup for non-shared
tags") mark restart for unshared tags for improvement. At that time,
tags is only shared betweens queues and we can check if tags is shared
by test BLK_MQ_F_TAG_SHARED.
Afterwards, commit 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per
tagset") enabled tags share betweens hctxs inside a queue. We only
mark restart for shared hctxs inside a queue and may cause io hung if
there is no tag currently allocated by hctxs going to be marked restart.
Wait on sbitmap_queue instead of mark restart for shared hctxs case to
fix this.

Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 98b99e94 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait

For shared queues case, we will only wait on bitmap_tags if we fail to get
driver tag. However, rq could be from breserved_tags, then two problems
will occur:
1. io hung if no tag is currently allocated from bitmap_tags.
2. unnecessary wakeup when tag is freed to bitmap_tags while no tag is
freed to breserved_tags.
Wait on the bitmap which rq from to fix this.

Fixes: f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6ee858a3 18-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

blk-mq: avoid sleep in blk_mq_alloc_request_hctx

Commit 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx") add
blk_mq_alloc_request_hctx to send commands to a specific queue. If
BLK_MQ_REQ_NOWAIT is not set in tag allocation, we may change to different
hctx after sleep and get tag from unexpected hctx. So BLK_MQ_REQ_NOWAIT
must be set in flags for blk_mq_alloc_request_hctx.
After commit 600c3b0cea784 ("blk-mq: open code __blk_mq_alloc_request in
blk_mq_alloc_request_hctx"), blk_mq_alloc_request_hctx return -EINVAL
if both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are not set instead of
if BLK_MQ_REQ_NOWAIT is not set. So if BLK_MQ_REQ_NOWAIT is not set and
BLK_MQ_REQ_RESERVED is set, blk_mq_alloc_request_hctx could alloc tag
from unexpected hctx. I guess what we need here is that return -EINVAL
if either BLK_MQ_REQ_NOWAIT or BLK_MQ_REQ_RESERVED is not set.

Currently both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED will be set if
specific hctx is needed in nvme_auth_submit, nvmf_connect_io_queue
and nvmf_connect_admin_queue. Fix the potential BLK_MQ_REQ_NOWAIT missed
case in future.

Fixes: 600c3b0cea78 ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 81ea42b9 30-Jan-2023 Bart Van Assche <bvanassche@acm.org>

block: Fix the blk_mq_destroy_queue() documentation

Commit 2b3f056f72e5 moved a blk_put_queue() call from
blk_mq_destroy_queue() into its callers. Reflect this change in the
documentation block above blk_mq_destroy_queue().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Keith Busch <kbusch@kernel.org>
Fixes: 2b3f056f72e5 ("blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230130211233.831613-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 77465647 17-Jan-2023 Pavel Begunkov <asml.silence@gmail.com>

block: fix hctx checks for batch allocation

When there are no read queues read requests will be assigned a
default queue on allocation. However, blk_mq_get_cached_request() is not
prepared for that and will fail all attempts to grab read requests from
the cache. Worst case it doubles the number of requests allocated,
roughly half of which will be returned by blk_mq_free_plug_rqs().

It only affects batched allocations and so is io_uring specific.
For reference, QD8 t/io_uring benchmark improves by 20-35%.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/80d4511011d7d4751b4cf6375c4e38f237d935e3.1673955390.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 613b1488 04-Jan-2023 Jens Axboe <axboe@kernel.dk>

block: handle bio_split_to_limits() NULL return

This can't happen right now, but in preparation for allowing
bio_split_to_limits() returning NULL if it ended the bio, check for it
in all the callers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 90b0296e 06-Nov-2022 Ye Bin <yebin10@huawei.com>

block: fix crash in 'blk_mq_elv_switch_none'

Syzbot found the following issue:
general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__elevator_get block/elevator.h:94 [inline]
RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline]
RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline]
RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709
RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206
RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8
RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517
R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70
R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380
FS: 0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355
nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline]
__nbd_ioctl drivers/block/nbd.c:1481 [inline]
nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521
blkdev_ioctl+0x36e/0x800 block/ioctl.c:614
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

As after dd6f7f17bf58 commit move '__elevator_get(qe->type)' before set
'qe->type', so will lead to access wild pointer.
To solve above issue get 'qe->type' after set 'qe->type'.

Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com
Fixes:dd6f7f17bf58("block: add proper helpers for elevator_type module refcount management")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d4b2e0d4 22-Nov-2022 Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

block: fix missing nr_hw_queues update in blk_mq_realloc_tag_set_tags

The commit ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
cleaned up the function blk_mq_realloc_tag_set_tags. After this change,
the function does not update nr_hw_queues of struct blk_mq_tag_set when
new nr_hw_queues value is smaller than original. This results in failure
of queue number change of block devices. To avoid the failure, add the
missing nr_hw_queues update.

Fixes: ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/linux-block/20221118140640.featvt3fxktfquwh@shindev/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221122084917.2034220-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ee9d5521 09-Nov-2022 Christoph Hellwig <hch@lst.de>

blk-mq: simplify blk_mq_realloc_tag_set_tags

Use set->nr_hw_queues for the current number of tags, and remove the
duplicate set->nr_hw_queues update in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5ee20298 09-Nov-2022 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_alloc_tag_set_tags

There is no point in trying to share any code with the realloc case when
all that is needed by the initial tagset allocation is a simple
kcalloc_node.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 40467282 01-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

blk-mq: use if-else instead of goto in blk_mq_alloc_cached_request()

if-else is more readable than goto here.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Link: https://lore.kernel.org/r/d3306fa4e92dc9cc614edc8f1802686096bafef2.1667356813.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7edfd681 01-Nov-2022 Jinlong Chen <nickyc975@zju.edu.cn>

blk-mq: improve error handling in blk_mq_alloc_rq_map()

Use goto-style error handling like we do elsewhere in the kernel.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Link: https://lore.kernel.org/r/bbbc2d9b17b137798c7fb92042141ca4cbbc58cc.1667356813.git.nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 414dd48e 01-Nov-2022 Chao Leng <lengchao@huawei.com>

blk-mq: add tagset quiesce interface

Drivers that have shared tagsets may need to quiesce potentially a lot
of request queues that all share a single tagset (e.g. nvme). Add an
interface to quiesce all the queues on a given tagset. This interface is
useful because it can speedup the quiesce by doing it in parallel.

Because some queues should not need to be quiesced (e.g. the nvme
connect_q) when quiescing the tagset, introduce a
QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to
ski quiescing a particular queue.

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: simplify for the per-tag_set srcu_struct]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 483239c7 01-Nov-2022 Christoph Hellwig <hch@lst.de>

blk-mq: pass a tagset to blk_mq_wait_quiesce_done

Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 80bd4a7a 01-Nov-2022 Christoph Hellwig <hch@lst.de>

blk-mq: move the srcu_struct used for quiescing to the tagset

All I/O submissions have fairly similar latencies, and a tagset-wide
quiesce is a fairly common operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-12-hch@lst.de
[axboe: fix whitespace]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8537380b 01-Nov-2022 Christoph Hellwig <hch@lst.de>

blk-mq: skip non-mq queues in blk_mq_quiesce_queue

For submit_bio based queues there is no (S)RCU critical section during
I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done,
so skip doing any synchronization. No non-mq driver should be calling
this, but for now we have core callers that unconditionally call into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 64b36075 30-Oct-2022 Christoph Hellwig <hch@lst.de>

block: split elevator_switch

Split an elevator_disable helper from elevator_switch for the case where
we want to switch to no scheduler at all. This includes removing the
pointless elevator_switch_mq helper and removing the switch to no
schedule logic from blk_mq_init_sched.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 56c1ee92 30-Oct-2022 Jinlong Chen <nickyc975@zju.edu.cn>

blk-mq: remove redundant call to blk_freeze_queue_start in blk_mq_destroy_queue

The calling relationship in blk_mq_destroy_queue() is as follows:

blk_mq_destroy_queue()
...
-> blk_queue_start_drain()
-> blk_freeze_queue_start() <- called
...
-> blk_freeze_queue()
-> blk_freeze_queue_start() <- called again
-> blk_mq_freeze_queue_wait()
...

So there is a redundant call to blk_freeze_queue_start().

Replace blk_freeze_queue() with blk_mq_freeze_queue_wait() to avoid the
redundant call.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030083212.1251255-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 219cf43c 30-Oct-2022 Jinlong Chen <nickyc975@zju.edu.cn>

blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync

The only caller that needs queue_is_mq check is del_gendisk, so move the
check into it.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030094730.1275463-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 82c22947 25-Oct-2022 David Jeffery <djeffery@redhat.com>

blk-mq: avoid double ->queue_rq() because of early timeout

David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.

So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.

But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2b3f056f 18-Oct-2022 Christoph Hellwig <hch@lst.de>

blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue

The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference. Move the call to
blk_put_queue into the callers to allow removing the extra references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ed40ee3 20-Oct-2022 Jinlong Chen <nickyc975@zju.edu.cn>

block: fix up elevator_type refcounting

The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
the calling of e->ops.init_sched. If it fails before the calling, it does
nothing to the reference to the io scheduler module. But if it fails after
the calling, it releases the reference by calling kobject_put(&eq->kobj).

As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.

The same problem also exists in io schedulers' init_sched implementations.

We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.

Instead, We make elevator_alloc get the io scheduler module references
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.

The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git

[hch: split out a few bits into separate patches, use a non-try
module_get in elevator_alloc]

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dd6f7f17 20-Oct-2022 Christoph Hellwig <hch@lst.de>

block: add proper helpers for elevator_type module refcount management

Make sure we have helpers for all relevant module refcount operations on
the elevator_type in elevator.h, and use them. Move the call to the get
helper in blk_mq_elv_switch_none a bit so that it is obvious with a less
verbose comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 22c17e27 22-Nov-2022 Christoph Hellwig <hch@lst.de>

blk-mq: fix queue reference leak on blk_mq_alloc_disk_for_queue failure

Drop the request queue reference just acquired when __alloc_disk_node
failed.

Fixes: 6f8191fdf41d ("block: simplify disk shutdown")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20221122072753.426077-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 878eb6e4 31-Oct-2022 Al Viro <viro@zeniv.linux.org.uk>

block: blk_add_rq_to_plug(): clear stale 'last' after flush

blk_mq_flush_plug_list() empties ->mq_list and request we'd peeked there
before that call is gone; in any case, we are not dealing with a mix
of requests for different queues now - there's no requests left in the
plug.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 943f45b9 30-Oct-2022 Chen Jun <chenjun102@huawei.com>

blk-mq: Fix kmemleak in blk_mq_init_allocated_queue

There is a kmemleak caused by modprobe null_blk.ko

unreferenced object 0xffff8881acb1f000 (size 1024):
comm "modprobe", pid 836, jiffies 4294971190 (age 27.068s)
hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
ff ff ff ff ff ff ff ff 00 53 99 9e ff ff ff ff .........S......
backtrace:
[<000000004a10c249>] kmalloc_node_trace+0x22/0x60
[<00000000648f7950>] blk_mq_alloc_and_init_hctx+0x289/0x350
[<00000000af06de0e>] blk_mq_realloc_hw_ctxs+0x2fe/0x3d0
[<00000000e00c1872>] blk_mq_init_allocated_queue+0x48c/0x1440
[<00000000d16b4e68>] __blk_mq_alloc_disk+0xc8/0x1c0
[<00000000d10c98c3>] 0xffffffffc450d69d
[<00000000b9299f48>] 0xffffffffc4538392
[<0000000061c39ed6>] do_one_initcall+0xd0/0x4f0
[<00000000b389383b>] do_init_module+0x1a4/0x680
[<0000000087cf3542>] load_module+0x6249/0x7110
[<00000000beba61b8>] __do_sys_finit_module+0x140/0x200
[<00000000fdcfff51>] do_syscall_64+0x35/0x80
[<000000003c0f1f71>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

That is because q->ma_ops is set to NULL before blk_release_queue is
called.

blk_mq_init_queue_data
blk_mq_init_allocated_queue
blk_mq_realloc_hw_ctxs
for (i = 0; i < set->nr_hw_queues; i++) {
old_hctx = xa_load(&q->hctx_table, i);
if (!blk_mq_alloc_and_init_hctx(.., i, ..)) [1]
if (!old_hctx)
break;

xa_for_each_start(&q->hctx_table, j, hctx, j)
blk_mq_exit_hctx(q, set, hctx, j); [2]

if (!q->nr_hw_queues) [3]
goto err_hctxs;

err_exit:
q->mq_ops = NULL; [4]

blk_put_queue
blk_release_queue
if (queue_is_mq(q)) [5]
blk_mq_release(q);

[1]: blk_mq_alloc_and_init_hctx failed at i != 0.
[2]: The hctxs allocated by [1] are moved to q->unused_hctx_list and
will be cleaned up in blk_mq_release.
[3]: q->nr_hw_queues is 0.
[4]: Set q->mq_ops to NULL.
[5]: queue_is_mq returns false due to [4]. And blk_mq_release
will not be called. The hctxs in q->unused_hctx_list are leaked.

To fix it, call blk_release_queue in exception path.

Fixes: 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221031031242.94107-1-chenjun102@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e3c5a78c 26-Oct-2022 John Garry <john.garry@huawei.com>

blk-mq: Properly init requests from blk_mq_alloc_request_hctx()

Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
biotail, __sector, and __data_len members, which blk_mq_alloc_request()
has, so duplicate what we do in blk_mq_alloc_request().

Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1666780513-121650-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 76dd2980 11-Oct-2022 Yu Kuai <yukuai3@huawei.com>

blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()

Our syzkaller report a null pointer dereference, root cause is
following:

__blk_mq_alloc_map_and_rqs
set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
blk_mq_alloc_map_and_rqs
blk_mq_alloc_rqs
// failed due to oom
alloc_pages_node
// set->tags[hctx_idx] is still NULL
blk_mq_free_rqs
drv_tags = set->tags[hctx_idx];
// null pointer dereference is triggered
blk_mq_clear_rq_mapping(drv_tags, ...)

This is because commit 63064be150e4 ("blk-mq:
Add blk_mq_alloc_map_and_rqs()") merged the two steps:

1) set->tags[hctx_idx] = blk_mq_alloc_rq_map()
2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])

into one step:

set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()

Since tags is not initialized yet in this case, fix the problem by
checking if tags is NULL pointer in blk_mq_clear_rq_mapping().

Fixes: 63064be150e4 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ab3e1d3b 21-Sep-2022 Jens Axboe <axboe@kernel.dk>

block: allow end_io based requests in the completion batch handling

With end_io handlers now being able to potentially pass ownership of
the request upon completion, we can allow requests with end_io handlers
in the batch completion handling.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# de671d61 21-Sep-2022 Jens Axboe <axboe@kernel.dk>

block: change request end_io handler to pass back a return value

Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.

In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4b6a5d9c 21-Sep-2022 Jens Axboe <axboe@kernel.dk>

block: enable batched allocation for blk_mq_alloc_request()

The filesystem IO path can take advantage of allocating batches of
requests, if the underlying submitter tells the block layer about it
through the blk_plug. For passthrough IO, the exported API is the
blk_mq_alloc_request() helper, and that one does not allow for
request caching.

Wire up request caching for blk_mq_alloc_request(), which is generally
done without having a bio available upfront.

Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 110fdb44 29-Sep-2022 Pankaj Raghav <p.raghav@samsung.com>

block: add rationale for not using blk_mq_plug() when applicable

There are two places in the block layer at the moment where
blk_mq_plug() helper could be used instead of directly accessing the
plug from struct current. In both these cases, directly accessing the plug
should not have any consequences for zoned devices.

Make the intent explicit by adding comments instead of introducing unwanted
checks with blk_mq_plug() helper.[1]

[1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com
[axboe: fixup multi-line comment style]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8237c01f 27-Sep-2022 Keith Busch <kbusch@kernel.org>

blk-mq: use quiesced elevator switch when reinitializing queues

The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.

nvme nvme0: resetting controller
nvme nvme0: 32/0/0 default/read/poll queues
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
Oops: 0000 [#1] SMP PTI
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:kyber_has_work+0x29/0x70

...

Call Trace:
__blk_mq_do_dispatch_sched+0x83/0x2b0
__blk_mq_sched_dispatch_requests+0x12e/0x170
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x2b/0x50
process_one_work+0x1ef/0x380
worker_thread+0x2d/0x3e0

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f168420c 20-Sep-2022 Liu Song <liusong@linux.alibaba.com>

blk-mq: don't redirect completion for hctx withs only one ctx mapping

High-performance NVMe devices usually support a large hw queues, which
ensures a 1:1 mapping of hctx and ctx. In this case there will be no
remote request, so we don't need to care about it.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663731123-81536-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6e99ea4 23-Aug-2022 Kanchan Joshi <joshi.k@samsung.com>

block: export blk_rq_is_poll

This is in preparation to support iopoll for nvme passthrough.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6d5e8d21 05-Sep-2022 Miaohe Lin <linmiaohe@huawei.com>

blk-mq: remove unneeded needs_restart check

If code reaches here, needs_restart must be true. Remove this unneeded
needs_restart check. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905101950.4606-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a4e1d0b7 15-Aug-2022 Bart Van Assche <bvanassche@acm.org>

block: Change the return type of blk_mq_map_queues() into void

Since blk_mq_map_queues() and the .map_queues() callbacks always return 0,
change their return type into void. Most callers ignore the returned value
anyway.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org
[axboe: fold in fix from Bart]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 65fac0d5 26-Jul-2022 Yu Kuai <yukuai3@huawei.com>

blk-mq: fix io hung due to missing commit_rqs

Currently, in virtio_scsi, if 'bd->last' is not set to true while
dispatching request, such io will stay in driver's queue, and driver
will wait for block layer to dispatch more rqs. However, if block
layer failed to dispatch more rq, it should trigger commit_rqs to
inform driver.

There is a problem in blk_mq_try_issue_list_directly() that commit_rqs
won't be called:

// assume that queue_depth is set to 1, list contains two rq
blk_mq_try_issue_list_directly
blk_mq_request_issue_directly
// dispatch first rq
// last is false
__blk_mq_try_issue_directly
blk_mq_get_dispatch_budget
// succeed to get first budget
__blk_mq_issue_directly
scsi_queue_rq
cmd->flags |= SCMD_LAST
virtscsi_queuecommand
kick = (sc->flags & SCMD_LAST) != 0
// kick is false, first rq won't issue to disk
queued++

blk_mq_request_issue_directly
// dispatch second rq
__blk_mq_try_issue_directly
blk_mq_get_dispatch_budget
// failed to get second budget
ret == BLK_STS_RESOURCE
blk_mq_request_bypass_insert
// errors is still 0

if (!list_empty(list) || errors && ...)
// won't pass, commit_rqs won't be called

In this situation, first rq relied on second rq to dispatch, while
second rq relied on first rq to complete, thus they will both hung.

Fix the problem by also treat 'BLK_STS_*RESOURCE' as 'errors' since
it means that request is not queued successfully.

Same problem exists in blk_mq_dispatch_rq_list(), 'BLK_STS_*RESOURCE'
can't be treated as 'errors' here, fix the problem by calling
commit_rqs if queue_rq return 'BLK_STS_*RESOURCE'.

Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d3b38596 02-Aug-2022 Yufen Yu <yuyufen@huawei.com>

blk-mq: run queue no matter whether the request is the last request

We do test on a virtio scsi device (/dev/sda) and the default mq
scheduler is 'none'. We found a IO hung as following:

blk_finish_plug
blk_mq_plug_issue_direct
scsi_mq_get_budget
//get budget_token fail and sdev->restarts=1

scsi_end_request
scsi_run_queue_async
//sdev->restart=0 and run queue

blk_mq_request_bypass_insert
//add request to hctx->dispatch list

//continue to dispath plug list
blk_mq_dispatch_plug_list
blk_mq_try_issue_list_directly
//success issue all requests from plug list

After .get_budget fail, scsi_mq_get_budget will increase 'restarts'.
Normally, it will run hw queue when io complete and set 'restarts'
as 0. But if we run queue before adding request to the dispatch list
and blk_mq_dispatch_plug_list also success issue all requests, then
on one will run queue, and the request will be stall in the dispatch
list and cannot complete forever.

It is wrong to use last request of plug list to decide if run queue is
needed since all the remained requests in plug list may be from other
hctxs. To fix the bug, pass run_queue as true always to
blk_mq_request_bypass_insert().

Fix-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fixes: dc5fc361d891 ("block: attempt direct issue of plug list")
Link: https://lore.kernel.org/r/20220803023355.3687360-1-yuyufen@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a8239f03 18-Aug-2022 Yu Kuai <yukuai3@huawei.com>

blk-mq: remove unused function blk_mq_queue_stopped()

blk_mq_queue_stopped() doesn't have any caller, which was found by
code coverage test, thus remove it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220818063555.3741222-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c55ddd90 26-Jul-2022 Christoph Hellwig <hch@lst.de>

block: pass struct queue_limits to the bio splitting helpers

Allow using the splitting helpers on just a queue_limits instead of
a full request_queue structure. This will eventually allow file systems
or remapping drivers to split REQ_OP_ZONE_APPEND bios based on limits
calculated as the minimum common capabilities over multiple devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 51d798cd 26-Jul-2022 Christoph Hellwig <hch@lst.de>

block: change the blk_queue_bounce calling convention

The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a97806f 26-Jul-2022 Christoph Hellwig <hch@lst.de>

block: change the blk_queue_split calling convention

The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.

Also give it and the helpers used to implement it more descriptive names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a3e5cc7 20-Jul-2022 Christoph Hellwig <hch@lst.de>

blk-mq: fix error handling in __blk_mq_alloc_disk

To fully clean up the queue if the disk allocation fails we need to
call blk_mq_destroy_queue and not just blk_put_queue.

Fixes: 6f8191fdf41d ("block: simplify disk shutdown")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220720130541.1323531-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 16458cf3 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

block: Use the new blk_opf_t type

Use the new blk_opf_t type for arguments and variables that represent
request flags or a bitwise combination of a request operation and
request flags. Rename the function arguments and also a structure member
that hold a request operation and flags from 'rw' into 'opf'.

This patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-7-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6deacb3b 06-Jul-2022 Christoph Hellwig <hch@lst.de>

block: simplify blk_mq_plug

Drop the unused q argument, and invert the check to move the exception
into a branch and the regular path as the normal return.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2dd6532e 06-Jul-2022 John Garry <john.garry@huawei.com>

blk-mq: Drop 'reserved' arg of busy_tag_iter_fn

We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter
function so it may be dropped.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9bdb4833 06-Jul-2022 John Garry <john.garry@huawei.com>

blk-mq: Drop blk_mq_ops.timeout 'reserved' arg

With new API blk_mq_is_reserved_rq() we can tell if a request is from
the reserved pool, so stop passing 'reserved' arg. There is actually
only a single user of that arg for all the callback implementations, which
can use blk_mq_is_reserved_rq() instead.

This will also allow us to stop passing the same 'reserved' around the
blk-mq iter functions next.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 99e48cd6 06-Jul-2022 John Garry <john.garry@huawei.com>

blk-mq: Add a flag for reserved requests

Add a flag for reserved requests so that drivers may know this for any
special handling.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# eaa870f9 28-Jun-2022 Christoph Hellwig <hch@lst.de>

blk-mq: rename blk_mq_sysfs_{,un}register

Add a _hctx postfix to better describe what the functions do, match
the debugfs equivalents and release the old names for functions that
should be using them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6f8191fd 19-Jun-2022 Christoph Hellwig <hch@lst.de>

block: simplify disk shutdown

Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.

Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.

This saves an extra queue freeze for devices without a separately
allocated queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a78418e6 23-Jun-2022 Jan Kara <jack@suse.cz>

block: Always initialize bio IO priority on submit

Currently, IO priority set in task's IO context is not reflected in the
bio->bi_ioprio for most IO (only io_uring and direct IO set it). This
results in odd results where process is submitting some bios with one
priority and other bios with a different (unset) priority and due to
differing priorities bios cannot be merged. Make sure bio->bi_ioprio is
always set on bio submission.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9c6227e0 23-Jun-2022 Jan Kara <jack@suse.cz>

block: Initialize bio priority earlier

Bio's IO priority needs to be initialized before we try to merge the bio
with other bios. Otherwise we could merge bios which would otherwise
receive different IO priorities leading to possible QoS issues.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 82b74cac 23-Jun-2022 Jan Kara <jack@suse.cz>

blk-ioprio: Convert from rqos policy to direct call

Convert blk-ioprio handling from a rqos policy to a direct call from
blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy
anyway, it just needs a hook in bio submission path to set the bio's IO
priority. Secondly, the rqos .track hook gets actually called too late
for blk-ioprio purposes and introducing a special rqos hook just for
blk-ioprio looks even weirder.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3c8f9da4 22-Jun-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Don't disable preemption around __blk_mq_run_hw_queue().

__blk_mq_delay_run_hw_queue() disables preemption to get a stable
current CPU number and then invokes __blk_mq_run_hw_queue() if the CPU
number is part the mask.

__blk_mq_run_hw_queue() acquires a spin_lock_t which is a sleeping lock
on PREEMPT_RT and can't be acquired with disabled preemption.

It is not required for correctness to invoke __blk_mq_run_hw_queue() on
a CPU matching hctx->cpumask. Both (async and direct requests) can run
on a CPU not matching hctx->cpumask.

The CPU mask without disabling preemption and invoking
__blk_mq_run_hw_queue().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YrLSEiNvagKJaDs5@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 51ab80f0 15-Jun-2022 Bart Van Assche <bvanassche@acm.org>

block: Make blk_mq_get_sq_hctx() select the proper hardware queue type

Since the introduction of blk_mq_get_hctx_type() the operation type in
the second argument of blk_mq_get_hctx_type() matters. The introduction
of blk_mq_get_hctx_type() caused blk_mq_get_sq_hctx() to select a
hardware queue of type HCTX_TYPE_READ instead of HCTX_TYPE_DEFAULT.
Switch to hardware queue type HCTX_TYPE_DEFAULT since HCTX_TYPE_READ
should only be used for read requests.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220615225549.1054905-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2645672f 21-Jun-2022 Jens Axboe <axboe@kernel.dk>

block: pop cached rq before potentially blocking rq_qos_throttle()

If rq_qos_throttle() ends up blocking, then we will have invalidated and
flushed our current plug. Since blk_mq_get_cached_request() hasn't
popped the cached request off the plug list just yet, we end holding a
pointer to a request that is no longer valid. This insta-crashes with
rq->mq_hctx being NULL in the validity checks just after.

Pop the request off the cached list before doing rq_qos_throttle() to
avoid using a potentially stale request.

Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken <dylany@fb.com>
Tested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6cfeadbf 15-Jun-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: don't clear flush_rq from tags->rqs[]

commit 364b61818f65 ("blk-mq: clearing flush request reference in
tags->rqs[]") is added to clear the to-be-free flush request from
tags->rqs[] for avoiding use-after-free on the flush rq.

Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time
by ~8s because running scsi probe which may create and remove lots of
unpresent LUNs on megaraid-sas which uses BLK_MQ_F_TAG_HCTX_SHARED and
each request queue has lots of hw queues.

Improve the situation by not running blk_mq_clear_flush_rq_mapping if
disk isn't added when there can't be any flush request issued.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d337ceb 15-Jun-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid to touch q->elevator without any protection

q->elevator is referred in blk_mq_has_sqsched() without any protection,
no .q_usage_counter is held, no queue srcu and rcu read lock is held,
so potential use-after-free may be triggered.

Fix the issue by adding one queue flag for checking if the elevator
uses single queue style dispatch. Meantime the elevator feature flag
of ELEVATOR_F_MQ_AWARE isn't needed any more.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5fd7a84a 15-Jun-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none

elevator can be tore down by sysfs switch interface or disk release, so
hold ->sysfs_lock before referring to q->elevator, then potential
use-after-free can be avoided.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 14dc7a18 15-Jun-2022 Bart Van Assche <bvanassche@acm.org>

block: Fix handling of offline queues in blk_mq_alloc_request_hctx()

This patch prevents that test nvme/004 triggers the following:

UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
show_stack+0x52/0x58
dump_stack_lvl+0x49/0x5e
dump_stack+0x10/0x12
ubsan_epilogue+0x9/0x3b
__ubsan_handle_out_of_bounds.cold+0x44/0x49
blk_mq_alloc_request_hctx+0x304/0x310
__nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
nvmf_dev_write+0xae/0x111 [nvme_fabrics]
vfs_write+0x144/0x560
ksys_write+0xb7/0x140
__x64_sys_write+0x42/0x50
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b81c14ca 30-May-2022 Haisu Wang <haisuwang@tencent.com>

blk-mq: do not update io_ticks with passthrough requests

Flush or passthrough requests are not accounted as normal IO in completion.
To reflect iostat for slow IO, io_ticks is updated when stat show called
based on inflight numbers.
It may cause inconsistent io_ticks calculation result.

So do not account non-passthrough request when check inflight.

Fixes: 86d7331299fd ("block: update io_ticks when io hang")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e2e53086 24-May-2022 Christoph Hellwig <hch@lst.de>

blk-mq: remove the done argument to blk_execute_rq_nowait

Let the caller set it together with the end_io_data instead of passing
a pointless argument. Note the the target code did in fact already
set it and then just overrode it again by calling blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 32ac5a9b 24-May-2022 Christoph Hellwig <hch@lst.de>

blk-mq: avoid a mess of casts for blk_end_sync_rq

Instead of trying to cast a __bitwise 32-bit integer to a larger integer
and then a pointer, just allow a struct with the blk_status_t and the
completion on stack and set the end_io_data to that. Use the
opportunity to move the code to where it belongs and drop rather
confusing comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ae948fd6 24-May-2022 Christoph Hellwig <hch@lst.de>

blk-mq: remove __blk_execute_rq_nowait

We don't want to plug for synchronous execution that where we immediately
wait for the request. Once that is done not a whole lot of code is
shared, so just remove __blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5d05426e 22-May-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx

blk_mq_run_hw_queues() could be run when there isn't queued request and
after queue is cleaned up, at that time tagset is freed, because tagset
lifetime is covered by driver, and often freed after blk_cleanup_queue()
returns.

So don't touch ->tagset for figuring out current default hctx by the mapping
built in request queue, so use-after-free on tagset can be avoided. Meantime
this way should be fast than retrieving mapping from tagset.

Cc: "yukuai (C)" <yukuai3@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Fixes: b6e68ee82585 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2aaf5160 21-May-2022 Julia Lawall <Julia.Lawall@inria.fr>

blk-mq: fix typo in comment

Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20220521111145.81697-29-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a327c341 12-May-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: fix passthrough plugging

First we can't add request into plug list in blk_mq_request_bypass_insert
which may be called when flushing plug list, so nested plug is caused.

Second if polled passthrough request is inserted via blk_execute_rq(),
it can't be added to plug list too since io polling needs the request
to be issued to driver.

Fixes the two by moving plugging into blk_execute_rq_no_wait().

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220512140010.1458645-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1c2d2fff 10-May-2022 Jens Axboe <axboe@kernel.dk>

block: wire-up support for passthrough plugging

Add support for plugging in passthrough path. When plugging is enabled, the
requests are added to a plug instead of getting dispatched to the driver.
And when the plug is finished, the whole batch gets dispatched via
->queue_rqs which turns out to be more efficient. Otherwise dispatching
used to happen via ->queue_rq, one request at a time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4cddeaca 27-Apr-2022 Tejun Heo <tj@kernel.org>

Revert "block: inherit request start time from bio for BLK_CGROUP"

This reverts commit 0006707723233cb2a9a23ca19fc3d0864835704c. It has a
couple problems:

* bio_issue_time() is stored in bio->bi_issue truncated to 51 bits. This
overflows in slightly over 26 days. Setting rq->io_start_time_ns with it
means that io duration calculation would yield >26days after 26 days of
uptime. This, for example, confuses kyber making it cause high IO
latencies.

* rq->io_start_time_ns should record the time that the IO is issued to the
device so that on-device latency can be measured. However,
bio_issue_time() is set before the bio goes through the rq-qos controllers
(wbt, iolatency, iocost), so when the bio gets throttled in any of the
mechanisms, the measured latencies make no sense - on-device latencies end
up higher than request-alloc-to-completion latencies.

We'll need a smarter way to avoid calling ktime_get_ns() repeatedly
back-to-back. For now, let's revert the commit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.16+
Link: https://lore.kernel.org/r/YmmeOLfo5lzc+8yI@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3d973a76 23-Mar-2022 Christoph Hellwig <hch@lst.de>

block: don't print I/O error warning for dead disks

When a disk has been marked dead, don't print warnings for I/O errors
as they are very much expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220323163815.1526998-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4a3b666e 31-Mar-2022 Jakob Koschel <jakobkoschel@gmail.com>

block: use dedicated list iterator variable

To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.

To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Link: https://lore.kernel.org/r/20220331091218.641532-1-jakobkoschel@gmail.com
[axboe: move lookup to where return value is checked]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c75e707f 04-Mar-2022 Christoph Hellwig <hch@lst.de>

block: remove the per-bio/request write hint

With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 26fed4ac 11-Mar-2022 Jens Axboe <axboe@kernel.dk>

block: flush plug based on hardware and software queue order

We used to sort the plug list if we had multiple queues before dispatching
requests to the IO scheduler. This usually isn't needed, but for certain
workloads that interleave requests to disks, it's a less efficient to
process the plug list one-by-one if everything is interleaved.

Don't sort the list, but skip through it and flush out entries that have
the same target at the same time.

Fixes: df87eb0fce8f ("block: get rid of plug list sorting")
Reported-and-tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# de3d347f 07-Mar-2022 Ming Lei <ming.lei@redhat.com>

block: don't remove hctx debugfs dir from blk_mq_exit_queue

The queue's top debugfs dir is removed from blk_release_queue(), so all
hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue()
is only called from blk_cleanup_queue(), it isn't necessary to remove
hctx debugfs from blk_mq_exit_queue().

So remove it from blk_mq_exit_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308055200.735835-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e02657ea 07-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: handle already freed tags gracefully in blk_mq_free_rqs

To simplify further changes allow for double calling blk_mq_free_rqs on
a queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: split out from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220308055200.735835-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 41fa7222 07-Mar-2022 Christoph Hellwig <hch@lst.de>

blk-mq: do not include passthrough requests in I/O accounting

I/O accounting buckets I/O into the read/write/discard categories into
which passthrough I/O does not fit at all. It also accounts to the
block_device, which may not even exist for passthrough I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220308055200.735835-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e5cc99e 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: manage hctx map via xarray

First code becomes more clean by switching to xarray from plain array.

Second use-after-free on q->queue_hw_ctx can be fixed because
queue_for_each_hw_ctx() may be run when updating nr_hw_queues is
in-progress. With this patch, q->hctx_table is defined as xarray, and
this structure will share same lifetime with request queue, so
queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com
[axboe: fix blk_mq_hw_ctx forward declaration]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f481208 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: prepare for implementing hctx table via xarray

It is inevitable to cause use-after-free on q->queue_hw_ctx between
queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting
to xarray can fix the uaf, meantime code gets cleaner.

Prepare for converting q->queue_hctx_ctx into xarray, one thing is that
xa_for_each() can only accept 'unsigned long' as index, so changes type
of hctx index of queue_for_each_hw_ctx() into 'unsigned long'.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 42ee3061 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: reconfigure poll after queue map is changed

queue map can be changed when updating nr_hw_queues, so we need to
reconfigure queue's poll capability. Add one helper for doing this job.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 306f13ee 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: simplify reallocation of hw ctxs a bit

blk_mq_alloc_and_init_hctx() has already taken reuse into account, so
no need to do it outside, then we can simplify blk_mq_realloc_hw_ctxs().

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d805131 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: figure out correct numa node for hw queue

The current code always uses default queue map and hw queue index
for figuring out the numa node for hw queue, this way isn't correct
because blk-mq supports three queue maps, and the correct queue map
should be used for the specified hw queue.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8f5fea65 31-Jan-2022 David Jeffery <djeffery@redhat.com>

blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues

When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.

To avoid this I/O stall when an hctx's run_work is already pending,
leave it untouched to run at its current designated time rather than
extending its delay. The work will still run which keeps closed the race
calling blk_mq_delay_run_hw_queues is needed for while also avoiding the
I/O stall.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7f36b7d0 15-Feb-2022 Ming Lei <ming.lei@redhat.com>

block: move blk_crypto_bio_prep() out of blk-mq.c

blk_crypto_bio_prep() is called for both bio based and blk-mq drivers,
so move it out of blk-mq.c, then we can unify this kind of handling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 28db4711 15-Feb-2022 Christoph Hellwig <hch@lst.de>

blk-mq: remove the request_queue argument to blk_insert_cloned_request

The request must be submitted to the queue it was allocated for, so
remove the extra request_queue argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a5efda3c 15-Feb-2022 Christoph Hellwig <hch@lst.de>

blk-mq: fold blk_cloned_rq_check_limits into blk_insert_cloned_request

Fold blk_cloned_rq_check_limits into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 248c7933 15-Feb-2022 Christoph Hellwig <hch@lst.de>

blk-mq: make the blk-mq stacking code optional

The code to stack blk-mq drivers is only used by dm-multipath, and
will preferably stay that way. Make it optional and only selected
by device mapper, so that the buildbots more easily catch abuses
like the one that slipped in in the ufs driver in the last merged
window. Another positive side effects is that kernel builds without
device mapper shrink a little bit as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d5869fdc 10-Feb-2022 Yang Shi <shy828301@gmail.com>

block: introduce block_rq_error tracepoint

Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.

But there are a few problems with this approach:

1. Even kernel trace filter could do the filtering work, there is
still some overhead after we enable this tracepoint.

2. The filter is merely based on errno, which does not align with kernel
logic to check the errors for print_req_error().

3. block_rq_complete only provides dev major and minor to identify
the block device, it is not convenient to use in user-space.

So introduce a new tracepoint block_rq_error just for the error case.
With this patch, rasdaemon could switch to block_rq_error.

Since the new tracepoint has the similar implementation with
block_rq_complete, so move the existing code from TRACE_EVENT
block_rq_complete() into new event class block_rq_completion(). Then add
event for block_rq_complete and block_rq_err respectively from the newly
created event class per the suggestion from Chaitanya Kulkarni.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# abfc426d 02-Feb-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device to bio_clone_fast

Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a5aa8d1 08-Mar-2022 Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection

Commit 9d497e2941c3 ("block: don't protect submit_bio_checks by
q_usage_counter") moved blk_mq_attempt_bio_merge and rq_qos_throttle
calls out of q_usage_counter protection. However, these functions require
q_usage_counter protection. The blk_mq_attempt_bio_merge call without
the protection resulted in blktests block/005 failure with KASAN null-
ptr-deref or use-after-free at bio merge. The rq_qos_throttle call
without the protection caused kernel hang at qos throttle.

To fix the failures, move the blk_mq_attempt_bio_merge and
rq_qos_throttle calls back to q_usage_counter protection.

Fixes: 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20220308080915.3473689-1-shinichiro.kawasaki@wdc.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a12821d5 11-Feb-2022 Pankaj Raghav <p.raghav@samsung.com>

block: Add handling for zone append command in blk_complete_request

Zone append command needs special handling to update the bi_sector
field in the bio struct with the actual position of the data in the
device. It is stored in __sector field of the request struct.

Fixes: 5581a5ddfe8d ("block: add completion handler for fast path")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220211093425.43262-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 592ee119 25-Jan-2022 Yu Kuai <yukuai3@huawei.com>

blk-mq: fix missing blk_account_io_done() in error path

If blk_mq_request_issue_directly() failed from
blk_insert_cloned_request(), the request will be accounted start.
Currently, blk_insert_cloned_request() is only called by dm, and such
request won't be accounted done by dm.

In normal path, io will be accounted start from blk_mq_bio_to_request(),
when the request is allocated, and such io will be accounted done from
__blk_mq_end_request_acct() whether it succeeded or failed. Thus add
blk_account_io_done() to fix the problem.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220126012132.3111551-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9b51d9d8 14-Aug-2021 Yury Norov <yury.norov@gmail.com>

cpumask: replace cpumask_next_* with cpumask_first_* where appropriate

cpumask_first() is a more effective analogue of 'next' version if n == -1
(which means start == 0). This patch replaces 'next' with 'first' where
things look trivial.

There's no cpumask_first_zero() function, so create it.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>


# fd9f4e62 18-Jan-2022 Christoph Hellwig <hch@lst.de>

block: assign bi_bdev for cloned bios in blk_rq_prep_clone

bio_clone_fast() sets the cloned bio to have the same ->bi_bdev as the
source bio. This means that when request-based dm called setup_clone(),
the cloned bio had its ->bi_bdev pointing to the dm device. After Commit
0b6e522cdc4a ("blk-mq: use ->bi_bdev for I/O accounting")
__blk_account_io_start() started using the request's ->bio->bi_bdev for
I/O accounting, if it was set. This caused IO going to the underlying
devices to use the dm device for their I/O accounting.

Set up the proper ->bi_bdev in blk_rq_prep_clone based on the whole
device bdev for the queue the request is cloned onto.

Fixes: 0b6e522cdc4a ("blk-mq: use ->bi_bdev for I/O accounting")
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[hch: the commit message is mostly from a different patch from Benjamin]
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Link: https://lore.kernel.org/r/20220118070444.1241739-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9d497e29 04-Jan-2022 Ming Lei <ming.lei@redhat.com>

block: don't protect submit_bio_checks by q_usage_counter

Commit cc9c884dd7f4 ("block: call submit_bio_checks under q_usage_counter")
uses q_usage_counter to protect submit_bio_checks for avoiding IO after
disk is deleted by del_gendisk().

Turns out the protection isn't necessary, because once
blk_mq_freeze_queue_wait() in del_gendisk() returns:

1) all in-flight IO has been done

2) all new IO will be failed in __bio_queue_enter() because
q_usage_counter is dead, and GD_DEAD is set

3) both disk and request queue instance are safe since caller of
submit_bio() guarantees that the disk can't be closed.

Once submit_bio_checks() needn't the protection of q_usage_counter, we can
move submit_bio_checks before calling blk_mq_submit_bio() and
->submit_bio(). With this change, we needn't to throttle queue with
holding one allocated request, then precise driver tag or request won't be
wasted in throttling. Meantime we can unify the bio check for both bio
based and request based driver.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220104134223.590803-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 518579a9 20-Dec-2021 Keith Busch <kbusch@kernel.org>

blk-mq: blk-mq: check quiesce state before queue_rqs

The low level drivers don't expect to see new requests after a
successful quiesce completes. Check the queue quiesce state within the
rcu protected area prior to calling the driver's queue_rqs().

Fixes: 3c67d44de787 ("block: add mq_ops->queue_rqs hook")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211220205919.180191-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3c67d44d 03-Dec-2021 Jens Axboe <axboe@kernel.dk>

block: add mq_ops->queue_rqs hook

If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this, if not the usual one-by-one path is used.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5581a5dd 01-Dec-2021 Jens Axboe <axboe@kernel.dk>

block: add completion handler for fast path

The batched completions only deal with non-partial requests anyway,
and it doesn't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ab30a33 06-Dec-2021 John Garry <john.garry@huawei.com>

blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument

The only user of blk_mq_hw_ctx blk_mq_hw_ctx argument is
blk_mq_rq_inflight().

Function blk_mq_rq_inflight() uses the hctx to find the associated request
queue to match against the request. However this same check is already
done in caller bt_iter(), so drop this check.

With that change there are no more users of busy_iter_fn blk_mq_hw_ctx
argument, so drop the argument.

Reviewed-by Hannes Reinecke <hare@suse.de>

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 73f3760e 05-Dec-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()

blk_mq_run_dispatch_ops() is defined as one macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add one local variable
for holding request queue.

Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c9267 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 41adf531 06-Dec-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: don't run might_sleep() if the operation needn't blocking

The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.

Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4cafe86c 03-Dec-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: run dispatch lock once in case of issuing from list

It isn't necessary to call blk_mq_run_dispatch_ops() once for issuing
single request directly, and enough to do it one time when issuing from
whole list.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bcc330f4 03-Dec-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: pass request queue to blk_mq_run_dispatch_ops

We have switched to allocate srcu into request queue, so it is fine
to pass request queue to blk_mq_run_dispatch_ops().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 704b914f 03-Dec-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: move srcu from blk_mq_hw_ctx to request_queue

In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch
critical area. However, this srcu instance stays at the end of hctx, and
it often takes standalone cacheline, often cold.

Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable which is allocated from heap instead of
being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It
doesn't matter if srcu structure stays in hctx or request queue.

So switch to per-request-queue srcu for protecting dispatch, and this
way simplifies quiesce a lot, not mention quiesce is always done on the
request queue wide.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2a904d00 03-Dec-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: remove hctx_lock and hctx_unlock

Remove hctx_lock and hctx_unlock, and add one helper of
blk_mq_run_dispatch_ops() to run code block defined in dispatch_ops
with rcu/srcu read held.

Compared with hctx_lock()/hctx_unlock():

1) remove 2 branch to 1, so we just need to check
(hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops

2) srcu_idx needn't to be touched in case of non-blocking

3) might_sleep_if() can be moved to the blocking branch

Also put the added blk_mq_run_dispatch_ops() in private header, so that
the following patch can use it out of blk-mq.c.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a467d0f 14-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: switch to atomic_t for request references

refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.

This borrows that same implementation, which in turn is based on the
mm implementation from Linus.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a08ed9aa 02-Dec-2021 Jens Axboe <axboe@kernel.dk>

block: fix double bio queue when merging in cached request path

When we attempt to merge off the cached request path, we return NULL
if successful. This makes the caller believe that it's should allocate
a new request, and hence we end up with the bio both merged and associated
with a new request. This, predictably, leads to all sorts of crashes.

Pass in a pointer to the bio pointer, and clear it for the merge case.
Then the caller knows that the bio is already queued, and no new requests
need to get allocated.

Fixes: 5b13bc8a3fd5 ("blk-mq: cleanup request allocation")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 373b5416 02-Dec-2021 Jens Axboe <axboe@kernel.dk>

block: get rid of useless goto and label in blk_mq_get_new_requests()

Expected case is returning a request, just check for success and return
the request rather than having an error label.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b84ba30b 26-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove the gendisk argument to blk_execute_rq

Remove the gendisk aregument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now. Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f3fa33ac 26-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove the ->rq_disk field in struct request

Just use the disk attached to the request_queue instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 222ee581 25-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move the remaining elv.icq handling to the I/O scheduler

After the prepare side has been moved to the only I/O scheduler that
cares, do the same for the cleanup and the NULL initialization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5f480b1a 26-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: use bio->bi_opf after bio is checked

bio->bi_opf isn't finalized before checking the bio, so use it after
submit_bio_checks() returns.

Fixes: 5b13bc8a3fd5 ("blk-mq: cleanup request allocation")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b13bc8a 23-Nov-2021 Christoph Hellwig <hch@lst.de>

blk-mq: cleanup request allocation

Refactor the request alloction so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.

There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124062856.1444266-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 82d981d4 23-Nov-2021 Christoph Hellwig <hch@lst.de>

block: don't include <linux/part_stat.h> in blk.h

Not needed, shift it into the source files that need it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 35c90e6e 22-Nov-2021 Guo Zhengkui <guozhengkui@vivo.com>

blk_mq: remove repeated includes

Remove a repeated "#include<linux/sched/sysctl.h>".

Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211123063340.25882-1-guozhengkui@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a9d041b 13-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: move io_context creation into where it's needed

The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.

Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 48b5c1fb 13-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: only allocate poll_stats if there's a user of them

This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.

Kill the queue flag for this while at it, since we can just check the
assigned pointer now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1e9c2303 23-Nov-2021 Christoph Hellwig <hch@lst.de>

blk-mq: move more plug handling from blk_mq_submit_bio into blk_add_rq_to_plug

Keep all the functionality for adding a request to a plug in a single place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0c5bcc92 23-Nov-2021 Christoph Hellwig <hch@lst.de>

blk-mq: simplify the plug handling in blk_mq_submit_bio

blk_mq_submit_bio has two different plug cases, one that uses full
plugging and a limited plugging one.

The limited plugging case is only used for a corner case that does
not matter in real life:

- no ->commit_rqs (so not NVMe)
- no shared tags (so not SCSI)
- not rotational (so no old disk or floppy driver)
- must have multiple queues (so no eMMC)

Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0d7a29a2 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move blk_print_req_error to blk-mq.c

This function is only used by the request completion path. Factor out
a blk_status_to_str to keep blk_errors private in blk-core.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 22350ad7 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move blk_dump_rq_flags to blk-mq.c

blk_dump_rq_flags deals with a request, so move it to blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 450b7879 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move blk_account_io_{start,done} to blk-mq.c

These are only used for request based I/O, so move them where they are
used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f2b8f3ce 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move blk_steal_bios to blk-mq.c

Keep all the request based code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 52fdbbcc 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move blk_rq_init to blk-mq.c

blk_rq_init deals with a request structure, so move it to blk-mq.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 06c8c691 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: move request based cloning helpers to blk-mq.c

Keep all the request based code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b84c5b50 16-Nov-2021 Christoph Hellwig <hch@lst.de>

blk-mq: move blk_mq_flush_plug_list

Move blk_mq_flush_plug_list and blk_mq_plug_issue_direct down in blk-mq.c
to prepare for marking blk_mq_request_issue_directly static without the
need of a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4054cff9 16-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove blk-exec.c

All this code is tightly coupled to the blk-mq core, so move it
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-4-hch@lst.de
[axboe: remove doc generation for blk-exec.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 98b26a0e 26-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: call rq_qos_done() before ref check in batch completions

We need to call rq_qos_done() regardless of whether or not we're freeing
the request or not, as the reference count doesn't cover the IO completion
tracking.

Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2b504bd4 18-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: don't insert FUA request with data into scheduler queue

We never insert flush request into scheduler queue before.

Recently commit d92ca9d8348f ("blk-mq: don't handle non-flush requests in
blk_insert_flush") tries to handle FUA data request as normal request.
This way has caused warning[1] in mq-deadline dd_exit_sched() or io hang in
case of kyber since RQF_ELVPRIV isn't set for flush request, then
->finish_request won't be called.

Fix the issue by inserting FUA data request with blk_mq_request_bypass_insert()
when the device supports FUA, just like what we did before.

[1] https://lore.kernel.org/linux-block/CAHj4cs-_vkTW=dAzbZYGxpEWSpzpcmaNeY1R=vH311+9vMUSdg@mail.gmail.com/

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: d92ca9d8348f ("blk-mq: don't handle non-flush requests in blk_insert_flush")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211118153041.2163228-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2a19b28f 15-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()

For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().

However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.

Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():

1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.

2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.

[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 95febeb6 15-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: fix missing queue put in error path

If we fail the submission queue checks, we don't put the queue afterwards.
This can cause various issues like stalls on scheduler switch or failure
to remove the device, or like in the original bug report, timeout waiting
for the device on reboot/restart.

While in there, fix a few whitespace discrepancies in the surrounding
code.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215039
Fixes: b637108a4022 ("blk-mq: fix filesystem I/O request allocation")
Reported-and-tested-by: Stephen Smith <stephenmsmith@blueyonder.co.uk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b637108a 12-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: fix filesystem I/O request allocation

submit_bio_checks() may update bio->bi_opf, so we have to initialize
blk_mq_alloc_data.cmd_flags with bio->bi_opf after submit_bio_checks()
returns when allocating new request.

In case of using cached request, fallback to allocate new request if
cached rq isn't compatible with the incoming bio, otherwise change
rq->cmd_flags with incoming bio->bi_opf.

Fixes: 900e080752025f00 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b131f201 11-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: rename blk_attempt_bio_merge

It is very annoying to have two block layer functions which share same
name, so rename blk_attempt_bio_merge in blk-mq.c as
blk_mq_attempt_bio_merge.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9ef4d020 09-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: add one API for waiting until quiesce is done

Some drivers(NVMe, SCSI) need to call quiesce and unquiesce in pair, but it
is hard to switch to this style, so these drivers need one atomic flag for
helping to balance quiesce and unquiesce.

When quiesce is in-progress, the driver still needs to wait until
the quiesce is done, so add API of blk_mq_wait_quiesce_done() for
these drivers.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a846a8e6 08-Nov-2021 Ye Bin <yebin10@huawei.com>

blk-mq: don't free tags if the tag_set is used by other device in queue initialztion

We got UAF report on v5.10 as follows:
[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800] dump_stack+0x9b/0xce
[ 1446.682916] print_address_description.constprop.6+0x3e/0x60
[ 1446.685999] kasan_report.cold.9+0x22/0x3a
[ 1446.687186] blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785] blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576] __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279] blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967] __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561] __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407] blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593] blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309] blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408] blk_flush_plug_list+0x2c5/0x480
[ 1446.708471] blk_finish_plug+0x55/0xa0
[ 1446.708980] blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236] process_one_work+0x6d4/0xfe0
[ 1446.711778] worker_thread+0x91/0xc80
[ 1446.713400] kthread+0x32d/0x3f0
[ 1446.714362] ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509] kasan_save_stack+0x19/0x40
[ 1446.716026] __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673] blk_mq_init_tags+0x6d/0x330
[ 1446.717207] blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769] __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459] blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050] scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736] virtscsi_probe+0x7bf/0xbd0
[ 1446.720265] virtio_dev_probe+0x402/0x6c0
[ 1446.720808] really_probe+0x276/0xde0
[ 1446.721320] driver_probe_device+0x267/0x3d0
[ 1446.721892] device_driver_attach+0xfe/0x140
[ 1446.722491] __driver_attach+0x13a/0x2c0
[ 1446.723037] bus_for_each_dev+0x146/0x1c0
[ 1446.723603] bus_add_driver+0x3fc/0x680
[ 1446.724145] driver_register+0x1c0/0x400
[ 1446.724693] init+0xa2/0xe8
[ 1446.725091] do_one_initcall+0x9e/0x310
[ 1446.725626] kernel_init_freeable+0xc56/0xcb9
[ 1446.726231] kernel_init+0x11/0x198
[ 1446.726714] ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882] kasan_save_stack+0x19/0x40
[ 1446.728420] kasan_set_track+0x1c/0x30
[ 1446.728943] kasan_set_free_info+0x1b/0x30
[ 1446.729517] __kasan_slab_free+0x111/0x160
[ 1446.730084] kfree+0xb8/0x520
[ 1446.730507] blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206] blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844] blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540] blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155] scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730] scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281] scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916] __scsi_scan_target+0x208/0xc50
[ 1446.735500] scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149] scsi_scan_host_selected+0x25a/0x360
[ 1446.736783] store_scan+0x290/0x2d0
[ 1446.737275] dev_attr_store+0x55/0x80
[ 1446.737782] sysfs_kf_write+0x132/0x190
[ 1446.738313] kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921] new_sync_write+0x40e/0x5c0
[ 1446.739429] vfs_write+0x519/0x720
[ 1446.739877] ksys_write+0xf8/0x1f0
[ 1446.740332] do_syscall_64+0x2d/0x40
[ 1446.740802] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670] which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276] 256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102] ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090] ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065] ^
[ 1446.755589] ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574] ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================

Flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the
same host initializes it's queue successfully. However, if the second
device failed to allocate memory in blk_mq_alloc_and_init_hctx() from
blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(),
__blk_mq_free_map_and_rqs() will be called on error path, and if
'BLK_MQ_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed
while it's still used by the first device.

To fix this issue we move release newly allocated hardware context from
blk_mq_realloc_hw_ctxs to __blk_mq_update_nr_hw_queues. As there is needn't to
release hardware context in blk_mq_init_allocated_queue.

Fixes: 868f2f0b7206 ("blk-mq: dynamic h/w context count")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 10c47870 04-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: ensure cached plug request matches the current queue

If we're driving multiple devices, we could have pre-populated the cache
for a different device. Ensure that the empty request matches the current
queue.

Fixes: 47c122e35d7e ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 900e0807 03-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: move queue enter logic into blk_mq_submit_bio()

Retain the old logic for the fops based submit, but for our internal
blk_mq_submit_bio(), move the queue entering logic into the core
function itself.

We need to be a bit careful if going into the scheduler, as a scheduler
or queue mappings can arbitrarily change before we have entered the queue.
Have the bio scheduler mapping do that separately, it's a very cheap
operation compared to actually doing merging locking and lookups.

Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: update to check merge post submit_bio_checks() doing remap...]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 71539717 03-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: split request allocation components into helpers

This is in preparation for a fix, but serves as a cleanup as well moving
the cached vs regular alloc logic out of blk_mq_submit_bio().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c5fc7b93 03-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: have plug stored requests hold references to the queue

Requests that were stored in the cache deliberately didn't hold an enter
reference to the queue, instead we grabbed one every time we pulled a
request out of there. That made for awkward logic on freeing the remainder
of the cached list, if needed, where we had to artificially raise the
queue usage count before each free.

Grab references up front for cached plug requests. That's safer, and also
more efficient.

Fixes: 47c122e35d7e ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3b87c6ea 02-Nov-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: update hctx->nr_active in blk_mq_end_request_batch()

In case of shared tags and none io sched, batched completion still may
be run into, and hctx->nr_active is accounted when getting driver tag,
so it has to be updated in blk_mq_end_request_batch().

Otherwise, hctx->nr_active may become same with queue depth, then
hctx_may_queue() always return false, then io hang is caused.

Fixes the issue by updating the counter in batched way.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211102153619.3627505-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 781dd830 02-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: move RQF_ELV setting into allocators

It's not safe to do this before blk_queue_enter(), as the scheduler state
could have changed in between. Hence move the RQF_ELV setting into the
allocators, where we know the queue is already entered.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b2280909 01-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: replace always false argument with 'false'

A previous commit fixed up the condition for doing direct issue, but that
left the 'from_schedule' argument dead inside the branch. Replace it with
'false'.

Fixes: ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a22c00be 01-Nov-2021 Jens Axboe <axboe@kernel.dk>

block: assign correct tag before doing prefetch of request

Ensure that current tag is correctly assigned before attempting
to prefetch the first cacheline of the request.

Fixes: 92aff191cc5b ("block: prefetch request to be initialized")
Reported-and-tested-by: syzbot+cd20829ac44b92bf6ed0@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 02f7eab0 28-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: improve readability of blk_mq_end_request_batch()

It's faster and easier to read if we tolerate cur_hctx being NULL in
the "when to flush" condition. Rename last_hctx to cur_hctx while at it,
as it better describes the role of that variable.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c7b84d42 19-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: re-flow blk_mq_rq_ctx_init()

Now that we have flags passed in, we can do a final re-arrange of the
flow of blk_mq_rq_ctx_init() so we're always writing request in the
order in which it is laid out.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 92aff191 19-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: prefetch request to be initialized

Now we have the tags available in __blk_mq_alloc_requests_batch(), we
can start fetching the first request cacheline before calling into the
request initialization.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fe6134f6 19-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: pass in blk_mq_tags to blk_mq_rq_ctx_init()

Instead of getting this from data for every invocation of request
initialization, pass it in as an argument instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 56f8da64 19-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: add rq_flags to struct blk_mq_alloc_data

There's a hole here we can use, and it's faster to set this earlier
rather than need to check q->elevator multiple times.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ff155223 26-Oct-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: don't issue request directly in case that current is to be blocked

When flushing plug list in case that current will be blocked, we can't
issue request directly because ->queue_rq() may sleep, otherwise scheduler
may complain.

Fixes: dc5fc361d891 ("block: attempt direct issue of plug list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026082257.2889890-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 297db731 22-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: fix req_bio_endio append error handling

Shinichiro Kawasaki reports that there is a bug in a recent
req_bio_endio() patch causing problems with zonefs. As Shinichiro
suggested, inverse the condition in zone append path to resemble how it
was before: fail when it's not fully completed.

Fixes: 478eb72b815f3 ("block: optimise req_bio_endio()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/344ea4e334aace9148b41af5f2426da38c8aa65a.1634914228.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 179ae84f 20-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: clean up blk_mq_submit_bio() merging

Combine blk_mq_sched_bio_merge() and blk_attempt_plug_merge() under a
common if, so we don't check it twice.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/daedc90d4029a5d1d73344771632b1faca3aaf81.1634755800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a214b949 20-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: only flush requests from the plug in blk_mq_submit_bio

Replace the call to blk_flush_plug_list in blk_mq_submit_bio with a
direct call to blk_mq_flush_plug_list. This means we do not flush
plug callback from stackable devices, which doesn't really help with
the accumulated requests anyway, and it also means the cached requests
aren't freed here as they can still be used later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 037057a5 20-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: remove inaccurate requeue check

This check is meant to catch cases where a requeue is attempted on a
request that is still inserted. It's never really been useful to catch any
misuse, and now it's actively wrong. Outside of that, this should not be a
BUG_ON() to begin with.

Remove the check as it's now causing active harm, as requeue off the plug
path will trigger it even though the request state is just fine.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 478eb72b 19-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: optimise req_bio_endio()

First, get rid of an extra branch and chain error checks. Also reshuffle
it with bio_advance(), so it goes closer to the final check, with that
the compiler loads rq->rq_flags only once, and also doesn't reload
bio->bi_iter.bi_size if bio_advance() didn't actually advanced the iter.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e70feb8b 14-Oct-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: support concurrent queue quiesce/unquiesce

blk_mq_quiesce_queue() has been used a bit wide now, so far we don't support
concurrent/nested quiesce. One biggest issue is that unquiesce can happen
unexpectedly in case that quiesce/unquiesce are run concurrently from
more than one context.

This patch introduces q->mq_quiesce_depth to deal concurrent quiesce,
and we only unquiesce queue when it is the last/outer-most one of all
contexts.

Several kernel panic issue has been reported[1][2][3] when running stress
quiesce test. And this patch has been verified in these reports.

[1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m1fc52431fad7f33b1ffc3f12c4450e4238540787
[2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m10ad90afeb9c8cc318334190a7c24c8b5c5e0722
[3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.html

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a808a9d5 13-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: inline fast path of driver tag allocation

If we don't use an IO scheduler or have shared tags, then we don't need
to call into this external function at all. This saves ~2% for such
a setup.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d92ca9d8 19-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: don't handle non-flush requests in blk_insert_flush

Return to the normal blk_mq_submit_bio flow if the bio did not end up
actually being a flush because the device didn't support it. Note that
this is basically impossible to hit without special instrumentation given
that submit_bio_checks already clears these flags usually, so we'd need a
tight race to actually hit this code path.

With this the call to blk_mq_run_hw_queue for the flush requests can be
removed given that the actual flush requests are always issued via the
requeue workqueue which runs the queue unconditionally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019122553.2467817-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dc5fc361 19-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: attempt direct issue of plug list

If we have just one queue type in the plug list, then we can extend our
direct issue to cover a full plug list as well. This allows sending a
batch of requests for direct issue, which is more efficient than doing
one-at-a-time kind of issue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bc490f81 18-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: change plugging to use a singly linked list

Use a singly linked list for the blk_plug. This saves 8 bytes in the
blk_plug struct, and makes for faster list manipulations than doubly
linked lists. As we don't use the doubly linked lists for anything,
singly linked is just fine.

This yields a bump in default (merging enabled) performance from 7.0
to 7.1M IOPS, and ~7.5M IOPS with merging disabled.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e028f167 16-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: move blk_mq_tag_to_rq() inline

This is in the fast path of driver issue or completion, and it's a single
array index operation. Move it inline to avoid a function call for it.

This does mean making struct blk_mq_tags block layer public, but there's
not really much in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# df87eb0f 18-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: get rid of plug list sorting

Even if we have multiple queues in the plug list, chances that they
are very interspersed is minimal. Don't bother spending CPU cycles
sorting the list.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 87c037d1 18-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: return whether or not to unplug through boolean

Instead of returning the same queue request through a request pointer,
use a boolean to accomplish the same.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8a7d267b 18-Oct-2021 Christoph Hellwig <hch@lst.de>

block: don't call blk_status_to_errno in blk_update_request

We only need to call it to resolve the blk_status_t -> errno mapping for
tracing, so move the conversion into the tracepoints that are not called
at all when tracing isn't enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e0d78afe 18-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: fix too broad elevator check in blk_mq_free_request()

We added RQF_ELV to tell whether there's an IO scheduler attached, and
RQF_ELVPRIV tells us whether there's an IO scheduler with private data
attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
about here is just if we have scheduler private data attached.

This fixes a boot crash

Fixes: 2ff0682da6e0 ("block: store elevator state in request")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f794f335 08-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: add support for blk_mq_end_request_batch()

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a72e899 12-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: add a struct io_comp_batch argument to fops->iopoll()

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 013a7f95 13-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: provide helpers for rq_list manipulation

Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# afd7de03 18-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: remove some blk_mq_hw_ctx debugfs entries

Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionnable value. They
are:

- dispatched, index of how many requests were dispatched in one go

- poll_{considered,invoked,success}, which track poll sucess rates. We're
confident in the iopoll implementation at this point, don't bother
tracking these.

As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9a14d6ce 16-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes

These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.

As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 12845906 18-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: cache rq_flags inside blk_mq_rq_ctx_init()

Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 605f784e 18-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: blk_mq_rq_ctx_init cache ctx/q/hctx

We should have enough of registers in blk_mq_rq_ctx_init(), store them
in local vars, so we don't keep reloading them.

note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f266f2b 18-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: skip elevator fields init for non-elv queue

Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise other the
rest of them.

note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2ff0682d 15-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: store elevator state in request

Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
deference rq->q later on.

Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9be3e06f 14-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: move update request helpers into blk-mq.c

For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes and out-of-line call
for each completion.

Move them into blk-mq.c instead, where they belong.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ed6cddef 14-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

block: convert the rest of block to bdev_get_queue

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# abd45c15 13-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: handle fast path of bio splitting inline

The fast path is no splitting needed. Separate the handling into a
check part we can inline, and an out-of-line handling path if we do
need to split.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3e08773c 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6ce913fe 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: rename REQ_HIPRI to REQ_POLLED

Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d729cf9a 12-Oct-2021 Christoph Hellwig <hch@lst.de>

io_uring: don't sleep when polling for I/O

There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ef99b2d3 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: replace the spin argument to blk_iopoll with a flags argument

Switch the boolean spin argument to blk_poll to passing a set of flags
instead. This will allow to control polling behavior in a more fine
grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 28a1ae6b 12-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_qc_t_valid

Move the trivial check into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# efbabbe1 12-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal

Merge both functions into their only caller to keep the blk-mq tag to
blk_qc_t mapping as private as possible in blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6699d6f 12-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a "classic" poll helper

Factor the code to do the classic full metal polling out of blk_poll into
a separate blk_mq_poll_classic helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f70299f0 12-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a blk_qc_to_hctx helper

Add a helper to get the hctx from a request_queue and cookie, and fold
the blk_qc_t_to_queue_num helper into it as no other callers are left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 349302da 09-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: improve batched tag allocation

Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation, and it is also more efficient than just
doing repeated calls into __sbitmap_queue_get().

This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single
CPU core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8971a3b7 13-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: optimise *end_request non-stat path

We already have a blk_mq_need_time_stamp() check in
__blk_mq_end_request() to get a timestamp, hide all the statistics
accounting under it. It cuts some cycles for requests that don't need
stats, and is free otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4a60f360 16-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: don't dereference request after flush insertion

We could have a race here, where the request gets freed before we call
into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
of the request.

Grab the hardware context before inserting the flush.

Fixes: 0f38d7664615 ("blk-mq: cleanup blk_mq_submit_bio")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0f38d766 11-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: cleanup blk_mq_submit_bio

Move the blk_mq_alloc_data stack allocation only into the branch
that actually needs it, and use rq->mq_hctx instead of data.hctx
to refer to the hctx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b90cfaed 11-Oct-2021 Christoph Hellwig <hch@lst.de>

blk-mq: cleanup and rename __blk_mq_alloc_request

The newly added loop for the cached requests in __blk_mq_alloc_request
is a little too convoluted for my taste, so unwind it a bit. Also
rename the function to __blk_mq_alloc_requests now that it can allocate
more than a single request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 47c122e3 06-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: pre-allocate requests if plug is started and is a batch

The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.

Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.

The blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.

This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ba0ffdd8 06-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: bump max plugged deferred size from 16 to 32

Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.

Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.

While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 00067077 05-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: inherit request start time from bio for BLK_CGROUP

Doing high IOPS testing with blk-cgroups enabled spends ~15-20% of the
time just doing ktime_get_ns() -> readtsc. We essentially read and
set the start time twice, one for the bio and then again when that bio
is mapped to a request.

Given that the time between the two is very short, inherit the bio
start time instead of reading it again. This cuts 1/3rd of the overhead
of the time keeping.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 079a2e3e 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Change shared sbitmap naming to shared tags

Now that shared sbitmap support really means shared tags, rename symbols
to match that.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ae0f1a73 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Stop using pointers for blk_mq_tags bitmap tags

Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.

This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").

Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e155b0c2 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Use shared tags for shared sbitmap support

Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.

However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static
rqs.

Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 645db34e 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()

Refactor blk_mq_free_map_and_requests() such that it can be used at many
sites at which the tag map and rqs are freed.

Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the
alloc equivalent.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 63064be1 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Add blk_mq_alloc_map_and_rqs()

Add a function to combine allocating tags and the associated requests,
and factor out common patterns to use this new function.

Some function only call blk_mq_alloc_map_and_rqs() now, but more
functionality will be added later.

Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they
are only used in blk-mq.c, and finally rename some functions for
conciseness and consistency with other function names:
- __blk_mq_alloc_map_and_{request -> rqs}()
- blk_mq_alloc_{map_and_requests -> set_map_and_rqs}()

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a7e7388d 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()

Put the functionality to update the sched shared sbitmap size in a common
function.

Since the same formula is always used to resize, and it can be got from
the request queue argument, so just pass the request queue pointer.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-10-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f245d5b 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Don't clear driver tags own mapping

Function blk_mq_clear_rq_mapping() is required to clear the sched tags
mappings in driver tags rqs[].

But there is no need for a driver tags to clear its own mapping, so skip
clearing the mapping in this scenario.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-9-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f32e4eaf 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Pass driver tags to blk_mq_clear_rq_mapping()

Function blk_mq_clear_rq_mapping() will be used for shared sbitmap tags
in future, so pass a driver tags pointer instead of the tagset container
and HW queue index.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-8-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f6adcef5 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Invert check in blk_mq_update_nr_requests()

It's easier to read:

if (x)
X;
else
Y;

over:

if (!x)
Y;
else
X;

No functional change intended.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-5-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8fa04464 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests()

For shared sbitmap, if the call to blk_mq_tag_update_depth() was
successful for any hctx when hctx->sched_tags is not set, then it would be
successful for all (due to nature in which blk_mq_tag_update_depth()
fails).

As such, there is no need to call blk_mq_tag_resize_shared_sbitmap() for
each hctx. So relocate the call until after the hctx iteration under the
!q->elevator check, which is equivalent (to !hctx->sched_tags).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 65de57bb 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Change rqs check in blk_mq_free_rqs()

The original code in commit 24d2f90309b23 ("blk-mq: split out tag
initialization, support shared tags") would check tags->rqs is non-NULL and
then dereference tags->rqs[].

Then in commit 2af8cbe30531 ("blk-mq: split tag ->rqs[] into two"), we
started to dereference tags->static_rqs[], but continued to check non-NULL
tags->rqs.

Check tags->static_rqs as non-NULL instead, which is more logical.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fe45e630 20-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move integrity handling out of <linux/blkdev.h>

Split the integrity/metadata handling definitions out into a new header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e41d12f5 20-Sep-2021 Christoph Hellwig <hch@lst.de>

mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>

There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9586e67b 26-Oct-2021 Naohiro Aota <naohiro.aota@wdc.com>

block: schedule queue restart after BLK_STS_ZONE_RESOURCE

When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.

Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.

Also, consolidate the checks into the "need_resource" variable to simplify
the condition.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aec89dc5 29-Sep-2021 Christoph Hellwig <hch@lst.de>

block: keep q_usage_counter in atomic mode after del_gendisk

Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices. After removing the disk only passthrough
commands can be send anyway.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7f2a6a69 07-Sep-2021 Song Liu <songliubraving@fb.com>

blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues

Limiting number of request to BLK_MAX_REQUEST_COUNT at blk_plug hurts
performance for large md arrays. [1] shows resync speed of md array drops
for md array with more than 16 HDDs.

Fix this by allowing more request at plug queue. The multiple_queue flag
is used to only apply higher limit to multiple queue cases.

[1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/
Tested-by: Marcin Wanat <marcin.wanat@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1e61c1a8 29-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: block: Remove the remaining SG_IO-related fields from struct request_queue

Move the sg_timeout and sg_reserved_size fields into the bsg_device and
scsi_device structures as they have nothing to do with generic block I/O.
Note that these values are now separate for bsg vs. SCSI device node
access, but that just matches how /dev/sg vs the other nodes has always
behaved.

Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# d152c682 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: add an explicit ->disk backpointer to the request_queue

Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.

Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4a1fa41d 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: pass a request_queue to __blk_alloc_disk

Pass in a request_queue and assign disk->queue in __blk_alloc_disk to
ensure struct gendisk always has a valid ->queue pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a58bd768 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove the minors argument to __alloc_disk_node

This was a leftover from the legacy alloc_disk interface. Switch
the scsi ULPs and dasd to set ->minors directly like all other
drivers and remove the argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd]
Link: https://lore.kernel.org/r/20210816131910.615153-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4dcc4874 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: cleanup the lockdep handling in *alloc_disk

Pass the lockdep name to the low-level __blk_alloc_disk helper and
hardcode the name for it given that the number of minors or node_id
are not very useful information. While this passes a pointless
argument for non-lockdep builds that is not really an issue as
disk allocation is a probe time only slow path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 91cc470e 02-Jun-2021 Tanner Love <tannerlove@google.com>

genirq: Change force_irqthreads to a static key

With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
could incur a cache line miss in invoke_softirq() and other places.

Replace the test with a static key to avoid the potential cache miss.

[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tanner Love <tannerlove@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com


# edb0872f 09-Aug-2021 Christoph Hellwig <hch@lst.de>

block: move the bdi from the request_queue to the gendisk

The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a9ed27a7 17-Aug-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: fix is_flush_rq

is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the
following check:

hctx->fq->flush_rq == req

but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because:

1) memory re-order in blk_mq_rq_ctx_init():

rq->mq_hctx = data->hctx;
...
refcount_set(&rq->ref, 1);

OR

2) tag re-use and ->rqs[] isn't updated with new request.

Fix the issue by re-writing is_flush_rq() as:

return rq->end_io == flush_end_io;

which turns out simpler to follow and immune to data race since we have
ordered WRITE rq->end_io and refcount_set(&rq->ref, 1).

Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Cc: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c797b40c 11-Aug-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: don't grab rq's refcount in blk_mq_check_expired()

Inside blk_mq_queue_tag_busy_iter() we already grabbed request's
refcount before calling ->fn(), so needn't to grab it one more time
in blk_mq_check_expired().

Meantime remove extra request expire check in blk_mq_check_expired().

Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 454bb677 31-Jul-2021 Yu Kuai <yukuai3@huawei.com>

blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED

We run a test that delete and recover devcies frequently(two devices on
the same host), and we found that 'active_queues' is super big after a
period of time.

If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
is only one queue that are using the tag set. However, if b is still
active, the active_queues of b might never be cleared even if b is
deleted.

Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5ec780a6 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: mark blk_mq_init_queue_data static

All driver uses are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210624081012.256464-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cb9516be 24-Jun-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: update hctx->dispatch_busy in case of real scheduler

Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
starts to support io batching submission by using hctx->dispatch_busy.

However, blk_mq_update_dispatch_busy() isn't changed to update hctx->dispatch_busy
in that commit, so fix the issue by updating hctx->dispatch_busy in case
of real scheduler.

Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 08c1d480 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_init_sq_queue

All users are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b461dfc4 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: add the blk_mq_alloc_disk APIs

Add a new API to allocate a gendisk including the request_queue for use
with blk-mq based drivers. This is to avoid boilerplate code in drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 26a9750a 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: improve the blk_mq_init_allocated_queue interface

Don't return the passed in request_queue but a normal error code, and
drop the elevator_init argument in favor of just calling elevator_init_mq
directly from dm-rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cdb14e0f 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a blk_mq_alloc_sq_tag_set helper

Factour out a helper to initialize a simple single hw queue tag_set from
blk_mq_init_sq_queue. This will allow to phase out blk_mq_init_sq_queue
in favor of a more symmetric and general API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 61347154 02-Jun-2021 Jan Kara <jack@suse.cz>

block: Do not pull requests from the scheduler when we cannot dispatch them

Provided the device driver does not implement dispatch budget accounting
(which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
requests from the IO scheduler as long as it is willing to give out any.
That defeats scheduling heuristics inside the scheduler by creating
false impression that the device can take more IO when it in fact
cannot.

For example with BFQ IO scheduler on top of virtio-blk device setting
blkio cgroup weight has barely any impact on observed throughput of
async IO because __blk_mq_do_dispatch_sched() always sucks out all the
IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
when that is all dispatched, it will give out IO of lower weight cgroups
as well. And then we have to wait for all this IO to be dispatched to
the disk (which means lot of it actually has to complete) before the
IO scheduler is queried again for dispatching more requests. This
completely destroys any service differentiation.

So grab request tag for a request pulled out of the IO scheduler already
in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
cannot get it because we are unlikely to be able to dispatch it. That
way only single request is going to wait in the dispatch list for some
tag to free.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d97e594c 13-May-2021 John Garry <john.garry@huawei.com>

blk-mq: Use request queue-wide tags for tagset-wide sbitmap

The tags used for an IO scheduler are currently per hctx.

As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.

This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.

Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.

Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).

For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 56b68085 13-May-2021 John Garry <john.garry@huawei.com>

blk-mq: Some tag allocation code refactoring

The tag allocation code to alloc the sbitmap pairs is common for regular
bitmaps tags and shared sbitmap, so refactor into a common function.

Also remove superfluous "flags" argument from blk_mq_init_shared_sbitmap().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 364b6181 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: clearing flush request reference in tags->rqs[]

Before we free request queue, clearing flush request reference in
tags->rqs[], so that potential UAF can be avoided.

Based on one patch written by David Jeffery.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bd63141d 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: clear stale request in tags->rq[] before freeing one request pool

refcount_inc_not_zero() in bt_tags_iter() still may read one freed
request.

Fix the issue by the following approach:

1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero in bt_tags_iter()

2) clearing stale request referred via ->rqs[tag] before freeing
request pool, the per-tags spinlock is held for clearing stale
->rq[tag]

So after we cleared stale requests, bt_tags_iter() won't observe
freed request any more, also the clearing will wait for pending
request reference.

The idea of clearing ->rqs[] is borrowed from John Garry's previous
patch and one recent David's patch.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2e315dc0 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter

Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(), and
this way will prevent the request from being re-used when ->fn is
running. The approach is same as what we do during handling timeout.

Fix request use-after-free(UAF) related with completion race or queue
releasing:

- If one rq is referred before rq->q is frozen, then queue won't be
frozen before the request is released during iteration.

- If one rq is referred after rq->q is frozen, refcount_inc_not_zero()
will return false, and we won't iterate over this request.

However, still one request UAF not covered: refcount_inc_not_zero() may
read one freed request, and it will be handled in next patch.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2f064a59 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Change task_struct::state

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org


# d6c23bb3 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Add get_current_state()

Remove yet another few p->state accesses.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org


# b03fbd4f 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Introduce task_is_running()

Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org


# 630ef623 13-May-2021 Bart Van Assche <bvanassche@acm.org>

blk-mq: Swap two calls in blk_mq_exit_queue()

If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30e4 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 03f26d8f 13-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: plug request for shared sbitmap

In case of shared sbitmap, request won't be held in plug list any more
sine commit 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per
tagset"), this way makes request merge from flush plug list & batching
submission not possible, so cause performance regression.

Yanhui reports performance regression when running sequential IO
test(libaio, 16 jobs, 8 depth for each job) in VM, and the VM disk
is emulated with image stored on xfs/megaraid_sas.

Fix the issue by recovering original behavior to allow to hold request
in plug list.

Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8d663f34 14-Apr-2021 Lin Feng <linf@wangsu.com>

blk-mq: bypass IO scheduler's limit_depth for passthrough request

Commit 01e99aeca39796003 ("blk-mq: insert passthrough request into
hctx->dispatch directly") gives high priority to passthrough requests and
bypass underlying IO scheduler. But as we allocate tag for such request it
still runs io-scheduler's callback limit_depth, while we really want is to
give full sbitmap-depth capabity to such request for acquiring available
tag.
blktrace shows PC requests(dmraid -s -c -i) hit bfq's limit_depth:
8,0 2 0 0.000000000 39952 1,0 m N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
8,0 2 1 0.000008134 39952 D R 4 [dmraid]
8,0 2 2 0.000021538 24 C R [0]
8,0 2 0 0.000035442 39952 1,0 m N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
8,0 2 3 0.000038813 39952 D R 24 [dmraid]
8,0 2 4 0.000044356 24 C R [0]

This patch introduce a new wrapper to make code not that ugly.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2a5a24aa 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: blk-mq: Return budget token from .get_budget callback

SCSI uses a global atomic variable to track queue depth for each
LUN/request queue.

This doesn't scale well when there are lots of CPU cores and the disk is
very fast. It has been observed that IOPS is affected a lot by tracking
queue depth via sdev->device_busy in the I/O path.

Return budget token from .get_budget callback. The budget token can be
passed to driver so that we can replace the atomic variable with
sbitmap_queue and alleviate the scaling problems that way.

Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# c548e62b 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Move allocation hint into sbitmap

Allocation hint should have belonged to sbitmap. Also, when sbitmap's depth
is high and there is no need to use mulitple wakeup queues, user can
benefit from percpu allocation hint too.

Move allocation hint into sbitmap, then SCSI device queue can benefit from
allocation hint when converting to plain sbitmap.

Convert vhost/scsi.c to use sbitmap allocation with percpu alloc hint. This
is more efficient than the previous approach.

Link: https://lore.kernel.org/r/20210122023317.687987-5-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# efe1f3a1 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Maintain allocation round_robin in sbitmap

Currently the allocation round_robin info is maintained by sbitmap_queue.

However, bit allocation really belongs to sbitmap. Move it there.

Link: https://lore.kernel.org/r/20210122023317.687987-3-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# 4f0f586b 08-Apr-2021 Sami Tolvanen <samitolvanen@google.com>

treewide: Change list_sort to use const pointers

list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com


# f9ab4918 23-Jan-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Use llist_head for blk_cpu_done

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU without additional locking.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise sofirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions.

blk_mq_raise_softirq() needs a preempt-disable section to ensure the
request is enqueued on the same CPU as the softirq is raised.
Some callers (USB-storage) invoke this path in preemptible context.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a2efafb 23-Jan-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Always complete remote completions requests in softirq

Controllers with multiple queues have their IRQ-handelers pinned to a
CPU. The core shouldn't need to complete the request on a remote CPU.

Remove this case and always raise the softirq to complete the request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b6e68ee8 11-Jan-2021 Jan Kara <jack@suse.cz>

blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues

Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance it rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

clients MQ SQ MQ-Patched
Amean 1 39.12 (0.00%) 43.29 * -10.67%* 36.09 * 7.74%*
Amean 2 128.58 (0.00%) 101.30 * 21.22%* 96.14 * 25.23%*
Amean 4 577.42 (0.00%) 494.47 * 14.37%* 508.49 * 11.94%*
Amean 8 610.95 (0.00%) 363.86 * 40.44%* 362.12 * 40.73%*
Amean 16 391.78 (0.00%) 261.49 * 33.25%* 282.94 * 27.78%*
Amean 32 324.64 (0.00%) 267.71 * 17.54%* 233.00 * 28.23%*
Amean 64 295.04 (0.00%) 253.02 * 14.24%* 242.37 * 17.85%*
Amean 512 10281.61 (0.00%) 10211.16 * 0.69%* 10447.53 * -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5ac83c64 11-Jan-2021 Jan Kara <jack@suse.cz>

Revert "blk-mq, elevator: Count requests per hctx to improve performance"

This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.

Since both mq-deadline and BFQ completely ignore hctx they are passed to
their dispatch function and dispatch whatever request they deem fit
checking whether any request for a particular hctx is queued is just
pointless since we'll very likely get a request from a different hctx
anyway. In the following commit we'll deal with lock contention in these
IO schedulers in presence of multiple HW queues in a different way.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 309dca30 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 71425189 04-Dec-2020 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Don't complete on a remote CPU in force threaded mode

With force threaded interrupts enabled, raising softirq from an SMP
function call will always result in waking the ksoftirqd thread. This is
not optimal given that the thread runs at SCHED_OTHER priority.

Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
the force threaded mode) is bad because the completion handler may
acquire sleeping locks which violate the locking context.

Disable request completing on a remote CPU in force threaded mode.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e6582cb5 30-Nov-2020 Daniel Wagner <dwagner@suse.de>

blk-mq: Remove 'running from the wrong CPU' warning

It's guaranteed that no request is in flight when a hctx is going
offline. This warning is only triggered when the wq's CPU is hot
plugged and the blk-mq is not synced up yet.

As this state is temporary and the request is still processed
correctly, better remove the warning as this is the fast path.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fa94ba8a 04-Dec-2020 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: fix msec comment from micro to milli seconds

Delay to wait for queue running is milli second unit which is passed to
delayed work via msecs_to_jiffies() which is to convert milliseconds to
jiffies.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 91cdf265 04-Dec-2020 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: add helper allocating tagset->tags

tagset->set is allocated from blk_mq_alloc_tag_set() rather than being
reallocated. This patch added a helper to make its meaning explicitly
which is to allocate rather than to reallocate.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a4d34da7 08-Dec-2020 Bart Van Assche <bvanassche@acm.org>

scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT

Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
used by any kernel code.

Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# 0854bcdc 08-Dec-2020 Bart Van Assche <bvanassche@acm.org>

scsi: block: Introduce BLK_MQ_REQ_PM

Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
functions set RQF_PM. This is the first step towards removing
BLK_MQ_REQ_PREEMPT.

Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Can Guo <cang@codeaurora.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# cc29e1bf 26-Nov-2020 Jeffle Xu <jefflexu@linux.alibaba.com>

block: disable iopoll for split bio

iopoll is initially for small size, latency sensitive IO. It doesn't
work well for big IO, especially when it needs to be split to multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.

Besides bio splitting may cause more trouble for iopoll which isn't
supposed to be used in case of big IO.

iopoll for split bio may cause potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on timeout
mechanism, thus causing a potential hang.

iopoll for split bio may also cause hang for sync polling. Currently
both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
in direct IO routine. These routines will submit bio without REQ_NOWAIT
flag set, and then start sync polling in current process context. The
process may hang in blk_mq_get_tag() if the submitted bio has to be
split into multiple bios and can rapidly exhaust the queue depth. The
process are waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, and thus
causing a deadlock.

To avoid these subtle trouble described above, just disable iopoll for
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable
since the returned cookie is never used for non-HIPRI IO.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f6f371f7 06-Dec-2020 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: skip hybrid polling if iopoll doesn't spin

If blk_poll() is not going to spin (i.e. @spin=false), it also must not
sleep in hybrid polling, otherwise it might be pretty suprising for
users trying to do a quick check and expecting no-wait behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a54895fa 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue to argument request based tracepoints

The request_queue can trivially be derived from the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e8a676d6 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: simplify and extend the block_bio_merge tracepoint class

The block_bio_merge tracepoint class can be reused for most bio-based
tracepoints. For that it just needs to lose the superfluous q and rq
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b0d97557 02-Dec-2020 Jeffle Xu <jefflexu@linux.alibaba.com>

block: fix inflight statistics of part0

The inflight of partition 0 doesn't include inflight IOs to all
sub-partitions, since currently mq calculates inflight of specific
partition by simply camparing the value of the partition pointer.

Thus the following case is possible:

$ cat /sys/block/vda/inflight
       0        0
$ cat /sys/block/vda/vda1/inflight
       0      128

While single queue device (on a previous version, e.g. v3.10) has no
this issue:

$cat /sys/block/sda/sda3/inflight
0 33
$cat /sys/block/sda/inflight
0 33

Partition 0 should be specially handled since it represents the whole
disk. This issue is introduced since commit bf0ddaba65dd ("blk-mq: fix
sysfs inflight counter").

Besides, this patch can also fix the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
doesn't include that of sub partitions. (I have marked the 'inflight'
field with asterisk.)

$cat /proc/diskstats
259 0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
259 2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0

This is introduced since commit f299b7c7a9de ("blk-mq: provide internal
in-flight variant").

Fixes: bf0ddaba65dd ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8446fe92 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: switch partition lookup to use struct block_device

Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 545b8c8d 15-Jun-2020 Peter Zijlstra <peterz@infradead.org>

smp: Cleanup smp_call_function*()

Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


# e21ee5a6 30-Sep-2020 Hannes Reinecke <hare@suse.de>

scsi: block: Return status code in blk_mq_end_request()

blk_mq_end_request() will use the block status returned from queue_rq() as
argument, except in one instance in blk_mq_dispatch_rq_list(), where the
generic BLK_STS_IOERR is used.

Link: https://lore.kernel.org/r/20200930080256.90964-2-hare@suse.de
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# 24f7bb88 23-Oct-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

block: blk-mq: fix a kernel-doc markup

Fix a typo:
blk_mq_run_hw_queue -> blk_mq_run_hw_queues

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 576e85c5 19-Oct-2020 Xianting Tian <tian.xianting@h3c.com>

blk-mq: remove the calling of local_memory_node()

We don't need to check whether the node is memoryless numa node before
calling allocator interface. SLUB(and SLAB,SLOB) relies on the page
allocator to pick a node. Page allocator should deal with memoryless
nodes just fine. It has zonelists constructed for each possible nodes.
And it will automatically fall back into a node which is closest to the
requested node. As long as __GFP_THISNODE is not enforced of course.

The code comments of kmem_cache_alloc_node() of SLAB also showed this:
* Fallback to other node is possible if __GFP_THISNODE is not set.

blk-mq code doesn't set __GFP_THISNODE, so we can remove the calling
of local_memory_node().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0841031a 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

blk-mq: use helper function to test hw stopped

We have introduced helper function blk_mq_hctx_stopped() to test
BLK_MQ_S_STOPPED.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 681cc5e8 07-Oct-2020 Mike Snitzer <snitzer@redhat.com>

dm: fix request-based DM to not bounce through indirect dm_submit_bio

It is unnecessary to force request-based DM to call into bio-based
dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then
call blk_mq_submit_bio().

Fix this by establishing a request-based DM block_device_operations
(dm_rq_blk_dops, which doesn't have .submit_bio) and update
dm_setup_md_queue() to set md->disk->fops to it for
DM_TYPE_REQUEST_BASED.

Remove DM_TYPE_REQUEST_BASED conditional in dm_submit_bio and unexport
blk_mq_submit_bio.

Fixes: c62b37d96b6eb ("block: move ->make_request_fn to struct block_device_operations")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# a926c7af 06-Oct-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

block: Consider only dispatched requests for inflight statistic

According to Documentation/block/stat.rst, inflight should not include
I/O requests that are in the queue but not yet dispatched to the device,
but blk-mq identifies as inflight any request that has a tag allocated,
which, for queues without elevator, happens at request allocation time
and before it is queued in the ctx (default case in blk_mq_submit_bio).

In addition, current behavior is different for queues with elevator from
queues without it, since for the former the driver tag is allocated at
dispatch time. A more precise approach would be to only consider
requests with state MQ_RQ_IN_FLIGHT.

This effectively reverts commit 6131837b1de6 ("blk-mq: count allocated
but not started requests in iostats inflight") to consolidate blk-mq
behavior with itself (elevator case) and with original documentation,
but it differs from the behavior used by the legacy path.

This version differs from v1 by using blk_mq_rq_state to access the
state attribute. Avoid using blk_mq_request_started, which was
suggested, since we don't want to include MQ_RQ_COMPLETE.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 93f221ae 15-Sep-2020 Eric Biggers <ebiggers@google.com>

block: make blk_crypto_rq_bio_prep() able to fail

blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c.

This case isn't currently reachable with a bio that actually has an
encryption context. However, it's fragile to rely on this. Just make
blk_crypto_rq_bio_prep() able to fail.

Suggested-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 632bfb63 05-Sep-2020 yangerkun <yangerkun@huawei.com>

blk-mq: call commit_rqs while list empty but error happen

Blk-mq should call commit_rqs once 'bd.last != true' and no more
request will come(so virtscsi can kick the virtqueue, e.g.). We already
do that in 'blk_mq_dispatch_rq_list/blk_mq_try_issue_list_directly' while
list not empty and 'queued > 0'. However, we can seen the same scene
once the last request in list call queue_rq and return error like
BLK_STS_IOERR which will not requeue the request, and lead that list
empty but need call commit_rqs too(Or the request for virtscsi will stay
timeout until other request kick virtqueue).

We found this problem by do fsstress test with offline/online virtscsi
device repeat quickly.

Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8229cca8 25-Sep-2020 Xianting Tian <tian.xianting@h3c.com>

blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.

Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;

set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);

2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}

Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1

So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 28500850 11-Sep-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: always allow reserved allocation in hctx_may_queue

NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.

Tags can be reserved in these tagset. Before error recovery, there is
often lots of in-flight requests which can't be completed, and new
reserved request may be needed in error recovery path. However,
hctx_may_queue() can always return false because there is too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b445547e 19-Aug-2020 Kashyap Desai <kashyap.desai@broadcom.com>

blk-mq, elevator: Count requests per hctx to improve performance

High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.

It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.

The elevator callback .has_work for mq-deadline and bfq scheduler considers
pending work if there are any IOs on request queue but it does not account
hctx context.

Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.

[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f1b49fdc 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap

For when using a shared sbitmap, no longer should the number of active
request queues per hctx be relied on for when judging how to share the tag
bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgement based on that.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bccf5e26 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Record nr_active_requests per queue for when using shared sbitmap

The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue for when using a shared sbitmap, as it does
not consider that the tags are shared tags over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.

Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 32bc15af 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Facilitate a shared sbitmap per tagset

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 222a5ae0 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Use pointers for blk_mq_tags bitmap tags

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1c0706a7 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Pass flags for tag init/free

Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.

For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 655ac300 19-Aug-2020 Hannes Reinecke <hare@suse.de>

blk-mq: Rename blk_mq_update_tag_set_depth()

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.

So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

[jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 51db1c37 19-Aug-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

BLK_MQ_F_TAG_SHARED actually means that tags is shared among request
queues, all of which should belong to LUNs attached to same HBA.

So rename it to make the point explicitly.

[jpg: rebase a few times, add rnbd-clt.c change]

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# db03f88f 18-Aug-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: insert request not through ->queue_rq into sw/scheduler queue

c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") supposed
to add request which has been through ->queue_rq() to the hw queue dispatch
list, however it adds request running out of budget or driver tag to hw queue
too. This way basically bypasses request merge, and causes too many request
dispatched to LLD, and system% is unnecessary increased.

Fixes this issue by adding request not through ->queue_rq into sw/scheduler
queue, and this way is safe because no ->queue_rq is called on this request
yet.

High %system can be observed on Azure storvsc device, and even soft lock
is observed. This patch reduces %system during heavy sequential IO,
meantime decreases soft lockup risk.

Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d7d8535f 17-Aug-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART

SCHED_RESTART code path is relied to re-run queue for dispatch requests
in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
requests to hctx->dispatch.

memory barriers have to be used for ordering the following two pair of OPs:

1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()

2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().

Without the added memory barrier, either:

1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
dispatch side

or

2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
in dispatch side, meantime checking if there is request in
hctx->dispatch from blk_mq_sched_restart() is missed.

IO hang in ltp/fs_fill test is reported by kernel test robot:

https://lkml.org/lkml/2020/7/26/77

Turns out it is caused by the above out-of-order OPs. And the IO hang
can't be observed any more after applying this patch.

Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 26bfeb26 16-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

block: blk-mq.c: fix @at_head kernel-doc warning

Fix a kernel-doc warning in block/blk-mq.c:

../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'

Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: André Almeida <andrealmeid@collabora.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 70f15a4f 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: blk-mq: delete duplicated word

Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 08c875cb 28-Jul-2020 Daniel Wagner <dwagner@suse.de>

block: Use non _rcu version of list functions for tag_set_list

tag_set_list is only accessed under the tag_set_lock lock. There is
no need for using the _rcu list functions.

The _rcu list function were introduced to allow read access to the
tag_set_list protected under RCU, see 705cda97ee3a ("blk-mq: Make it
safe to use RCU to iterate over blk_mq_tag_set.tag_list") and
05b79413946d ("Revert "blk-mq: don't handle TAG_SHARED in restart"").
Those changes got reverted later but the cleanup commit missed a
couple of places to undo the changes.

Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 87890092 04-Jul-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-mq: remove redundant validation in __blk_mq_end_request()

We've already validated the 'q->elevator' before calling
->ops.completed_request() in blk_mq_sched_completed_request(), thus no
need to validate rq->internal_tag again. Rmove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 568f2700 06-Jul-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: centralise related handling into blk_mq_get_driver_tag

Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7bf13729 01-Jul-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: streamline handling of q->mq_ops->queue_rq result

Current handling of q->mq_ops->queue_rq result is a bit ugly:

- two branches which needs to 'continue' have to check if the
dispatch local list is empty, otherwise one bad request may
be retrieved via 'rq = list_first_entry(list, struct request, queuelist);'

- the branch of 'if (unlikely(ret != BLK_STS_OK))' isn't easy
to follow, since it is actually one error branch.

Streamline this handling, so the code becomes more readable, meantime
potential kernel oops can be avoided in case that the last request in
local dispatch list is failed.

Fixes: fc17b6534eb8 ("blk-mq: switch ->queue_rq return value to blk_status_t")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 05a4fed6 07-Jul-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight()

dm-multipath is the only user of blk_mq_queue_inflight(). When
dm-multipath calls blk_mq_queue_inflight() to check if it has
outstanding IO it can get a false negative. The reason for this is
blk_mq_rq_inflight() doesn't consider requests that are no longer
MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't
called or finished yet) as "inflight".

This causes request-based dm-multipath's dm_wait_for_completion() to
return before all outstanding dm-multipath requests have actually
completed. This breaks DM multipath's suspend functionality because
blk-mq requests complete after DM's suspend has finished -- which
shouldn't happen.

Fix this by considering any request not in the MQ_RQ_IDLE state
(so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
blk_mq_rq_inflight().

Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e2f62e5 01-Jul-2020 Jens Axboe <axboe@kernel.dk>

Revert "blk-mq: put driver tag when this request is completed"

This reverts commits the following commits:

37f4a24c2469a10a4c16c641671bd766e276cf9f
723bf178f158abd1ce6069cb049581b3cb003aab
36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

The last one is the culprit, but we have to go a bit deeper to get this
to revert cleanly. There's been a report that this breaks some MMC
setups [1], and also causes an issue with swap [2]. Until this can be
figured out, revert the offending commits.

[1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
[2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c62b37d9 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: move ->make_request_fn to struct block_device_operations

The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f695ca38 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue argument from blk_queue_split

The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 37f4a24c 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: centralise related handling into blk_mq_get_driver_tag

Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 723bf178 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move blk_mq_put_driver_tag() into blk-mq.c

It is used by blk-mq.c only, so move it to the source file.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 570e9b73 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move blk_mq_get_driver_tag into blk-mq.c

blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to
stay in blk-mq.c, so move it and preparing for cleanup code of
get/put driver tag.

Meantime hctx_may_queue() is moved to header file and it is fine
since it is defined as inline always.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6e6fcbc2 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: support batching dispatch in case of io

More and more drivers want to get batching requests queued from
block layer, such as mmc, and tcp based storage drivers. Also
current in-tree users have virtio-scsi, virtio-blk and nvme.

For none, we already support batching dispatch.

But for io scheduler, every time we just take one request from scheduler
and pass the single request to blk_mq_dispatch_rq_list(). This way makes
batching dispatch not possible when io scheduler is applied. One reason
is that we don't want to hurt sequential IO performance, becasue IO
merge chance is reduced if more requests are dequeued from scheduler
queue.

Try to support batching dispatch for io scheduler by starting with the
following simple approach:

1) still make sure we can get budget before dequeueing request

2) use hctx->dispatch_busy to evaluate if queue is busy, if it is busy
we fackback to non-batching dispatch, otherwise dequeue as many as
possible requests from scheduler, and pass them to blk_mq_dispatch_rq_list().

Wrt. 2), we use similar policy for none, and turns out that SCSI SSD
performance got improved much.

In future, maybe we can develop more intelligent algorithem for batching
dispatch.

Baolin has tested this patch and found that MMC performance is improved[3].

[1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r
[2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/
[3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1fd40b5e 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: pass obtained budget count to blk_mq_dispatch_rq_list

Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare
for supporting fully batching submission.

With the obtained budget count, it is easier to put extra budgets
in case of .queue_rq failure.

Meantime remove the old 'got_budget' parameter.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bbdb3c5d 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: remove dead check from blk_mq_dispatch_rq_list

When BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE is returned from
.queue_rq, the 'list' variable always holds this rq which isn't
queued to LLD successfully.

So blk_mq_dispatch_rq_list() always returns false from the branch
of '!list_empty(list)'.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 75383524 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move getting driver tag and budget into one helper

Move code for getting driver tag and budget into one helper, so
blk_mq_dispatch_rq_list gets a bit simplified, and easier to read.

Meantime move updating of 'no_tag' and 'no_budget_available' into
the branch for handling partial dispatch because that is exactly
consumer of the two local variables.

Also rename the parameter of 'got_budget' as 'ask_budget'.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 445874e8 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: pass hctx to blk_mq_dispatch_rq_list

All requests in the 'list' of blk_mq_dispatch_rq_list belong to same
hctx, so it is better to pass hctx instead of request queue, because
blk-mq's dispatch target is hctx instead of request queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 65c76369 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: pass request queue into get/put budget callback

blk-mq budget is abstract from scsi's device queue depth, and it is
always per-request-queue instead of hctx.

It can be quite absurd to get a budget from one hctx, then dequeue a
request from scheduler queue, and this request may not belong to this
hctx, at least for bfq and deadline.

So fix the mess and always pass request queue to get/put budget
callback.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 42fdc5e4 29-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove the BLK_MQ_REQ_INTERNAL flag

Just check for a non-NULL elevator directly to make the code more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 36a3df5a 29-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: put driver tag when this request is completed

It is natural to release driver tag when this request is completed by
LLD or device since its purpose is for LLD use.

One big benefit is that the released tag can be re-used quicker since
bio_endio() may take too long.

Meantime we don't need to release driver tag for flush request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 40d09b53 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: add a new blk_mq_complete_request_remote API

This is a variant of blk_mq_complete_request_remote that only completes
the request if it needs to be bounced to another CPU or a softirq. If
the request can be completed locally the function returns false and lets
the driver complete it without requring and indirect function call.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 96339526 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a blk_mq_complete_need_ipi helper

Add a helper to decide if we can complete locally or need an IPI.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4c8fc196 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove the get_cpu/put_cpu pair in blk_mq_complete_request

We don't really care if we get migrated during the I/O completion.
In the worth case we either perform an IPI that wasn't required, or
complete the request on a CPU which we just migrated off.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 15f73f5b 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: move failure injection out of blk_mq_complete_request

Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d391a7a3 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: merge the softirq vs non-softirq IPI logic

Both the softirq path for single queue devices and the multi-queue
completion handler share the same logic to figure out if we need an
IPI for the completion and eventually issue it. Merge the two
versions into a single unified code path.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d6cc464c 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: short cut the IPI path in blk_mq_force_complete_rq for !SMP

Let the compile optimize out the entire IPI path, given that we are
obviously not going to use it.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6aab1da6 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: complete polled requests directly

Even for single queue devices there is no point in offloading a polled
completion to the softirq, given that blk_mq_force_complete_rq is called
from the polling thread in that case and thus there are no starvation
issues.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dea6f399 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove raise_blk_irq

By open coding raise_blk_irq in the only caller, and replacing the
ifdef CONFIG_SMP with an IS_ENABLED check the flow in the caller
can be significantly simplified.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 115243f5 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a helper to reise the block softirq

Add a helper to deduplicate the logic that raises the block softirq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c3077b5d 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: merge blk-softirq.c into blk-mq.c

__blk_complete_request is only called from the blk-mq code, and
duplicates a lot of code from blk-mq.c. Move it there to prepare
for better code sharing and simplifications.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fe35ec58 17-Jun-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: update hctx map when use multiple maps

There is an issue when tune the number for read and write queues,
if the total queue count was not changed. The hctx->type cannot
be updated, since __blk_mq_update_nr_hw_queues will return directly
if the total queue count has not been changed.

Reproduce:

dmesg | grep "default/read/poll"
[ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues
cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
48 default

tune the write queues to 24:
echo 24 > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller

dmesg | grep "default/read/poll"
[ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues

cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
48 default

The driver's hardware queue mapping is not same as block layer.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d94ecfc3 05-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: split out a __blk_mq_get_driver_tag helper

Allocation of the driver tag in the case of using a scheduler shares very
little code with the "normal" tag allocation. Split out a new helper to
streamline this path, and untangle it from the complex normal tag
allocation.

This way also avoids to fail driver tag allocation because of inactive hctx
during cpu hotplug, and fixes potential hang risk.

Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bf0beec0 29-May-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: drain I/O when all CPUs in a hctx are offline

Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
up queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

The driver/subsystem has to quiesce the interrupt line and the associated
queue _before_ it gets shutdown in CPU unplug and not fiddle with it
until it's restarted by the core when the CPU is plugged in again."

However, current blk-mq implementation doesn't quiesce hw queue before
the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a
cpuhp state handled after the CPU is down, so there isn't any chance to
quiesce the hctx before shutting down the CPU.

Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
where the last CPU goes away, and wait for completion of in-flight
requests. This guarantees that there is no inflight I/O before shutting
down the managed IRQ.

Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
to wait for completion of in-flight requests from these drivers to avoid
a potential dead-lock. It is safe to do this for stacking drivers as those
do not use interrupts at all and their I/O completions are triggered by
underlying devices I/O completion.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

[hch: different retry mechanism, merged two patches, minor cleanups]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 600c3b0c 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx

blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so
tailor it to the specific requirements, and don't bother the general
fast path code with its special twinkles.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 76647368 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: use BLK_MQ_NO_TAG in more places

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 419c3d5e 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG

To prepare for wider use of this constant give it a more applicable name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7ea4d8a4 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: move more request initialization to blk_mq_rq_ctx_init

Don't split request initialization between __blk_mq_alloc_request and
blk_mq_rq_ctx_init. Also remove the op argument as it can be derived
from the blk_mq_alloc_data structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e6e7abff 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: simplify the blk_mq_get_request calling convention

The bio argument is entirely unused, and the request_queue can be passed
through the alloc_data, given that it needs to be filled out for the
low-level tag allocation anyway. Also rename the function to
__blk_mq_alloc_request as the switch between get and alloc in the call
chains is rather confusing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5d9c305b 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove the bio argument to ->prepare_request

None of the I/O schedulers actually needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b11eab0 29-May-2020 Keith Busch <kbusch@kernel.org>

blk-mq: blk-mq: provide forced completion method

Drivers may need to bypass error injection for error recovery. Rename
__blk_mq_complete_request() to blk_mq_force_complete_rq() and export
that function so drivers may skip potential fake timeouts after they've
reclaimed lost requests.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b5af37ab 26-May-2020 Konstantin Khlebnikov <koct9i@gmail.com>

block: add a blk_account_io_merge_bio helper

Move the non-"new_io" branch of blk_account_io_start() into separate
function. Fix merge accounting for discards (they were counted as write
merges).

The new blk_account_io_merge_bio() doesn't call update_io_ticks() unlike
blk_account_io_start(), as there is no reason for that.

[hch: rebased]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ac7c5675 16-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference

blk_mq_make_request currently needs to grab an q_usage_counter
reference when allocating a request. This is because the block layer
grabs one before calling blk_mq_make_request, but also releases it as
soon as blk_mq_make_request returns. Remove the blk_queue_exit call
after blk_mq_make_request returns, and instead let it consume the
reference. This works perfectly fine for the block layer caller, just
device mapper needs an extra reference as the old problem still
persists there. Open code blk_queue_enter_live in device mapper,
as there should be no other callers and this allows better documenting
why we do a non-try get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 35b371ff 16-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove a pointless queue enter pair in blk_mq_alloc_request_hctx

No need for two queue references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 22fa792c 16-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove a pointless queue enter pair in blk_mq_alloc_request

No need for two queue references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a5ea58110 16-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: move the call to blk_queue_enter_live out of blk_mq_get_request

Move the blk_queue_enter_live calls into the callers, where they can
successively be cleaned up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a892c8d5 13-May-2020 Satya Tangirala <satyat@google.com>

block: Inline encryption support for blk-mq

We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.

We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.

We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.

Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.

Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0512a75b 12-May-2020 Keith Busch <kbusch@kernel.org>

block: Introduce REQ_OP_ZONE_APPEND

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* The target zone must be a sequential write zone
* The BIO must not cross a zone boundary
* The BIO size must not be split to ensure that a single range of LBAs
is written with a single command.

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid write append BIO
splitting, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size is always lower than this limit.
Export this new limit through sysfs and check these limits in bio_full().

Also when a LLDD can't dispatch a request to a specific zone, it
will return BLK_STS_ZONE_RESOURCE indicating this request needs to
be delayed, e.g. because the zone it will be dispatched to is still
write-locked. If this happens set the request aside in a local list
to continue trying dispatching requests such as READ requests or a
WRITE/ZONE_APPEND requests targetting other zones. This way we can
still keep a high queue depth without starving other requests even if
one request can't be served due to zone write-locking.

Finally, make sure that the bio sector position indicates the actual
write position as indicated by the device on completion.

Signed-off-by: Keith Busch <kbusch@kernel.org>
[ jth: added zone-append specific add_page and merge_page helpers ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aa880ad6 12-May-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: reset mapping if failed to update hardware queue count

When we increase hardware queue count, blk_mq_update_queue_map will
reset the mapping between cpu and hardware queue base on the hardware
queue count(set->nr_hw_queues). The mapping cannot be reset if it
encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will
continue using it, then blk_mq_map_swqueue will touch a invalid memory,
because the mapping points to a wrong hctx.

blktest block/030:

null_blk: module loaded
Increasing nr_hw_queues to 8 fails, fallback to 1
==================================================================
BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
Read of size 8 at addr 0000000000000128 by task nproc/8541

CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack+0xa5/0xe6
__kasan_report.cold+0x65/0xbb
kasan_report+0x45/0x60
check_memory_region+0x15e/0x1c0
__kasan_check_read+0x15/0x20
blk_mq_map_swqueue+0x2f2/0x830
__blk_mq_update_nr_hw_queues+0x3df/0x690
blk_mq_update_nr_hw_queues+0x32/0x50
nullb_device_submit_queues_store+0xde/0x160 [null_blk]
configfs_write_file+0x1c4/0x250 [configfs]
__vfs_write+0x4c/0x90
vfs_write+0x14b/0x2d0
ksys_write+0xdd/0x180
__x64_sys_write+0x47/0x50
do_syscall_64+0x6f/0x310
entry_SYSCALL_64_after_hwframe+0x49/0xb3

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 79fab528 07-May-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: rename blk_mq_alloc_rq_maps

rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests,
this function allocs both map and request, make function name align
with funtion.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 03b63b02 07-May-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: rename __blk_mq_alloc_rq_map

rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request,
actually it alloc both map and request, make function name
align with function.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fd689871 07-May-2020 Ming Lei <ming.lei@redhat.com>

block: alloc map and request for new hardware queue

Alloc new map and request for new hardware queue when increse
hardware queue count. Before this patch, it will show a
warning for each new hardware queue, but it's not enough, these
hctx have no maps and reqeust, when a bio was mapped to these
hardware queue, it will trigger kernel panic when get request
from these hctx.

Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system

A corner case can always trigger this panic, there are 96
io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
queues to 96, then nvme will alloc others(32) queues for read, but
blk_mq_update_nr_hw_queues does not alloc map and request for these new
added io queues. So when process read nvme disk, it will trigger kernel
panic when get request from these hardware context.

Reproduce script:

nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1

[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
[ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
[ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
[ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
[ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
[ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
[ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
[ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8040.805647] PKRU: 55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
[ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
[ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
[ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
[ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
[ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
[ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
[ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8229.365476] PKRU: 55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a2584e43 07-May-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: save previous hardware queue count before udpate

blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so
save old set->nr_hw_queues before call this function.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2e194422 07-May-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: free both rq_map and request

Allocation:

__blk_mq_alloc_rq_map
blk_mq_alloc_rq_map
blk_mq_alloc_rq_map
tags = blk_mq_init_tags : kzalloc_node:
tags->rqs = kcalloc_node
tags->static_rqs = kcalloc_node
blk_mq_alloc_rqs
p = alloc_pages_node
tags->static_rqs[i] = p + offset;

Free:

blk_mq_free_rq_map
kfree(tags->rqs);
kfree(tags->static_rqs);
blk_mq_free_tags
kfree(tags);

The page allocated in blk_mq_alloc_rqs cannot be released,
so we should use blk_mq_free_map_and_requests here.

blk_mq_free_map_and_requests
blk_mq_free_rqs
__free_pages : cleanup for blk_mq_alloc_rqs
blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8cf7961d 25-Apr-2020 Christoph Hellwig <hch@lst.de>

block: bypass ->make_request_fn for blk-mq drivers

Call blk_mq_make_request when no ->make_request_fn is set. This is
safe now that blk_alloc_queue always sets up the pointer for make_request
based drivers. This avoids an indirect call in the blk-mq driver I/O
fast path, which is rather expensive due to spectre mitigations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bdf8710d 14-Apr-2020 Christoph Hellwig <hch@lst.de>

block: move dma_pad handling from blk_rq_map_sg into the callers

There are only two callers of blk_rq_map_sg/__blk_rq_map_sg that set
the dma_pad value in the queue. Move the handling into those callers
instead of burdening the common code, and move the ->extra_len field
from struct request to struct scsi_cmnd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cc97923a 14-Apr-2020 Christoph Hellwig <hch@lst.de>

block: move dma drain handling to scsi

Don't burden the common block code with with specifics of the libata DMA
draining mechanism. Instead move most of the code to the scsi midlayer.

That also means the nr_phys_segments adjustments in the blk-mq fast path
can go away entirely, given that SCSI never looks at nr_phys_segments
after mapping the request to a scatterlist.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b9151e7b 20-Apr-2020 Douglas Anderson <dianders@chromium.org>

blk-mq: Add blk_mq_delay_run_hw_queues() API call

We have:
* blk_mq_run_hw_queue()
* blk_mq_delay_run_hw_queue()
* blk_mq_run_hw_queues()

...but not blk_mq_delay_run_hw_queues(), presumably because nobody
needed it before now. Since we need it for a later patch in this
series, add it.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ab3cee37 20-Apr-2020 Douglas Anderson <dianders@chromium.org>

blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick

In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
true and the driver returns BLK_STS_RESOURCE then we'll kick the
queue. However, there's another case where we might need to kick it.
If we were unable to get budget we can be in much the same state as
when the driver returns BLK_STS_RESOURCE, so we should treat it the
same.

It should be noted that even if we add a whole bunch of extra kicking
to the queue in other patches this patch is still important.
Specifically any kicking that happened before we re-spliced leftover
requests into 'hctx->dispatch' wouldn't have found any work, so we
really need to make sure we kick ourselves after we've done the
splicing.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5fe56de7 16-Apr-2020 John Garry <john.garry@huawei.com>

blk-mq: Put driver tag in blk_mq_dispatch_rq_list() when no budget

If in blk_mq_dispatch_rq_list() we find no budget, then we break of the
dispatch loop, but the request may keep the driver tag, evaulated
in 'nxt' in the previous loop iteration.

Fix by putting the driver tag for that request.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 536167d4 06-Apr-2020 Keith Busch <kbusch@kernel.org>

blk-mq: don't commit_rqs() if none were queued

Unburden the drivers from checking if a call to commit_rqs() is valid by
not calling it when there are no requests to commit.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3d745ea5 27-Mar-2020 Christoph Hellwig <hch@lst.de>

block: simplify queue allocation

Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper. Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2f227bb9 27-Mar-2020 Christoph Hellwig <hch@lst.de>

block: add a blk_mq_init_queue_data helper

This allows a driver to pass a queuedata member before ->init_hctx is
called. null_blk currently open codes this logic, but I'd rather have
it in the core to ease future maintainance.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c92a4103 24-Mar-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

block: factor out requeue handling from dispatch code

Factor out the requeue handling from the dispatch code, this will make
subsequent addition of different requeueing schemes easier.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 754a1572 09-Mar-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

block: remove unneeded argument from blk_alloc_flush_queue

Remove 'q' from arguments since it is not used anymore after
commit 7e992f847a08e ("block: remove non mq parts from the
flush code").

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d0930bb8 09-Mar-2020 Bart Van Assche <bvanassche@acm.org>

blk-mq: Fix a recently introduced regression in blk_mq_realloc_hw_ctxs()

q->nr_hw_queues must only be updated once it is known that
blk_mq_realloc_hw_ctxs() has succeeded. Otherwise it can happen that
reallocation fails and that q->nr_hw_queues is larger than the number of
allocated hardware queues. This patch fixes the following crash if
increasing the number of hardware queues fails:

BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x775/0x810
Write of size 8 at addr 0000000000000118 by task check/977

CPU: 3 PID: 977 Comm: check Not tainted 5.6.0-rc1-dbg+ #8
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0xa5/0xe6
__kasan_report.cold+0x65/0x99
kasan_report+0x16/0x20
check_memory_region+0x140/0x1b0
memset+0x28/0x40
blk_mq_map_swqueue+0x775/0x810
blk_mq_update_nr_hw_queues+0x468/0x710
nullb_device_submit_queues_store+0xf7/0x1a0 [null_blk]
configfs_write_file+0x1c4/0x250 [configfs]
__vfs_write+0x4c/0x90
vfs_write+0x145/0x2c0
ksys_write+0xd7/0x180
__x64_sys_write+0x47/0x50
do_syscall_64+0x6f/0x2f0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: ac0d6b926e74 ("block: Reduce the amount of memory required per request queue")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6e66b493 09-Mar-2020 Bart Van Assche <bvanassche@acm.org>

blk-mq: Keep set->nr_hw_queues and set->map[].nr_queues in sync

blk_mq_map_queues() and multiple .map_queues() implementations expect that
set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware
queues. Hence set .nr_queues before calling these functions. This patch
fixes the following kernel warning:

WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137
Call Trace:
blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508
blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525
blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775
process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x361/0x430 kernel/kthread.c:255

Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0
Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cae740a0 26-Feb-2020 John Garry <john.garry@huawei.com>

blk-mq: Remove some unused function arguments

The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(),
blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove
it.

Overall obj code size shows a minor reduction, before:
text data bss dec hex filename
27306 1312 0 28618 6fca block/blk-mq.o
4303 272 0 4575 11df block/blk-mq-tag.o

after:
27282 1312 0 28594 6fb2 block/blk-mq.o
4311 272 0 4583 11e7 block/blk-mq-tag.o

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
--
This minor patch had been carried as part of the blk-mq shared tags RFC,
I'd rather not carry it anymore as it required rebasing, so now or never..
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 01e99aec 24-Feb-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: insert passthrough request into hctx->dispatch directly

For some reason, device may be in one situation which can't handle
FS request, so STS_RESOURCE is always returned and the FS request
will be added to hctx->dispatch. However passthrough request may
be required at that time for fixing the problem. If passthrough
request is added to scheduler queue, there isn't any chance for
blk-mq to dispatch it given we prioritize requests in hctx->dispatch.
Then the FS IO request may never be completed, and IO hang is caused.

So passthrough request has to be added to hctx->dispatch directly
for fixing the IO hang.

Fix this issue by inserting passthrough request into hctx->dispatch
directly together withing adding FS request to the tail of
hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS request
to tail of hctx->dispatch at default, see blk_mq_request_bypass_insert().

Then it becomes consistent with original legacy IO request
path, in which passthrough request is always added to q->queue_head.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 105663f7 06-Jan-2020 André Almeida <andrealmeid@collabora.com>

blk-mq: Document functions for sending request

Add or improve documentation for function regarding creating and sending
IO requests to the hardware.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 95ed0c5b 28-Nov-2019 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: optimise blk_mq_flush_plug_list()

Instead of using list_del_init() in a loop, that generates a lot of
unnecessary memory read/writes, iterate from the first request of a
batch and cut out a sublist with list_cut_before().

Apart from removing the list node initialisation part, this is more
register-friendly, and the assembly uses the stack less intensively.

list_empty() at the beginning is done with hope, that the compiler can
optimise out the same check in the following list_splice_init().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d30a621 28-Nov-2019 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: optimise rq sort function

Check "!=" in multi-layer comparisons. The same memory usage, fewer
instructions, and 2 from 4 jumps are replaced with SETcc.

Note, that list_sort() doesn't differ 0 and <0.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cb711b91 13-Nov-2019 John Garry <john.garry@huawei.com>

blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()

These functions are not referenced, so delete them.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 626fb735 29-Oct-2019 John Garry <john.garry@huawei.com>

blk-mq: Make blk_mq_run_hw_queue() return void

Since commit 97889f9ac24f ("blk-mq: remove synchronize_rcu() from
blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue()
is never checked, so make it return void, which very marginally simplifies
the code.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1fead718 25-Oct-2019 André Almeida <andrealmeid@collabora.com>

blk-mq: remove needless goto from blk_mq_get_driver_tag

The only usage of the label "done" is when (rq->tag != -1) at the
beginning of the function. Rather than jumping to label, we can just
remove this label and execute the code at the "if". Besides that, the
code that would be executed after the label "done" is the return of the
logical expression (rq->tag != -1) but since we are already inside the
if, we now that this is true. Remove the label and replace the goto with
the proper result of the label.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f7e76dbc 25-Oct-2019 Bart Van Assche <bvanassche@acm.org>

block: Reduce the amount of memory used for tag sets

Instead of allocating an array of size nr_cpu_ids for set->tags, allocate
an array of size set->nr_hw_queues. This patch improves behavior that was
introduced by commit 868f2f0b7206 ("blk-mq: dynamic h/w context count").

Reallocating tag sets from inside __blk_mq_update_nr_hw_queues() is safe
because:
- All request queues that share the tag sets are frozen before the tag sets
are reallocated.
- blk_mq_queue_tag_busy_iter() holds q->q_usage_counter while active and
hence is serialized against __blk_mq_update_nr_hw_queues().

Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ac0d6b92 25-Oct-2019 Bart Van Assche <bvanassche@acm.org>

block: Reduce the amount of memory required per request queue

Instead of always allocating at least nr_cpu_ids hardware queues per request
queue, reallocate q->queue_hw_ctx if it has to grow. This patch improves
behavior that was introduced by commit 868f2f0b7206 ("blk-mq: dynamic h/w
context count").

Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a9a80808 25-Oct-2019 Bart Van Assche <bvanassche@acm.org>

block: Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()

Since the blk_mq_{,un}freeze_queue() calls in __blk_mq_update_nr_hw_queues()
already serialize __blk_mq_update_nr_hw_queues() against
blk_mq_queue_tag_busy_iter(), the synchronize_rcu() call in
__blk_mq_update_nr_hw_queues() is not necessary. Hence remove it.

Note: the synchronize_rcu() call in __blk_mq_update_nr_hw_queues() was
introduced by commit f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter"). Commit 530ca2c9bd69 ("blk-mq: Allow blocking
queue tag iter callbacks") removed the rcu_read_{,un}lock() calls that
correspond to the synchronize_rcu() call in __blk_mq_update_nr_hw_queues().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a2e80f6f 30-Sep-2019 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: Embed counters into struct mq_inflight

Store inflight counters immediately in struct mq_inflight.
That's type-safer and removes extra indirection.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bb4e6b14 30-Sep-2019 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: Reuse callback in blk_mq_in_flight*()

Reuse a more generic callback in both blk_mq_in_flight() and
blk_mq_in_flight_rw().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27a46989 30-Sep-2019 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: Inline status checkers

blk_mq_request_completed() and blk_mq_request_started() are
short, inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 95662565 30-Sep-2019 Bart Van Assche <bvanassche@acm.org>

block: Remove request_queue.nr_queues

Commit 897bb0c7f1ea ("blk-mq: Use proper cpumask iterator"; v4.6)
removed the last use of request_queue.nr_queues from outside
blk_mq_init_allocate_queue(). Remove this member variable to make
struct request_queue smaller. This patch does not change any
functionality.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3154df26 27-Sep-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: apply normal plugging for HDD

Some HDD drive may expose multiple hardware queues, such as MegraRaid.
Let's apply the normal plugging for such devices because sequential IO
may benefit a lot from plug merging.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a12de1d4 27-Sep-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: honor IO scheduler for multiqueue devices

If a device is using multiple queues, the IO scheduler may be bypassed.
This may hurt performance for some slow MQ devices, and it also breaks
zoned devices which depend on mq-deadline for respecting the write order
in one zone.

Don't bypass io scheduler if we have one setup.

This patch can double sequential write performance basically on MQ
scsi_debug when mq-deadline is applied.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Javier GonzĂ¡lez <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8d699663 27-Sep-2019 Yufen Yu <yuyufen@huawei.com>

block: fix null pointer dereference in blk_mq_rq_timed_out()

We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:

[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40

The bug is caused by the race between timeout handle and completion for
flush request.

When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.

After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.

However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.

We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>

-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 54d4e6ab 16-Sep-2019 Max Gurtovoy <maxg@mellanox.com>

block: centralize PI remapping logic to the block layer

Currently t10_pi_prepare/t10_pi_complete functions are called during the
NVMe and SCSi layers command preparetion/completion, but their actual
place should be the block layer since T10-PI is a general data integrity
feature that is used by block storage protocols. Introduce .prepare_fn
and .complete_fn callbacks within the integrity profile that each type
can implement according to its needs.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>

Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
isn't defined in the config.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9a91b05b 21-May-2019 Hou Tao <houtao1@huawei.com>

block: also check RQF_STATS in blk_mq_need_time_stamp()

In __blk_mq_end_request() if block stats needs update, we should
ensure now is valid instead of 0 even when iostat is disabled.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3d244306 21-May-2019 Hou Tao <houtao1@huawei.com>

block: make rq sector size accessible for block stats

Currently rq->data_len will be decreased by partial completion or
zeroed by completion, so when blk_stat_add() is invoked, data_len
will be zero and there will never be samples in poll_cb because
blk_mq_poll_stats_bkt() will return -1 if data_len is zero.

We could move blk_stat_add() back to __blk_mq_complete_request(),
but that would make the effort of trying to call ktime_get_ns()
once in vain. Instead we can reuse throtl_size field, and use
it for both block stats and block throttle, and adjust the
logic in blk_mq_poll_stats_bkt() accordingly.

Fixes: 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()")
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 737eb78e 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Delay default elevator initialization

When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
case, a call to elevator_init_mq() is added to __device_add_disk(),
resulting in the delayed initialization of the queue elevator
after the device driver finished probing the device information. This
effectively allows elevator_init_mq() access to more information
about the device.
2) elevator_init = true preserves the current behavior of initializing
the elevator directly from blk_mq_init_allocated_queue(). This case
is used for the special request based DM devices where the device
gendisk is created before the queue initialization and device
information (e.g. queue limits) is already known when the queue
initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 954b4a5c 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Change elevator_init_mq() to always succeed

If the default elevator chosen is mq-deadline, elevator_init_mq() may
return an error if mq-deadline initialization fails, leading to
blk_mq_init_allocated_queue() returning an error, which in turn will
cause the block device initialization to fail and the device not being
exposed.

Instead of taking such extreme measure, handle mq-deadline
initialization failures in the same manner as when mq-deadline is not
available (no module to load), that is, default to the "none" scheduler.
With this change, elevator_init_mq() return type can be changed to void.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 61db437d 05-Sep-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Cleanup elevator_init_mq() use

Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
calling elevator_init_mq() to make sure that the queue supports IO
scheduling, use the elevator.c function elv_support_iosched() in
elevator_init_mq(). This does not introduce any functional change but
ensure that elevator_init_mq() does the right thing based on the queue
settings.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6f816b4b 28-Aug-2019 Tejun Heo <tj@kernel.org>

blk-mq: add optional request->alloc_time_ns

There are currently two start time timestamps - start_time_ns and
io_start_time_ns. The former marks the request allocation and and the
second issue-to-device time. The planned io.weight controller needs
to measure the total time bios take to execute after it leaves rq_qos
including the time spent waiting for request to become available,
which can easily dominate on saturated devices.

This patch adds request->alloc_time_ns which records when the request
allocation attempt started. As it isn't used for the usual stats,
make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
no users and it's active only on queues which need it even when
compiled in.

v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
gating as suggested by Jens.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6ba9333 27-Aug-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue

blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
isn't exposed to userspace yet. For the latter caller, hctx sysfs entries
and debugfs are un-registered before updating nr_hw_queues.

On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
request queue is freed") moves freeing hctx into queue's release
handler, so there won't be race with queue release path too.

So don't hold q->sysfs_lock in blk_mq_map_swqueue().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b6620d7 15-Aug-2019 Jens Axboe <axboe@kernel.dk>

block: remove REQ_NOWAIT_INLINE

We had a few issues with this code, and there's still a problem around
how we deal with error handling for chained/split bios. For now, just
revert the code and we'll try again with a thoroug solution. This
reverts commits:

e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 73d9c8d4 23-Jul-2019 zhengbin <zhengbin13@huawei.com>

blk-mq: Fix memory leak in blk_mq_init_allocated_queue error handling

If blk_mq_init_allocated_queue->elevator_init_mq fails, need to release
the previously requested resources.

Fixes: d34849913819 ("blk-mq-sched: allow setting of default IO scheduler")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e26cc082 12-Aug-2019 zhengbin <zhengbin13@huawei.com>

blk-mq: move cancel of requeue_work to the front of blk_exit_queue

blk_exit_queue will free elevator_data, while blk_mq_requeue_work
will access it. Move cancel of requeue_work to the front of
blk_exit_queue to avoid use-after-free.

blk_exit_queue blk_mq_requeue_work
__elevator_exit blk_mq_run_hw_queues
blk_mq_exit_sched blk_mq_run_hw_queue
dd_exit_queue blk_mq_hctx_has_pending
kfree(elevator_data) blk_mq_sched_has_work
dd_has_work

Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a87ccce0 23-Jul-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: remove blk_mq_complete_request_sync

blk_mq_tagset_wait_completed_request() has been applied for waiting
for completed request's fn, so not necessary to use
blk_mq_complete_request_sync() any more.

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aa306ab7 23-Jul-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce blk_mq_request_completed()

NVMe needs this function to decide if one request to be aborted has
been completed in normal IO path already.

So introduce it.

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9dd8813e 30-Jul-2019 Thomas Gleixner <tglx@linutronix.de>

hrtimer/treewide: Use hrtimer_sleeper_start_expires()

hrtimer_sleepers will gain a scheduling class dependent treatment on
PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
that possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# dbc1625f 26-Jul-2019 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

hrtimer: Consolidate hrtimer_init() + hrtimer_init_sleeper() calls

hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
object which is embedded into the hrtimer_sleeper.

Combine the initialization and spare a function call. Fixup all call sites.

This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
specific initializations of the embedded hrtimer without modifying any of
the call sites.

No functional change.

[ anna-maria: Minor cleanups ]
[ tglx: Adopted to the removal of the task argument of
hrtimer_init_sleeper() and trivial polishing.
Folded a fix from Stephen Rothwell for the vsoc code ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de


# b7449487 26-Jul-2019 Thomas Gleixner <tglx@linutronix.de>

hrtimer: Remove task argument from hrtimer_init_sleeper()

All callers hand in 'current' and that's the only task pointer which
actually makes sense. Remove the task argument and set current in the
function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.791885290@linutronix.de


# 327fe1d4 22-Jul-2019 Marcos Paulo de Souza <marcos.souza.org@gmail.com>

block: blk-mq: Remove blk_mq_sched_started_request and started_request

blk_mq_sched_completed_request is a function that checks if the elevator
related to the request has started_request implemented, but currently, none of
the available IO schedulers implement started_request, so remove both.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 893a1c97 16-Jul-2019 Jens Axboe <axboe@kernel.dk>

blk-mq: allow REQ_NOWAIT to return an error inline

By default, if a caller sets REQ_NOWAIT and we need to block, we'll
return -EAGAIN through the bio->bi_end_io() callback. For some use
cases, this makes it hard to use.

Allow a caller to ask for inline return of errors related to
blocking by also setting REQ_NOWAIT_INLINE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b49773e7 10-Jul-2019 Damien Le Moal <damien.lemoal@wdc.com>

block: Disable write plugging for zoned block devices

Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contextx to be
reordered if plugging is done outside of the mutual exclusion section,
e.g. the plug was started by a function higher in the call chain than
the function issuing BIOs.

Context A Context B

| blk_start_plug()
| ...
| seq_write_zone()
| mutex_lock(zone)
| bio-0->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-0)
| submit_bio(bio-0)
| bio-1->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-1)
| submit_bio(bio-1)
| mutex_unlock(zone)
| return
| -----------------------> | seq_write_zone()
| mutex_lock(zone)
| bio-2->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-2)
| submit_bio(bio-2)
| mutex_unlock(zone)
| <------------------------- |
| blk_finish_plug()

In the above example, despite the mutex synchronization ensuring the
correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
issued after BIO 2 of context B, when the plug is released with
blk_finish_plug().

While this problem can be addressed using the blk_flush_plug_list()
function (in the above example, the call must be inserted before the
zone mutex lock is released), a simple generic solution in the block
layer avoid this additional code in all zoned block device user code.
The simple generic solution implemented with this patch is to introduce
the internal helper function blk_mq_plug() to access the current
context plug on BIO submission. This helper returns the current plug
only if the target device is not a zoned block device or if the BIO to
be plugged is not a write operation. Otherwise, the caller context plug
is ignored and NULL returned, resulting is all writes to zoned block
device to never be plugged.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 970d168d 01-Jul-2019 Bart Van Assche <bvanassche@acm.org>

blk-mq: simplify blk_mq_make_request()

Move the blk_mq_bio_to_request() call in front of the if-statement.

Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c05f4220 01-Jul-2019 Bart Van Assche <bvanassche@acm.org>

blk-mq: remove blk_mq_put_ctx()

No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
on preemption being disabled for its correctness. Since removing the CPU
preemption calls does not measurably affect performance, simplify the
blk-mq code by removing the blk_mq_put_ctx() function and also by not
disabling preemption in blk_mq_get_ctx().

Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 14ccb66b 05-Jun-2019 Christoph Hellwig <hch@lst.de>

block: remove the bi_phys_segments field in struct bio

We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio, and return it from a variant of
blk_queue_split instead of that it can passed as an argument to
those functions that need the value.

This also means we stop recounting segments except for cloning
and partial segments.

To keep the number of arguments in this how path down remove
pointless struct request_queue arguments from any of the functions
that had it and grew a nr_segs argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f924cdde 05-Jun-2019 Christoph Hellwig <hch@lst.de>

block: remove blk_init_request_from_bio

lightnvm should have never used this function, as it is sending
passthrough requests, so switch it to blk_rq_append_bio like all the
other passthrough request users. Inline blk_init_request_from_bio into
the only remaining caller.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Javier GonzĂ¡lez <javier@javigon.com>
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 41de54c6 19-Apr-2019 Jes Sorensen <jsorensen@fb.com>

blk-mq: Fix memory leak in error handling

If blk_mq_init_allocated_queue() fails, make sure to free the poll
stat callback struct allocated.

Signed-off-by: Jes Sorensen <jsorensen@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7996a8b5 20-May-2019 Bob Liu <bob.liu@oracle.com>

blk-mq: fix hang caused by freeze/unfreeze sequence

The following is a description of a hang in blk_mq_freeze_queue_wait().
The hang happens on attempt to freeze a queue while another task does
queue unfreeze.

The root cause is an incorrect sequence of percpu_ref_resurrect() and
percpu_ref_kill() and as a result those two can be swapped:

CPU#0 CPU#1
---------------- -----------------
q1 = blk_mq_init_queue(shared_tags)

q2 = blk_mq_init_queue(shared_tags):
blk_mq_add_queue_tag_set(shared_tags):
blk_mq_update_tag_set_depth(shared_tags):
list_for_each_entry()
blk_mq_freeze_queue(q1)
> percpu_ref_kill()
> blk_mq_freeze_queue_wait()

blk_cleanup_queue(q1)
blk_mq_freeze_queue(q1)
> percpu_ref_kill()
^^^^^^ freeze_depth can't guarantee the order

blk_mq_unfreeze_queue()
> percpu_ref_resurrect()

> blk_mq_freeze_queue_wait()
^^^^^^ Hang here!!!!

This wrong sequence raises kernel warning:
percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(),
which waits for a zero of a q_usage_counter, which never happens
because percpu-ref was reinited (instead of being killed) and stays in
PERCPU state forever.

How to reproduce:
- "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
- cpu0: python Script.py 0; taskset the corresponding process running on cpu0
- cpu1: python Script.py 1; taskset the corresponding process running on cpu1

Script.py:
------
#!/usr/bin/python3

import os
import sys

while True:
on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
os.system(on)
os.system(off)
------

This bug was first reported and fixed by Roman, previous discussion:
[1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
[2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
[3] https://patchwork.kernel.org/patch/9268199/

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2f8f1336 29-Apr-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: always free hctx after request queue is freed

In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.

Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7c6c5b7c 29-Apr-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: split blk_mq_alloc_and_init_hctx into two parts

Split blk_mq_alloc_and_init_hctx into two parts, and one is
blk_mq_alloc_hctx() for allocating all hctx resources, another
is blk_mq_init_hctx() for initializing hctx, which serves as
counter-part of blk_mq_exit_hctx().

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c7e2d94b 29-Apr-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: free hw queue's resource in hctx's release handler

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:

https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fbc2a15e 29-Apr-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: move cancel of requeue_work into blk_mq_release

With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.

So moving the cancel of requeue_work into blk_mq_release() for
avoiding race between requeue and freeing queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 273938bf 02-May-2019 Raul E Rangel <rrangel@chromium.org>

block: fix function name in comment

The comment was out of date.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Raul E Rangel <rrangel@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3dcf60bc 30-Apr-2019 Christoph Hellwig <hch@lst.de>

block: add SPDX tags to block layer files missing licensing information

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 77f1e0a5 18-Jan-2019 Jens Axboe <axboe@kernel.dk>

bfq: update internal depth state when queue depth changes

A previous commit moved the shallow depth and BFQ depth map calculations
to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the users changes the depth of the scheduler
map, by writing to the 'nr_requests' sysfs file for that device.

Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
the depth changes, so that the scheduler can update its internal state.

Tested-by: Kai Krakow <kai@kaishome.de>
Reported-by: Paolo Valente <paolo.valente@linaro.org>
Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1b8f21b7 08-Apr-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce blk_mq_complete_request_sync()

In NVMe's error handler, follows the typical steps of tearing down
hardware for recovering controller:

1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
blk_mq_tagset_busy_iter(tags, cancel_request, ...)
cancel_request():
mark the request as abort
blk_mq_complete_request(req);
4) destroy real hw queues

However, there may be race between #3 and #4, because blk_mq_complete_request()
may run q->mq_ops->complete(rq) remotelly and asynchronously, and
->complete(rq) may be run after #4.

This patch introduces blk_mq_complete_request_sync() for fixing the
above race.

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fd9c40f6 04-Apr-2019 Bart Van Assche <bvanassche@acm.org>

block: Revert v5.0 blk_mq_request_issue_directly() changes

blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
have been queued. If that happens when blk_mq_try_issue_directly() is called
by the dm-mpath driver then dm-mpath will try to resubmit a request that is
already queued and a kernel crash follows. Since it is nontrivial to fix
blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
changes that went into kernel v5.0.

This patch reverts the following commits:
* d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
* 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
* 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: James Smart <james.smart@broadcom.com>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bcc816df 03-Apr-2019 Dongli Zhang <dongli.zhang@oracle.com>

blk-mq: do not reset plug->rq_count before the list is sorted

We would never be able to sort the list if we first reset plug->rq_count
which is used in conditional check later.

Fixes: ce5b009cff19 ("block: improve logic around when to sort a plug list")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ff3b74b8 26-Mar-2019 Yufen Yu <yuyufen@huawei.com>

blk-mq: add trace block plug and unplug for multiple queues

For now, we just trace plug for single queue device or drivers
provide .commit_rqs, and have not trace plug for multiple queues
device. But, unplug events will be recorded when call
blk_mq_flush_plug_list(). Then, trace events will be asymmetrical,
just have unplug and without plug.

This patch add trace plug and unplug for multiple queues device in
blk_mq_make_request(). After that, we can accurately trace plug and
unplug for multiple queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b9a1ff50 01-Apr-2019 Shenghui Wang <shhuiw@foxmail.com>

block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx

kfree() can leak the hctx->fq->flush_rq field.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e8618575 25-Mar-2019 Jens Axboe <axboe@kernel.dk>

blk-mq: fix sbitmap ws_active for shared tags

We now wrap sbitmap waitqueues in an active counter, so we can avoid
iterating wakeups unless we have waiters there. This works as long as
everyone that's manipulating the waitqueues use the proper helpers. For
the tag wait case for shared tags, however, we add ourselves to the
waitqueue without incrementing/decrementing the ->ws_active count. This
means that wakeups can take a long time to happen.

Fix this by manually doing the inc/dec as needed for the wait queue
handling.

Reported-by: Michael Leun <kbug@newton.leun.net>
Tested-by: Michael Leun <kbug@newton.leun.net>
Cc: stable@vger.kernel.org
Reviewed-by: Omar Sandoval <osandov@fb.com>
Fixes: 5d2ee7122c73 ("sbitmap: optimize wakeup check")
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 85fae294 24-Mar-2019 Yufen Yu <yuyufen@huawei.com>

blk-mq: update comment for blk_mq_hctx_has_pending()

For now, blk_mq_hctx_has_pending() checks any of ctx, hctx->dispatch
or io scheduler have pending work. So, update the comment accordingly.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e6c98712 20-Mar-2019 Bart Van Assche <bvanassche@acm.org>

block: Unexport blk_mq_add_to_requeue_list()

This function is not used outside the block layer core. Hence unexport it.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 29ece8b4 18-Mar-2019 Yufen Yu <yuyufen@huawei.com>

block: add BLK_MQ_POLL_CLASSIC for hybrid poll and return EINVAL for unexpected value

For q->poll_nsec == -1, means doing classic poll, not hybrid poll.
We introduce a new flag BLK_MQ_POLL_CLASSIC to replace -1, which
may make code much easier to read.

Additionally, since val is an int obtained with kstrtoint(), val can be
a negative value other than -1, so return -EINVAL for that case.

Thanks to Damien Le Moal for some good suggestion.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 684b7324 14-Mar-2019 Yufen Yu <yuyufen@huawei.com>

blk-mq: use blk_mq_sched_mark_restart_hctx to set RESTART

Let blk_mq_mark_tag_wait() use the blk_mq_sched_mark_restart_hctx()
to set BLK_MQ_S_SCHED_RESTART.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d76f856 27-Feb-2019 Dongli Zhang <dongli.zhang@oracle.com>

blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map

Replace set->map[0] with set->map[HCTX_TYPE_DEFAULT] to avoid hardcoding.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2705c937 15-Feb-2019 Ming Lei <ming.lei@redhat.com>

block: kill QUEUE_FLAG_NO_SG_MERGE

Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().

For another user of blk_recalc_rq_segments():

- run in partial completion branch of blk_update_request, which is an unusual case

- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.

Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aef1897c 11-Feb-2019 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue

When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),

kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]

(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e5fa8140 08-Feb-2019 Aleksei Zakharov <zakharov.a.g@yandex.ru>

block: avoid setting nr_requests to current value

There's no reason to freeze queue and set nr_requests value
if current value is the same.

Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8b3238ca 06-Dec-2018 Christoph Hellwig <hch@lst.de>

scsi: block: remove bidi support

Unused now, and another field in struct request bites the dust.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# 69ed175c 09-Nov-2018 Christoph Hellwig <hch@lst.de>

scsi: block: remove req->special

No users left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# bb94aea1 24-Jan-2019 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: save default hctx into ctx->hctxs for not-supported type

Currently, we check whether the hctx type is supported every time
in hot path. Actually, this is not necessary, we could save the
default hctx into ctx->hctxs if the type is not supported when
map swqueues and use it directly with ctx->hctxs[type].

We also needn't check whether the poll is enabled or not, because
the caller would clear the REQ_HIPRI in that case.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ccdf4a3 24-Jan-2019 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: save queue mapping result into ctx directly

Currently, the queue mapping result is saved in a two-dimensional
array. In the hot path, to get a hctx, we need do following:

q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

This isn't very efficient. We could save the queue mapping result into
ctx directly with different hctx type, like,

ctx->hctxs[type]

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7809167d 16-Jan-2019 Ming Lei <ming.lei@redhat.com>

block: don't lose track of REQ_INTEGRITY flag

We need to pass bio->bi_opf after bio intergrity preparing, otherwise
the flag of REQ_INTEGRITY may not be set on the allocated request, then
breaks block integrity.

Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping")
Cc: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b7ab780 14-Dec-2018 Sagi Grimberg <sagi@grimberg.me>

block: make request_to_qc_t public

block consumers will need it for polling requests that
are sent with blk_execute_rq_nowait. Also, get rid of
blk_tag_to_qc_t and open-code it instead.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# cd19181b 17-Dec-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: enable IO poll if .nr_queues of type poll > 0

The queue mapping of type poll only exists when set->map[HCTX_TYPE_POLL].nr_queues
is bigger than zero, so enhance the constraint by checking .nr_queues of type poll
before enabling IO poll.

Otherwise IO race & timeout can be observed when running block/007.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3c94d83c 17-Dec-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()

There's a single user of this function, dm, and dm just wants
to check if IO is inflight, not that it's just allocated.

This fixes a hang with srp/002 in blktests with dm, where it tries
to suspend but waits for inflight IO to finish first. As it checks
for just allocated requests, this fails.

Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e5edd5f2 17-Dec-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: skip zero-queue maps in blk_mq_map_swqueue

From 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping
table won't be initialized actually if map->nr_queues is zero, so
we can't use blk_mq_map_queue_type() to retrieve hctx any more.

This way still may cause broken mapping, fix it by skipping zero-queues
maps in blk_mq_map_swqueue().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c16d6b5a 17-Dec-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fix dispatch from sw queue

When a request is added to rq list of sw queue(ctx), the rq may be from
a different type of hctx, especially after multi queue mapping is
introduced.

So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or
blk_mq_dequeue_from_ctx(), one request belonging to other queue type of
hctx can be dispatched to current hctx in case that read queue or poll
queue is enabled.

This patch fixes this issue by introducing per-queue-type list.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Changed by me to not use separately cacheline aligned lists, just
place them all in the same cacheline where we had just the one list
and lock before.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 07b35eb5 17-Dec-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fix allocation for queue mapping table

Type of each element in queue mapping table is 'unsigned int,
intead of 'struct blk_mq_queue_map)', so fix it.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d6a51a97 13-Dec-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: replace and kill blk_mq_request_issue_directly

Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly
in blk_insert_cloned_request and kill it as nobody uses it any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b7a6f12 13-Dec-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests

It is not necessary to issue request directly with bypass 'true'
in blk_mq_sched_insert_requests and handle the non-issued requests
itself. Just set bypass to 'false' and let blk_mq_try_issue_directly
handle them totally. Remove the blk_rq_can_direct_dispatch check,
because blk_mq_try_issue_directly can handle it well.If request is
direct-issued unsuccessfully, insert the reset.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7f556a44 13-Dec-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: refactor the code of issue request directly

Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
into one interface to unify the interfaces to issue requests
directly. The merged interface takes over the requests totally,
it could insert, end or do nothing based on the return value of
.queue_rq and 'bypass' parameter. Then caller needn't any other
handling any more and then code could be cleaned up.

And also the commit c616cbee ( blk-mq: punt failed direct issue
to dispatch list ) always inserts requests to hctx dispatch list
whenever get a BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, this is
overkill and will harm the merging. We just need to do that for
the requests that has been through .queue_rq. This patch also
could fix this.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e016b782 06-Dec-2018 Mikulas Patocka <mpatocka@redhat.com>

block: return just one value from part_in_flight

The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code, there's no functional change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 59388702 06-Dec-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: re-build queue map in case of kdump kernel

Now almost all .map_queues() implementation based on managed irq
affinity doesn't update queue mapping and it just retrieves the
old built mapping, so if nr_hw_queues is changed, the mapping talbe
includes stale mapping. And only blk_mq_map_queues() may rebuild
the mapping talbe.

One case is that we limit .nr_hw_queues as 1 in case of kdump kernel.
However, drivers often builds queue mapping before allocating tagset
via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set
as 1 in case of kdump kernel, so wrong queue mapping is used, and
kernel panic[1] is observed during booting.

This patch fixes the kernel panic triggerd on nvme by rebulding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[ 4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[ 4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 4.444681] PGD 0 P4D 0
[ 4.445367] Oops: 0000 [#1] SMP NOPTI
[ 4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[ 4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[ 4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[ 4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
[ 4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
[ 4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
[ 4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
[ 4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
[ 4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
[ 4.469220] FS: 0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
[ 4.471554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
[ 4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4.477061] PKRU: 55555554
[ 4.477464] Call Trace:
[ 4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
[ 4.479595] blk_mq_init_queue+0x32/0x4e
[ 4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
[ 4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[ 4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[ 4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
[ 4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
[ 4.483930] ? try_to_wake_up+0x38d/0x3b3
[ 4.484478] ? process_one_work+0x179/0x2fc
[ 4.485118] process_one_work+0x1d3/0x2fc
[ 4.485655] ? rescuer_thread+0x2ae/0x2ae
[ 4.486196] worker_thread+0x1e9/0x2be
[ 4.486841] kthread+0x115/0x11d
[ 4.487294] ? kthread_park+0x76/0x76
[ 4.487784] ret_from_fork+0x3a/0x50
[ 4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[ 4.489428] Dumping ftrace buffer:
[ 4.489939] (ftrace buffer empty)
[ 4.490492] CR2: 0000000000000098
[ 4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Cc: David Milburn <dmilburn@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c616cbee 06-Dec-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: punt failed direct issue to dispatch list

After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.

Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.

This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ffe81d45 04-Dec-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: fix corruption with direct issue

If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.

See also:
https://bugzilla.kernel.org/show_bug.cgi?id=201685

Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6544d229 02-Dec-2018 Christoph Hellwig <hch@lst.de>

block: enable polling by default if a poll map is initalized

If the user did setup polling in the driver we should not require
another know in the block layer to enable it.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 529262d5 02-Dec-2018 Christoph Hellwig <hch@lst.de>

block: remove ->poll_fn

This was intended to support users like nvme multipath, but is just
getting in the way and adding another indirect call.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fe1f4526 28-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: don't call ktime_get_ns() if we don't need it

We only need the request fields and the end_io time if we have
stats enabled, or if we have a scheduler attached as those may
use it for completion time stats.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b2c5d16b 29-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: use plug for devices that implement ->commits_rqs()

If we have that hook, we know the driver handles bd->last == true in
a smart fashion. If it does, even for multiple hardware queues, it's
a good idea to flush batches of requests to the device, if we have
batches of requests from the submitter.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# be94f058 24-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: use bd->last == true for list inserts

If we are issuing a list of requests, we know if we're at the last one.
If we fail issuing, ensure that we call ->commits_rqs() to flush any
potential previous requests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d666ba98 27-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: add mq_ops->commit_rqs()

blk-mq passes information to the hardware about any given request being
the last that we will issue in this sequence. The point is that hardware
can defer costly doorbell type writes to the last request. But if we run
into errors issuing a sequence of requests, we may never send the request
with bd->last == true set. For that case, we need a hook that tells the
hardware that nothing else is coming right now.

For failures returned by the drivers ->queue_rq() hook, the driver is
responsible for flushing pending requests, if it uses bd->last to
optimize that part. This works like before, no changes there.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ce5b009c 27-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: improve logic around when to sort a plug list

Only do it if we have requests for multiple queues in the same
plug.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e6db0f2 29-Nov-2018 Dan Carpenter <dan.carpenter@oracle.com>

blk-mq: Add a NULL check in blk_mq_free_map_and_requests()

I recently found some code which called blk_mq_free_map_and_requests()
with a NULL set->tags pointer. I fixed the caller, but it seems like a
good idea to add a NULL check here as well. Now we can call:

blk_mq_free_tag_set(set);
blk_mq_free_tag_set(set);

twice in a row and it's harmless.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4711b573 27-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: fix failure to decrement plug count on single rq removal

If we yank a 'same_queue_rq' request off the plug list, we should
also decrement the cached request count.

Fixes: 5f0ed774ed29 ("block: sum requests in the plug structure")
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5f0ed774 23-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: sum requests in the plug structure

This isn't exactly the same as the previous count, as it includes
requests for all devices. But that really doesn't matter, if we have
more than the threshold (16) queued up, flush it. It's not worth it
to have an expensive list loop for this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# af78ff7c 26-Nov-2018 Keith Busch <kbusch@kernel.org>

blk-mq: Simplify request completion state

There are no more users relying on blk-mq request states to prevent
double completions, so replace the relatively expensive cmpxchg operation
with WRITE_ONCE.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 16c15eb1 26-Nov-2018 Keith Busch <kbusch@kernel.org>

blk-mq: Return true if request was completed

A driver may have internal state to cleanup if we're pretending a request
didn't complete. Return 'false' if the command wasn't actually completed
due to the timeout error injection, and true otherwise.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ab32bf3 18-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: never redirect polled IO completions

It's pointless to do so, we are by definition on the CPU we want/need
to be, as that's the one waiting for a completion event.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aa61bec3 13-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: ensure mq_ops ->poll() is entered at least once

Right now we immediately bail if need_resched() is true, but
we need to do at least one loop in case we have entries waiting.
So just invert the need_resched() check, putting it at the
bottom of the loop.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a1b8b87 26-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: make blk_poll() take a parameter on whether to spin or not

blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.

Existing callers are converted to pass in 'spin == true', to retain
the old behavior.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9743139c 16-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: remove 'tag' parameter from mq_ops->poll()

We always pass in -1 now and none of the callers use the tag value,
remove the parameter.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1052b8ac 26-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: when polling for IO, look for any completion

If we want to support async IO polling, then we have to allow finding
completions that aren't just for the one we are looking for. Always pass
in -1 to the mq_ops->poll() helper, and have that return how many events
were found in this poll loop.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1db4909e 19-Nov-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
from block layer's view, actually they don't because userspace may
grab one kobject anytime via sysfs.

This patch fixes the issue by the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
.mq_kobj is always released after all ctxs are freed.

This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e2b3fa5a 19-Nov-2018 Damien Le Moal <damien.lemoal@wdc.com>

block: Remove bio->bi_ioc

bio->bi_ioc is never set so always NULL. Remove references to it in
bio_disassociate_task() and in rq_ioc() and delete this field from
struct bio. With this change, rq_ioc() always returns
current->io_context without the need for a bio argument. Further
simplify the code and make it more readable by also removing this
helper, which also allows to simplify blk_mq_sched_assign_ioc() by
removing its bio argument.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 85f4d4b6 06-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: have ->poll_fn() return number of entries polled

We currently only really support sync poll, ie poll with 1 IO in flight.
This prepares us for supporting async poll.

Note that the returned value isn't necessarily 100% accurate. If poll
races with IRQ completion, we assume that the fact that the task is now
runnable means we found at least one entry. In reality it could be more
than 1, or not even 1. This is fine, the caller will just need to take
this into account.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 849a3700 16-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: avoid ordered task state change for polled IO

For the core poll helper, the task state setting don't need to imply any
atomics, as it's the current task itself that is being modified and
we're not going to sleep.

For IRQ driven, the wakeup path have the necessary barriers to not need
us using the heavy handed version of the task state setting.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 344e9ffc 15-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: add queue_is_mq() helper

Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6d469642 14-Nov-2018 Christoph Hellwig <hch@lst.de>

block: remove the lock argument to blk_alloc_queue_node

With the legacy request path gone there is no real need to override the
queue_lock.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d5337560 14-Nov-2018 Christoph Hellwig <hch@lst.de>

block: remove the unused lock argument to rq_qos_throttle

Unused now that the legacy request path is gone.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 57d74df9 14-Nov-2018 Christoph Hellwig <hch@lst.de>

block: use atomic bitops for ->queue_flags

->queue_flags is generally not set or cleared in the fast path, and also
generally set or cleared one flag at a time. Make use of the normal
atomic bitops for it so that we don't need to take the queue_lock,
which is otherwise mostly unused in the core block layer now.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 079076b3 14-Nov-2018 Christoph Hellwig <hch@lst.de>

block: remove deadline __deadline manipulation helpers

No users left since the removal of the legacy request interface, we can
remove all the magic bit stealing now and make it a normal field.

But use WRITE_ONCE/READ_ONCE on the new deadline field, given that we
don't seem to have any mechanism to guarantee a new value actually
gets seen by other threads.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9d037ad7 09-Nov-2018 Christoph Hellwig <hch@lst.de>

block: remove req->timeout_list

Unused now that the legacy request path is gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ae879912 08-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: provide a helper to check if a queue is busy

Returns true if the queue currently has requests pending,
false if not.

DM can use this to replace the atomic_inc/dec they do per device
to see if a device is busy.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7baa8572 08-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq-tag: change busy_iter_fn to return whether to continue or not

We have this functionality in sbitmap, but we don't export it in
blk-mq for users of the tags busy iteration. This can be useful
for stopping the iteration, if the caller doesn't need to find
more requests.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3110fc79 30-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: improve plug list sorting

Currently we only look at the software queue, but with support
for multiple maps, we should also look at the hardware queue.
This is important since we'll flush out the request list if
either the software queue or hardware queue don't match.

This sorts by software queue first, then hardware queue if
that differs. Finally we sort by request location like before.
This minimizes the flush points per plug list.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 67cae4c9 30-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: cleanup and improve list insertion

It's somewhat strange to have a list insertion function that
relies on the fact that the caller has mapped things correctly.
Pass in the hardware queue directly for insertion, which makes
for a much cleaner interface and implementation.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea4f995e 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: cache request hardware queue mapping

We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call as well in that function. cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 392546ae 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: separate number of hardware queues from nr_cpu_ids

With multiple maps, nr_cpu_ids is no longer the maximum number of
hardware queues we support on a given devices. The initializer of
the tag_set can have set ->nr_hw_queues larger than the available
number of CPUs, since we can exceed that with multiple queue maps.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b3c661b1 30-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: support multiple hctx maps

Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
through setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than 1
map, mq_ops->rq_flags_to_type(). The function takes request/bio
flags and CPU, and returns a queue map index for that. We then
use the type information in blk_mq_map_queue() to index the map
set.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f31967f0 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: allow software queue to map to multiple hardware queues

The mapping used to be dependent on just the CPU location, but
now it's a tuple of (type, cpu) instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

This changes the software queue count to an unsigned short
to save a bit of space. We can still support 64K-1 CPUs,
which should be enough. Add a check to catch a wrap.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f9afca4d 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: pass in request/bio flags to queue mapping

Prep patch for being able to place request based not just on
CPU location, but also on the type of request.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ed76e329 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: abstract out queue map

This is in preparation for allowing multiple sets of maps per
queue, if so desired.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a8908939 16-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: kill q->mq_map

It's just a pointer to set->mq_map, use that instead. Move the
assignment a bit earlier, so we always know it's valid.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9cf2bab6 31-Oct-2018 Jens Axboe <axboe@kernel.dk>

block: kill request ->cpu member

This was used for completion placement for the legacy path,
but for mq we have rq->mq_ctx->cpu for that. Add a helper
to get the request CPU assignment, as the mq_ctx type is
private to blk-mq.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c7bb9ad1 31-Oct-2018 Jens Axboe <axboe@kernel.dk>

block: get rid of q->softirq_done_fn()

With the legacy path gone, all we do is funnel it through the
mq_ops->complete() operation.

Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d692330 24-Oct-2018 Jens Axboe <axboe@kernel.dk>

block: get rid of blk_queued_rq()

No point in hiding what this does, just open code it in the
one spot where we are still using it.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# db6d9952 02-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: remove request_list code

It's now dead code, nobody uses it.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f9cd4bfe 01-Nov-2018 Jens Axboe <axboe@kernel.dk>

block: get rid of MQ scheduler ops union

This is a remnant of when we had ops for both SQ and MQ
schedulers. Now it's just MQ, so get rid of the union.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ba7b4434 11-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: remove legacy check in queue blk_freeze_queue()

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7ac257b8 11-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: remove the request_list usage

We don't do anything with it, that's just the legacy path.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d6f1dda2 23-Oct-2018 Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

blk-mq: place trace_block_getrq() in correct place

trace_block_getrq() is to indicate a request struct has been allocated
for queue, so put it in right place.

Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9316a9ed 15-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: provide helper for setting up an SQ queue and tag set

This pattern is repeated throughout all the blk-mq conversions.
Provide a basic helper to get it done.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e01ad46d 12-Oct-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: fallback to previous nr_hw_queues when updating fails

When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.

To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.

Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 34d11ffa 12-Oct-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: realloc hctx when hw queue is mapped to another node

When the hw queues and mq_map are updated, a hctx could be mapped
to a different numa node. At this moment, we need to realloc the
hctx. If fail to do that, go on using previous hctx.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b202853 12-Oct-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxs

blk_mq_realloc_hw_ctxs could be invoked during update hw queues.
At the momemt, IO is blocked. Change the gfp flags from GFP_KERNEL
to GFP_NOIO to avoid forever hang during memory allocation in
blk_mq_realloc_hw_ctxs.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 477e19de 12-Oct-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues

blk-mq debugfs and sysfs entries need to be removed before updating
queue map, otherwise, we get get wrong result there. This patch fixes
it and remove the redundant debugfs and sysfs register/unregister
operations during __blk_mq_update_nr_hw_queues.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 36e76539 28-Sep-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: complete req in softirq context in case of single queue

Lot of controllers may have only one irq vector for completing IO
request. And usually affinity of the only irq vector is all possible
CPUs, however, on most of ARCH, there may be only one specific CPU
for handling this interrupt.

So if all IOs are completed in hardirq context, it is inevitable to
degrade IO performance because of increased irq latency.

This patch tries to address this issue by allowing to complete request
in softirq context, like the legacy IO path.

IOPS is observed as ~13%+ in the following randread test on raid0 over
virtio-scsi.

mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Zach Marano <zmarano@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ed88660a 27-Sep-2018 Omar Sandoval <osandov@fb.com>

block: move call of scheduler's ->completed_request() hook

Commit 4bc6339a583c ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 587562d0 26-Sep-2018 Ilya Dryomov <idryomov@gmail.com>

blk-mq: I/O and timer unplugs are inverted in blktrace

trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs. schedule() unplugs are implicit and should be
reported as timer unplugs. While correct in the legacy code, this has
been inverted in blk-mq since 4.11.

Cc: stable@vger.kernel.org
Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 986d413b 26-Sep-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Enable support for runtime power management

Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bdd63160 26-Sep-2018 Bart Van Assche <bvanassche@acm.org>

block: Allow unfreezing of a queue while requests are in progress

A later patch will call blk_freeze_queue_start() followed by
blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop
to zero. Make sure that this doesn't cause a kernel warning to appear
by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The
former namely requires that the refcount it operates on is zero.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f5bbbbe4 21-Aug-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter

For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d48ece20 21-Aug-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: init hctx sched after update ctx and hctx mapping

Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d263ed99 09-Aug-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: count the hctx as active before allocating tag

Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.

To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0fc09f92 23-Jul-2018 Keith Busch <kbusch@kernel.org>

blk-mq: export setting request completion state

This is preparing for drivers that want to directly alter the state of
their requests. No functional change here.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8824f622 22-Jul-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fail the request in case issue failure

Inside blk_mq_try_issue_list_directly(), if the request is issued as
failed, we shouldn't try to do it again, otherwise the warning in
blk_mq_start_request() will be triggered. This change is aligned to
behaviour of other ways of request issue & dispatch.

Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: LKP <lkp@01.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6ce3dd6e 09-Jul-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: issue directly if hw queue isn't busy in case of 'none'

In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.

There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.

This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c1c80384 03-Jul-2018 Josef Bacik <jbacik@fb.com>

block: remove external dependency on wbt_flags

We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a7905043 03-Jul-2018 Josef Bacik <jbacik@fb.com>

blk-rq-qos: refactor out common elements of blk-wbt

blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6e768717 03-Jul-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: dequeue request one by one from sw queue if hctx is busy

It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.

This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.

Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f0cedc7 02-Jul-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: use list_splice_tail_init() to insert requests

list_splice_tail_init() is much more faster than inserting each
request one by one, given all requets in 'list' belong to
same sw queue and ctx->lock is required to insert requests.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c018c84f 30-Jun-2018 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: fix typo in a function comment

Fix typo in a function blk_mq_alloc_tag_set() comment.
if if it too large -> if it's too large.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0da73d00 02-Jul-2018 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: code clean-up by adding an API to clear set->mq_map

set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c. It's also cleared before
updating a queue map in blk_mq_update_queue_map().

This patch provides an API to clear set->mq_map to make it clear.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 97889f9a 25-Jun-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()

We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().

This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:

1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.

2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.

In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.

Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5815839b 25-Jun-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce new lock for protecting hctx->dispatch_wait

Now hctx->lock is only acquired when adding hctx->dispatch_wait to
one wait queue, but not held when removing it from the wait queue.

IO hang can be observed easily if SCHED RESTART is disabled, that means
now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().

This patch fixes the issue by introducing hctx->dispatch_wait_lock and
holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
since we need to avoid acquiring hctx->lock in irq context.

Fixes: eb619fdb2d4cb8b3d3419 ("blk-mq: fix issue with shared tag queue re-running")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2278d69f 25-Jun-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: don't pass **hctx to blk_mq_mark_tag_wait()

'hctx' won't be changed at all, so not necessary to pass
'**hctx' to blk_mq_mark_tag_wait().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ab6bb9e 25-Jun-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: cleanup blk_mq_get_driver_tag()

We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
we never change '**hctx' as well. The last use of these went away
with the flush cleanup, commit 0c2a6fe4dc3e.

So cleanup the usage and remove the two extra parameters.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1f57f8d4 28-Jun-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: don't queue more if we get a busy return

Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.

Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f5e350f0 22-Jun-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE

Make sure that RQF_TIMED_OUT is cleared when a request is reused
after a block driver timeout handler has returned BLK_EH_DONE.

Fixes: da6612673988 ("blk-mq: don't time out requests again that are in the timeout handler")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# da661267 14-Jun-2018 Christoph Hellwig <hch@lst.de>

blk-mq: don't time out requests again that are in the timeout handler

We can currently call the timeout handler again on a request that has
already been handed over to the timeout handler. Prevent that with a new
flag.

Fixes: 12f5b931 ("blk-mq: Remove generation seqeunce")
Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Tested-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 590b5b7d 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kzalloc_node() -> kcalloc_node()

The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:

kzalloc_node(a * b, gfp, node)

with:
kcalloc_node(a * b, gfp, node)

as well as handling cases of:

kzalloc_node(a * b * c, gfp, node)

with:

kzalloc_node(array3_size(a, b, c), gfp, node)

as it's slightly less ugly than:

kcalloc_node(array_size(a, b), c, gfp, node)

This does, however, attempt to ignore constant size factors like:

kzalloc_node(4 * 1024, gfp, node)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# a347c7ad 10-Jun-2018 Roman Pen <roman.penyaev@profitbricks.com>

blk-mq: reinit q->tag_set_list entry only after grace period

It is not allowed to reinit q->tag_set_list list entry while RCU grace
period has not completed yet, otherwise the following soft lockup in
blk_mq_sched_restart() happens:

[ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
[ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
[ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
[ 1064.256510] Call Trace:
[ 1064.256664] <IRQ>
[ 1064.256824] blk_mq_free_request+0xea/0x100
[ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
[ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
[ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
[ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
[ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
[ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
[ 1064.258007] irq_poll_softirq+0xb7/0xe0
[ 1064.258165] __do_softirq+0x106/0x2a2
[ 1064.258328] irq_exit+0x92/0xa0
[ 1064.258509] do_IRQ+0x4a/0xd0
[ 1064.258660] common_interrupt+0x7a/0x7a
[ 1064.258818] </IRQ>

Meanwhile another context frees other queue but with the same set of
shared tags:

[ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
[ 1288.201833] bash D 0 5910 5820 0x00000000
[ 1288.202016] Call Trace:
[ 1288.202315] schedule+0x32/0x80
[ 1288.202462] schedule_timeout+0x1e5/0x380
[ 1288.203838] wait_for_completion+0xb0/0x120
[ 1288.204137] __wait_rcu_gp+0x125/0x160
[ 1288.204287] synchronize_sched+0x6e/0x80
[ 1288.204770] blk_mq_free_queue+0x74/0xe0
[ 1288.204922] blk_cleanup_queue+0xc7/0x110
[ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
[ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
[ 1288.205548] kernfs_fop_write+0x109/0x180
[ 1288.206328] vfs_write+0xb3/0x1a0
[ 1288.206476] SyS_write+0x52/0xc0
[ 1288.206624] do_syscall_64+0x68/0x1d0
[ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

What happened is the following:

1. There are several MQ queues with shared tags.
2. One queue is about to be freed and now task is in
blk_mq_del_queue_tag_set().
3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
tag list in order to find hctx to restart.

Because linked list entry was modified in blk_mq_del_queue_tag_set()
without proper waiting for a grace period, blk_mq_sched_restart()
never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.

Fix is simple: reinit list entry after an RCU grace period elapsed.

Fixes: Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: stable@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0196d6b4 04-Jun-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: return when hctx is stopped in blk_mq_run_work_fn

If a hardware queue is stopped, it should not be run again before
explicitly started. Ignore stopped queues in blk_mq_run_work_fn(),
fixing a regression recently introduced when the START_ON_RUN bit
was removed.

Fixes: 15fe8a90bb45 ("blk-mq: remove blk_mq_delay_queue()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 131d08e1 31-May-2018 Christoph Hellwig <hch@lst.de>

block: split the blk-mq case from elevator_init

There is almost no shared logic, which leads to a very confusing code
flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# acddf3b3 31-May-2018 Christoph Hellwig <hch@lst.de>

block: move sysfs_lock into elevator_init

Both callers take just around so function call, so move it in.
Also remove the now pointless blk_mq_sched_init wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d1210d5a 29-May-2018 Christoph Hellwig <hch@lst.de>

blk-mq: simplify blk_mq_rq_timed_out

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f6e7d48a 29-May-2018 Christoph Hellwig <hch@lst.de>

block: remove BLK_EH_HANDLED

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6600593c 29-May-2018 Christoph Hellwig <hch@lst.de>

block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE

The BLK_EH_NOT_HANDLED implies nothing happen, but very often that
is not what is happening - instead the driver already completed the
command. Fix the symbolic name to reflect that a little better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 12f5b931 29-May-2018 Keith Busch <kbusch@kernel.org>

blk-mq: Remove generation seqeunce

This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional syncronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

To enables this a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it. The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
for completions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ad103e79 29-May-2018 Keith Busch <kbusch@kernel.org>

blk-mq: Fix timeout and state order

The block layer had been setting the state to in-flight prior to updating
the timer. This is the wrong order since the timeout handler could observe
the in-flight state with the older timeout, believing the request had
expired when in fact it is just getting started.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b4f6f38d 22-May-2018 huhai <huhai@kylinos.cn>

blk-mq: remove wrong 'unlikely' check

When dispatch_rq_from_ctx is called, in the vast majority of cases
the ctx->rq_list is not empty.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d416c92c 18-May-2018 huhai <huhai@kylinos.cn>

blk-mq: clear hctx->dispatch_from when mappings change

When the number of hardware queues is changed, the drivers will call
blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
the ctx mappings, but the current code doesn't clear the
->dispatch_from hint. This can result in dispatch_from pointing to
a ctx that isn't mapped to the hctx anymore.

Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>

Moved the placement of the clearing to where we clear other items
pertaining to the existing mapping, added Fixes line, and reworded
the commit message.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8fa9f556 16-May-2018 huhai <huhai@kylinos.cn>

blk-mq: remove redundant insert case in blk_mq_make_request()

We can use blk_mq_sched_insert_request() even if we don't have
an IO scheduler attached, since that case will end up being
exactly the same as what blk_mq_queue_io() was doing now.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 17a51199 09-May-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: don't call into depth limiting for reserved tags

It's not useful, they are internal and/or error handling recovery
commands.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 522a7775 09-May-2018 Omar Sandoval <osandov@fb.com>

block: consolidate struct request timestamp fields

Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4bc6339a 09-May-2018 Omar Sandoval <osandov@fb.com>

block: move blk_stat_add() to __blk_mq_end_request()

We want this next to blk_account_io_done() for the next change so that
we can call ktime_get() only once for both.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 544ccc8d 09-May-2018 Omar Sandoval <osandov@fb.com>

block: get rid of struct blk_issue_stat

struct blk_issue_stat squashes three things into one u64:

- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a8a45941 09-May-2018 Omar Sandoval <osandov@fb.com>

block: pass struct request instead of struct blk_issue_stat to wbt

issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bf0ddaba 26-Apr-2018 Omar Sandoval <osandov@fb.com>

blk-mq: fix sysfs inflight counter

When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6131837b 26-Apr-2018 Omar Sandoval <osandov@fb.com>

blk-mq: count allocated but not started requests in iostats inflight

In the legacy block case, we increment the counter right after we
allocate the request, not when the driver handles it. In both the legacy
and blk-mq cases, part_inc_in_flight() is called from
blk_account_io_start() right after we've allocated the request. blk-mq
only considers requests started requests as inflight, but this is
inconsistent with the legacy definition and the intention in the code.
This removes the started condition and instead counts all allocated
requests.

Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4412efec 24-Apr-2018 Ming Lei <ming.lei@redhat.com>

Revert "blk-mq: remove code for dealing with remapping queue"

This reverts commit 37c7c6c76d431dd7ef9c29d95f6052bd425f004c.

Turns out some drivers(most are FC drivers) may not use managed
IRQ affinity, and has their customized .map_queues meantime, so
still keep this code for avoiding regression.

Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f4560231 16-Apr-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: start request gstate with gen 1

rq->gstate and rq->aborted_gstate both are zero before rqs are
allocated. If we have a small timeout, when the timer fires,
there could be rqs that are never allocated, and also there could
be rq that has been allocated but not initialized and started. At
the moment, the rq->gstate and rq->aborted_gstate both are 0, thus
the blk_mq_terminate_expired will identify the rq is timed out and
invoke .timeout early.

For scsi, this will cause scsi_times_out to be invoked before the
scsi_cmnd is not initialized, scsi_cmnd->device is still NULL at
the moment, then we will get crash.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 37c7c6c7 08-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: remove code for dealing with remapping queue

Firstly, from commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU),
blk-mq doesn't remap queue any more after CPU topo is changed.

Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we map
all possible CPUs to hw queues, so at least one CPU is mapped to each hctx.

So queue mapping has became static and fixed just like percpu variable, and
we don't need to handle queue remapping any more.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# efea8450 08-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()

There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() returns true always now since each hctx
may be mapped by one CPU at least

2) when there isn't any online CPU mapped to this hctx, there won't
be any IO queued to this CPU, blk_mq_run_hw_queue() only runs queue
if there is IO queued to this hctx

3) If __blk_mq_delay_run_hw_queue() is called by blk_mq_delay_run_hw_queue(),
which is run from blk_mq_dispatch_rq_list() or scsi_mq_get_budget(), and
the hctx to be handled has to be mapped.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 15fe8a90 08-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: remove blk_mq_delay_queue()

No driver uses this interface any more, so remove it.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f82ddf19 08-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu

This patch introduces helper of blk_mq_hw_queue_first_cpu() for
figuring out the hctx's first cpu, and code duplication can be
avoided.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 476f8c98 08-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid to write intermediate result to hctx->next_cpu

This patch figures out the final selected CPU, then writes
it to hctx->next_cpu once, then we can avoid to intermediate
next cpu observed from other dispatch paths.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a1c735fb 08-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: make sure that correct hctx->next_cpu is set

From commit 20e4d81393196 (blk-mq: simplify queue mapping & schedule
with each possisble CPU), one hctx can be mapped from all offline CPUs,
then hctx->next_cpu can be set as wrong.

This patch fixes this issue by making hctx->next_cpu pointing to the
first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Fixes: 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0bca799b 04-Apr-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: order getting budget and driver tag

This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:

1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.

2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.

3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B

4) both two IO pathes can't move on, and IO stall is caused.

This issue can be observed when running dbench on USB storage.

This patch fixes this issue by always getting budget before getting
driver tag.

Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7dfdbc73 07-Mar-2018 Bart Van Assche <bvanassche@acm.org>

block: Protect queue flag changes with the queue lock

Since the queue flags may be changed concurrently from multiple
contexts after a queue becomes visible in sysfs, make these changes
safe by protecting these with the queue lock.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8814ce8a 07-Mar-2018 Bart Van Assche <bvanassche@acm.org>

block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()

Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f78bac2c 07-Mar-2018 Bart Van Assche <bvanassche@acm.org>

block: Use the queue_flag_*() functions instead of open-coding these

Except for changing the atomic queue flag manipulations that are
protected by the queue lock into non-atomic manipulations, this
patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5ee0524b 28-Feb-2018 Bart Van Assche <bvanassche@acm.org>

block: Add 'lock' as third argument to blk_alloc_queue_node()

This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e9a99a63 27-Feb-2018 Omar Sandoval <osandov@fb.com>

block: clear ctx pending bit under ctx lock

When we insert a request, we set the software queue pending bit while
holding the software queue lock. However, we clear it outside of the
lock, so it's possible that a concurrent insert could reset the bit
after we clear it but before we empty the request list. Afterwards, the
bit would still be set but the software queue wouldn't have any requests
in it, leading us to do a spurious run in the future. This is mostly a
benign/theoretical issue, but it makes the following change easier to
justify.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 105976f5 23-Feb-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch

__blk_mq_requeue_request() covers two cases:

- one is that the requeued request is added to hctx->dispatch, such as
blk_mq_dispatch_rq_list()

- another case is that the request is requeued to io scheduler, such as
blk_mq_requeue_request().

We should call io sched's .requeue_request callback only for the 2nd
case.

Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Omar Sandoval <osandov@fb.com>
Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 67b4110f 13-Feb-2018 Nitesh Shetty <nj.shetty@samsung.com>

blk: optimization for classic polling

This removes the dependency on interrupts to wake up task. Set task
state as TASK_RUNNING, if need_resched() returns true,
while polling for IO completion.
Earlier, polling task used to sleep, relying on interrupt to wake it up.
This made some IO take very long when interrupt-coalescing is enabled in
NVMe.

Reference:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Changes since v2->v3:
-using __set_current_state() instead of set_current_state()

Changes since v1->v2:
-setting task state once in blk_poll, instead of multiple
callers.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 86ff7c2a 30-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce BLK_STS_DEV_RESOURCE

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ae943d20 19-Jan-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

Make sure that calling blk_mq_run_hw_queue() or
blk_mq_kick_requeue_list() triggers a queue run without delay even
if blk_mq_delay_run_hw_queue() has been called recently and if its
delay has not yet expired.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c77ff7fd 19-Jan-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()

Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
However, the function name blk_mq_request_direct_issue is an exception.
Hence rename this function. This patch does not change any functionality.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 23d4ee19 17-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

If we run into blk_mq_request_direct_issue(), when queue is busy, we
don't want to dispatch this request into hctx->dispatch_list, and
what we need to do is to return the queue busy info to caller, so
that caller can deal with it well.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Reported-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9e97d295 17-Jan-2018 Mike Snitzer <snitzer@redhat.com>

blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request

After commit:

923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq")

we no longer use the 'can_block' argument in
blk_mq_sched_insert_request(). Kill it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Added actual commit message as to why it's being removed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 396eaf21 17-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO. But dm-rq currently can't get the underlying queue's
dispatch feedback at all. Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging). This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request. Whereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0f95549c 17-Jan-2018 Mike Snitzer <snitzer@redhat.com>

blk-mq: factor out a few helpers from __blk_mq_try_issue_directly

No functional change. Just makes code flow more logically.

In following commit, __blk_mq_try_issue_directly() will be used to
return the dispatch result (blk_status_t) to DM. DM needs this
information to improve IO merging.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7df938fb 17-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk

We know this WARN_ON is harmless and in reality it may be trigged,
so convert it to printk() and dump_stack() to avoid to confusing
people.

Also add comment about two releated races here.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7bed4595 17-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: make sure hctx->next_cpu is set correctly

When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.

The race can be triggered in the following two sitations:

1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.

2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.

This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bf9ae8c5 14-Jan-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init()

A previous commit moved the clearing of rq->rq_flags later,
but we may have already set RQF_MQ_INFLIGHT when that happens.
Ensure that we correctly initialize rq->rq_flags to the
right value.

This is based on an original fix by Ming, just rewritten to not
require a conditional.

Fixes: 7c3fb70f0341 ("block: rearrange a few request fields for better cache layout")
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 20e4d813 11-Jan-2018 Christoph Hellwig <hch@lst.de>

blk-mq: simplify queue mapping & schedule with each possisble CPU

The previous patch assigns interrupt vectors to all possible CPUs, so
now hctx can be mapped to possible CPUs, this patch applies this fact
to simplify queue mapping & schedule so that we don't need to handle
CPU hotplug for dealing with physical CPU plug & unplug. With this
simplication, we can work well on physical CPU plug & unplug, which
is a normal use case for VM at least.

Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
set hctx->numa_node for possible CPUs which are mapped to this hctx. And
only choose the online CPUs for schedule.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
(merged the three into one because any single one may not work, and fix
selecting online CPUs for scheduler)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c27d53fb 10-Jan-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

This patch does not change any functionality but makes the
blk_mq_mark_tag_wait() code slightly easier to read.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b7435db8 10-Jan-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()

This patch avoids that sparse reports the following:

block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected unlock
block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count at exit

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7c3fb70f 10-Jan-2018 Jens Axboe <axboe@kernel.dk>

block: rearrange a few request fields for better cache layout

Move completion related items (like the call single data) near the
end of the struct, instead of mixing them in with the initial
queueing related fields.

Move queuelist below the bio structures. Then we have all
queueing related bits in the first cache line.

This yields a 1.5-2% increase in IOPS for a null_blk test, both for
sync and for high thread count access. Sync test goes form 975K to
992K, 32-thread case from 20.8M to 21.2M IOPS.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e14575b3 10-Jan-2018 Jens Axboe <axboe@kernel.dk>

block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

We only have one atomic flag left. Instead of using an entire
unsigned long for that, steal the bottom bit of the deadline
field that we already reserved.

Remove ->atomic_flags, since it's now unused.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a72e7f4 09-Jan-2018 Jens Axboe <axboe@kernel.dk>

block: add accessors for setting/querying request deadline

We reduce the resolution of request expiry, but since we're already
using jiffies for this where resolution depends on the kernel
configuration and since the timeout resolution is coarse anyway,
that should be fine.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 76a86f9d 10-Jan-2018 Jens Axboe <axboe@kernel.dk>

block: remove REQ_ATOM_POLL_SLEPT

We don't need this to be an atomic flag, it can be a regular
flag. We either end up on the same CPU for the polling, in which
case the state is sane, or we did the sleep which would imply
the needed barrier to ensure we see the right state.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fcd36c36 10-Jan-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Explain when 'active_queues' is decremented

It is nontrivial to derive from the blk-mq source code when
blk_mq_tags.active_queues is decremented. Hence add a comment that
explains this.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ee3e4de5 09-Jan-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Fix spelling in a source code comment

Change "nedeing" into "needing" and "caes" into "cases".

Fixes: commit f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 08b5a6e2 09-Jan-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: silence false positive warnings in hctx_unlock()

In some stupider versions of gcc, it complains:

block/blk-mq.c: In function ‘blk_mq_complete_request’:
./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
__srcu_read_unlock(sp, idx);
^
block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here
int srcu_idx;
^

which is completely bogus, since we only use srcu_idx when
hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where
hctx_lock() has initialized it.

Just set it to '0' in the normal path in hctx_lock() to silence
this annoying warning.

Fixes: 04ced159cec8 ("blk-mq: move hctx lock/unlock into a helper")
Fixes: 5197c05e16b4 ("blk-mq: protect completion path with RCU")
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 05707b64 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu

The RCU protection has been expanded to cover both queueing and
completion paths making ->queue_rq_srcu a misnomer. Rename it to
->srcu as suggested by Bart.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a61c363 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: remove REQ_ATOM_STARTED

After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state to distinguish completed but not yet freed
state.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
left and is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 634f9e46 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times. This removes atomic bitops from hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 358f70da 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: make blk_abort_request() trigger timeout path

With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path. This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Note that this makes blk_abort_request() asynchronous - it initiates
abortion but the actual termination will happen after a short while,
even when the caller owns the request. AFAICS, SCSI and ATA should be
fine with that and I think mtip32xx and dasd should be safe but not
completely sure. It'd be great if people who know the drivers take a
look.

v2: - Add comment explaining the lack of synchronization around
->deadline update as requested by Bart.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi SP <asamymuthupa@micron.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 67818d25 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE

blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state. Both uses are
speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
equivalent results. Replace the tests. This will allow removing
REQ_ATOM_COMPLETE usages from blk-mq.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1d9bd516 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: replace timeout synchronization with a RCU and generation based scheme

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.

Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.

2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.

By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5197c05e 09-Jan-2018 Tejun Heo <tj@kernel.org>

blk-mq: protect completion path with RCU

Currently, blk-mq protects only the issue path with RCU. This patch
puts the completion path under the same RCU protection. This will be
used to synchronize issue/completion against timeout by later patches,
which will also add the comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 04ced159 09-Jan-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: move hctx lock/unlock into a helper

Move the RCU vs SRCU logic into lock/unlock helpers, which makes
the actual functional bits within the locked region much easier
to read.

tj: Reordered in front of timeout revamp patches and added the missing
blk_mq_run_hw_queue() conversion.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ab0b7dc 09-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fix kernel oops in blk_mq_tag_idle()

HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
then we need to check it before calling blk_mq_tag_idle(), otherwise
the following kernel oops can be triggered, so fix it by checking if
the hw queue is unmapped since it doesn't make sense to idle the tags
any more after hw queues are unmapped.

[ 440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
[ 440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
[ 440.788359] RIP: 0010:[<ffffffffb730e2b4>] [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
[ 440.798697] RSP: 0018:ffff893bf9bcbd10 EFLAGS: 00010286
[ 440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
[ 440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
[ 440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
[ 440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
[ 440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
[ 440.849988] FS: 0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
[ 440.859955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
[ 440.876169] Call Trace:
[ 440.879818] [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
[ 440.887051] [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
[ 440.894465] [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
[ 440.901881] [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
[ 440.910068] [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
[ 440.919026] [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
[ 440.928079] [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
[ 440.937126] [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
[ 440.944517] [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
[ 440.951607] [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 440.959760] [<ffffffffb70b352f>] kthread+0xcf/0xe0
[ 440.966055] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
[ 440.973715] [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
[ 440.980586] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
[ 440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
[ 441.011620] RIP [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
[ 441.019301] RSP <ffff893bf9bcbd10>
[ 441.024052] CR2: 0000000000000008

Reported-by: Zhang Yi <yizhan@redhat.com>
Tested-by: Zhang Yi <yizhan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fb350e0a 06-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fix race between updating nr_hw_queues and switching io sched

In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.

This patch fixes the race be holding q->sysfs_lock.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d4901a9 06-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid to map CPU into stale hw queue

blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
previous map isn't cleared yet, and may point to one stale hw queue
index.

This patch fixes the following issue by clearing the mapping table before
setting it up in blk_mq_pci_map_queues().

This patches fixes this following issue reported by Zhang Yi:

[ 101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
[ 101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
[ 101.216346] PGD 0 P4D 0
[ 101.219171] Oops: 0000 [#1] SMP
[ 101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
[ 101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[ 101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
[ 101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
[ 101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
[ 101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
[ 101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
[ 101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
[ 101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
[ 101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
[ 101.364314] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[ 101.373342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
[ 101.387714] Call Trace:
[ 101.390445] blk_mq_update_nr_hw_queues+0xbf/0x130
[ 101.395791] nvme_reset_work+0x6f4/0xc06 [nvme]
[ 101.400848] ? pick_next_task_fair+0x290/0x5f0
[ 101.405807] ? __switch_to+0x1f5/0x430
[ 101.409988] ? put_prev_entity+0x2f/0xd0
[ 101.414365] process_one_work+0x141/0x340
[ 101.418836] worker_thread+0x47/0x3e0
[ 101.422921] kthread+0xf5/0x130
[ 101.426424] ? rescuer_thread+0x380/0x380
[ 101.430896] ? kthread_associate_blkcg+0x90/0x90
[ 101.436048] ret_from_fork+0x1f/0x30
[ 101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
[ 101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
[ 101.468205] CR2: 0000000000000098
[ 101.471907] ---[ end trace 5fe710f98228a3ca ]---
[ 101.482489] Kernel panic - not syncing: Fatal exception
[ 101.488505] Kernel Offset: disabled
[ 101.497752] ---[ end Kernel panic - not syncing: Fatal exception

Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 24f5a90f 06-Jan-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: quiesce queue during switching io sched and updating nr_requests

Dispatch may still be in-progress after queue is frozen, so we have to
quiesce queue before switching IO scheduler and updating nr_requests.

Also when switching io schedulers, blk_mq_run_hw_queue() may still be
called somewhere(such as from nvme_reset_work()), and io scheduler's
per-hctx data may not be setup yet, so cause oops even inside
blk_mq_hctx_has_pending(), such as it can be run just between:

ret = e->ops.mq.init_sched(q, e);
AND
ret = e->ops.mq.init_hctx(hctx, i)

inside blk_mq_init_sched().

This reverts commit 7a148c2fcff8330(block: don't call blk_mq_quiesce_queue()
after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending
won't be called if queue is quiesced.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 7a148c2fcff83309(block: don't call blk_mq_quiesce_queue() after queue is frozen)
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 454be724 29-Nov-2017 Ming Lei <ming.lei@redhat.com>

block: drain queue before waiting for q_usage_counter becoming zero

Now we track legacy requests with .q_usage_counter in commit 055f6e18e08f
("block: Make q_usage_counter also track legacy requests"), but that
commit never runs and drains legacy queue before waiting for this counter
becoming zero, then IO hang is caused in the test of pulling disk during IO.

This patch fixes the issue by draining requests before waiting for
q_usage_counter becoming zero, both Mauricio and chenxiang reported this
issue, and observed that it can be fixed by this patch.

Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
Fixes: 055f6e18e08f("block: Make q_usage_counter also track legacy requests")
Cc: Wen Xiong <wenxiong@us.ibm.com>
Tested-by: "chenxiang (M)" <chenxiang66@hisilicon.com>
Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d904bfa7 15-Nov-2017 Johannes Thumshirn <jthumshirn@suse.de>

block/blk-mq.c: use kmalloc_array_node()

Now that we have a NUMA-aware version of kmalloc_array() we can use it
instead of kmalloc_node() without an overflow check in the size
calculation.

Link: http://lkml.kernel.org/r/20170927082038.3782-3-jthumshirn@suse.de
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Mike Marciniszyn <infinipath@intel.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff821d27 10-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: fixup some comment typos and lengths

Various typos and/or spelling errors in comments. Fixes a few > 80 char
lines as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f906a6a0 09-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: improve tag waiting setup for non-shared tags

If we run out of driver tags, we currently treat shared and non-shared
tags the same - both cases hook into the tag waitqueue. This is a bit
more costly than it needs to be on unshared tags, since we have to both
grab the hctx lock, and the waitqueue lock (and disable interrupts).
For the non-shared case, we can simply mark the queue as needing a
restart.

Split blk_mq_dispatch_wait_add() to account for both cases, and
rename it to blk_mq_mark_tag_wait() to better reflect what it
does now.

Without this patch, shared and non-shared performance is about the same
with 4 fio thread hammering on a single null_blk device (~410K, at 75%
sys). With the patch, the shared case is the same, but the non-shared
tags case runs at 431K at 71% sys.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 79f720a7 10-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: only run the hardware queue if IO is pending

Currently we are inconsistent in when we decide to run the queue. Using
blk_mq_run_hw_queues() we check if the hctx has pending IO before
running it, but we don't do that from the individual queue run function,
blk_mq_run_hw_queue(). This results in a lot of extra and pointless
queue runs, potentially, on flush requests and (much worse) on tag
starvation situations. This is observable just looking at top output,
with lots of kworkers active. For the !async runs, it just adds to the
CPU overhead of blk-mq.

Move the has-pending check into the run function instead of having
callers do it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9a95e4ef 09-Nov-2017 Bart Van Assche <bvanassche@acm.org>

block, nvme: Introduce blk_mq_req_flags_t

Several block layer and NVMe core functions accept a combination
of BLK_MQ_REQ_* flags through the 'flags' argument but there is
no verification at compile time whether the right type of block
layer flags is passed. Make it possible for sparse to verify this.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: linux-nvme@lists.infradead.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3a0a5299 09-Nov-2017 Bart Van Assche <bvanassche@acm.org>

block, scsi: Make SCSI quiesce and resume work reliably

The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why while
the hibernation image is being written or restored that SCSI devices
are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
to be submitted by deferring allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after each resume the
fio job was still running:

for ((i=0; i<10; i++)); do
(
cd /sys/block/md0/md &&
while true; do
[ "$(<sync_action)" = "idle" ] && echo check > sync_action
sleep 1
done
) &
pids=($!)
for d in /sys/class/block/sd*[a-z]; do
bdev=${d#/sys/class/block/}
hcil=$(readlink "$d/device")
hcil=${hcil#../../../}
echo 4 > "$d/queue/nr_requests"
echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
--rw=randread --ioengine=libaio --numjobs=4 --iodepth=16 \
--iodepth_batch=1 --thread --loops=$((2**31)) &
pids+=($!)
done
sleep 1
echo "$(date) Hibernating ..." >>hibernate-test-log.txt
systemctl hibernate
sleep 10
kill "${pids[@]}"
echo idle > /sys/block/md0/md/sync_action
wait
echo "$(date) Done." >>hibernate-test-log.txt
done

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1b6d65a0 09-Nov-2017 Bart Van Assche <bvanassche@acm.org>

block: Introduce BLK_MQ_REQ_PREEMPT

Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 055f6e18 09-Nov-2017 Ming Lei <ming.lei@redhat.com>

block: Make q_usage_counter also track legacy requests

This patch makes it possible to pause request allocation for
the legacy block layer by calling blk_mq_freeze_queue() and
blk_mq_unfreeze_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[ bvanassche: Combined two patches into one, edited a comment and made sure
REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# eb619fdb 09-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: fix issue with shared tag queue re-running

This patch attempts to make the case of hctx re-running on driver tag
failure more robust. Without this patch, it's pretty easy to trigger a
stall condition with shared tags. An example is using null_blk like
this:

modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1

which sets up 4 devices, sharing the same tag set with a depth of 1.
Running a fio job ala:

[global]
bs=4k
rw=randread
norandommap
direct=1
ioengine=libaio
iodepth=4

[nullb0]
filename=/dev/nullb0
[nullb1]
filename=/dev/nullb1
[nullb2]
filename=/dev/nullb2
[nullb3]
filename=/dev/nullb3

will inevitably end with one or more threads being stuck waiting for a
scheduler tag. That IO is then stuck forever, until someone else
triggers a run of the queue.

Ensure that we always re-run the hardware queue, if the driver tag we
were waiting for got freed before we added our leftover request entries
back on the dispatch list.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0c6af1cc 07-Nov-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: put driver tag if dispatch budget can't be got

We have to put the driver tag if dispatch budget can't be got, otherwise
it might cause IO deadlock, especially in case that size of tags is very
small.

Fixes: de1482974080(blk-mq: introduce .get_budget and .put_budget in blk_mq_ops)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 923218f6 02-Nov-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: don't allocate driver tag upfront for flush rq

The idea behind it is simple:

1) for none scheduler, driver tag has to be borrowed for flush rq,
otherwise we may run out of tag, and that causes an IO hang. And
get/put driver tag is actually noop for none, so reordering tags
isn't necessary at all.

2) for a real I/O scheduler, we need not allocate a driver tag upfront
for flush rq. It works just fine to follow the same approach as
normal requests: allocate driver tag for each rq just before calling
->queue_rq().

One driver visible change is that the driver tag isn't shared in the
flush request sequence. That won't be a problem, since we always do that
in legacy path.

Then flush rq need not be treated specially wrt. get/put driver tag.
This cleans up the code - for instance, reorder_tags_to_front() can be
removed, and we needn't worry about request ordering in dispatch list
for avoiding I/O deadlock.

Also we have to put the driver tag before requeueing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 244c65a3 04-Nov-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: move blk_mq_put_driver_tag*() into blk-mq.h

We need this helper to put the driver tag for flush rq, since we will
not share tag in the flush request sequence in the following patch
in case that I/O scheduler is applied.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b0850297 02-Nov-2017 Ming Lei <ming.lei@redhat.com>

block: pass 'run_queue' to blk_mq_request_bypass_insert

Block flush need this function without running the queue, so add a
parameter controlling whether we run it or not.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6d6f167c 02-Nov-2017 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: put the driver tag of nxt rq before first one is requeued

When freeing the driver tag of the next rq with an I/O scheduler
configured, we get the first entry of the list. However, this can
race with requeue of a request, and we end up getting the wrong request
from the head of the list. Free the driver tag of next rq before the
failed one is requeued in the failure branch of queue_rq callback.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 88022d72 04-Nov-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: don't handle failure in .get_budget

It is enough to just check if we can get the budget via .get_budget().
And we don't need to deal with device state change in .get_budget().

For SCSI, one issue to be fixed is that we have to call
scsi_mq_uninit_cmd() to free allocated ressources if SCSI device fails
to handle the request. And it isn't enough to simply call
blk_mq_end_request() to do that if this request is marked as
RQF_DONTPREP.

Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 21e768b4 16-Oct-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Make blk_mq_get_request() error path less confusing

blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
ignores its argument, this patch does not change any functionality.

References: commit 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c2e82a23 22-Sep-2017 weiping zhang <zhangweiping@didichuxing.com>

blk-mq: fix nr_requests wrong value when modify it from sysfs

if blk-mq use "none" io scheduler, nr_request get a wrong value when
input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
the smaller one min(nr, set->queue_depth), and then q->nr_request get a
wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/scheduler
echo 1000000 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
1000000

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea435e1b 02-Nov-2017 Christoph Hellwig <hch@lst.de>

block: add a poll_fn callback to struct request_queue

That we we can also poll non blk-mq queues. Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1f460b63 26-Oct-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE

SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

Especailly any request won't be dequeued in this case, we needn't to
worry about IO hang caused by restart vs. dispatch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b347689f 14-Oct-2017 Ming Lei <ming.lei@redhat.com>

blk-mq-sched: improve dispatching from sw queue

SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# de148297 14-Oct-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce .get_budget and .put_budget in blk_mq_ops

For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 85acb3ba 06-Oct-2017 Shaohua Li <shli@fb.com>

block: set request_list for request

Legacy queue sets request's request_list, mq doesn't. This makes mq does
the same thing, so we can find cgroup of a request. Note, we really
only use blkg field of request_list, it's pointless to allocate mempool
for request_list in mq case.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fc13457f 04-Oct-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: document the need to have STARTED and COMPLETED share a byte

For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a7af0af3 06-Sep-2017 Peter Zijlstra <peterz@infradead.org>

blk-mq: attempt to fix atomic flag memory ordering

Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

Also down-grade the barrier to smp_wmb(), this is cheaper for
PowerPC/ARM and doesn't cost anything extra on x86.

And it documents the STARTING vs COMPLETE ordering. Although I've not
been entirely successful in reverse engineering the blk-mq state
machine so there might still be more funnies around timeout vs
requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7beb2f84 30-Sep-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: wire up completion notifier for laptop mode

For some reason, the laptop mode IO completion notified was never wired
up for blk-mq. Ensure that we trigger the callback appropriately, to arm
the laptop mode flush timer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 54724873 30-Sep-2017 weiping zhang <zhangweiping@didichuxing.com>

blk-mq: remove unused function hctx_allow_merges

since 9bddeb2a5b981 "blk-mq: make per-sw-queue bio merge as default .bio_merge"
there is no caller for this function.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 157f377b 11-Sep-2017 Jens Axboe <axboe@kernel.dk>

block: directly insert blk-mq request from blk_insert_cloned_request()

A NULL pointer crash was reported for the case of having the BFQ IO
scheduler attached to the underlying blk-mq paths of a DM multipath
device. The crash occured in blk_mq_sched_insert_request()'s call to
e->type->ops.mq.insert_requests().

Paolo Valente correctly summarized why the crash occured with:
"the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
blk_rq_prep_clone) creates a cloned request without invoking
e->type->ops.mq.prepare_request for the target elevator e. The cloned
request is therefore not initialized for the scheduler, but it is
however inserted into the scheduler by blk_mq_sched_insert_request."

All said, a request-based DM multipath device's IO scheduler should be
the only one used -- when the original requests are issued to the
underlying paths as cloned requests they are inserted directly in the
underlying dispatch queue(s) rather than through an additional elevator.

But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
schedulers") switched blk_insert_cloned_request() from using
blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
incorrectly added elevator machinery into a call chain that isn't
supposed to have any.

To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
that blk_insert_cloned_request() calls to insert the request without
involving any elevator that may be attached to the cloned request's
request_queue.

Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: stable@vger.kernel.org
Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d606219 17-Aug-2017 Bart Van Assche <bvanassche@acm.org>

block: Fix two comments that refer to .queue_rq() return values

Since patch "blk-mq: switch .queue_rq return value to blk_status_t"
.queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_*
value. Hence refer to the former in comments about .queue_rq()
return values.

Fixes: commit 39a70c76b89b ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3280d66a 14-Aug-2017 Keith Busch <kbusch@kernel.org>

blk-mq: Fix queue usage on failed request allocation

blk_mq_get_request() does not release the callers queue usage counter
when allocation fails. The caller still needs to account for its own
queue usage when it is unable to allocate a request.

Fixes: 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")

Reported-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d4acf365 09-Aug-2017 Bart Van Assche <bvanassche@acm.org>

block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time

The blk_mq_delay_kick_requeue_list() function is used by the device
mapper and only by the device mapper to rerun the queue and requeue
list after a delay. This function is called once per request that
gets requeued. Modify this function such that the queue is run once
per path change event instead of once per request that is requeued.

Fixes: commit 2849450ad39d ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b8d62b3a 08-Aug-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: enable checking two part inflight counts at the same time

Modify blk_mq_in_flight() to count both a partition and root at
the same time. Then we only have to call it once, instead of
potentially looping the tags twice.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f299b7c7 08-Aug-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: provide internal in-flight variant

We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1ad43c00 01-Aug-2017 Ming Lei <minlei@redhat.com>

blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed

When blk_mq_get_request() failed, preempt counter isn't
released, and blk_mq_make_request() doesn't release the counter
too.

This patch fixes the issue, and makes sure that preempt counter
is only held if rq is allocated successfully. The same policy is
applied on .q_usage_counter too.

Signed-off-by: Ming Lei <minlei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b7a71e66 01-Aug-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabled

We recently had a bug in the IPR SCSI driver, where it would end up
making the SCSI mid layer run the mq hardware queue with interrupts
disabled. This isn't legal, since the software queue locking relies
on never being grabbed from interrupt context. Additionally, drivers
that set BLK_MQ_F_BLOCKING may schedule from this context.

Add a WARN_ON_ONCE() to catch bad users up front.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 18e9781d 27-Jul-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: blk_mq_requeue_work() doesn't need to save IRQ flags

We know we're in process context, so don't bother using the
IRQ safe versions of the spin lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e23947bd 29-Jun-2017 Dmitry Monakhov <dmonakhov@openvz.org>

bio-integrity: fold bio_integrity_enabled to bio_integrity_prep

Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.

In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: merged with the latest block tree,
return bool from bio_integrity_prep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4b855ad3 25-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: Create hctx for each present CPU

Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations. Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 46685d1a 19-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: don't bounce by default

For historical reasons we default to bouncing highmem pages for all block
queues. But the blk-mq drivers are easy to audit to ensure that we don't
need this - scsi and mtip32xx set explicit limits and everyone else doesn't
have any particular ones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0b0bcacc 19-Jun-2017 Christoph Hellwig <hch@lst.de>

block: don't bother with bounce limits for make_request drivers

We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a9590fe1 22-Jun-2017 weiping <zhangweiping@didichuxing.com>

blk-mq: remove double set queue_num

hwctx's queue_num has been set prior call blk_mq_init_hctx, so no need
set it again.

Signed-off-by: weiping <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 852ec809 21-Jun-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Make it safe to quiesce and unquiesce from an interrupt handler

Since blk_mq_quiesce_queue_nowait() can be called from interrupt
context, make this safe. Since this function is not in the hot
path, uninline it.

Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8e8320c9 20-Jun-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: fix performance regression with shared tags

If we have shared tags enabled, then every IO completion will trigger
a full loop of every queue belonging to a tag set, and every hardware
queue for each of those queues, even if nothing needs to be done.
This causes a massive performance regression if you have a lot of
shared devices.

Instead of doing this huge full scan on every IO, add an atomic
counter to the main queue that tracks how many hardware queues have
been marked as needing a restart. With that, we can avoid looking for
restartable queues, if we don't have to.

Max reports that this restores performance. Before this patch, 4K
IOPS was limited to 22-23K IOPS. With the patch, we are running at
950-970K IOPS.

Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
Reported-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5435c023 20-Jun-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Warn when attempting to run a hardware queue that is not mapped

A queue must be frozen while the mapped state of a hardware queue
is changed. Additionally, any change of the mapped state is
followed by a call to blk_mq_map_swqueue() (see also
blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
Since blk_mq_map_swqueue() does not map any unmapped hardware
queue onto any software queue, no attempt will be made to run
an unmapped hardware queue. Hence issue a warning upon attempts
to run an unmapped hardware queue.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b607814 20-Jun-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Document locking assumptions

Document the locking assumptions in functions that modify
blk_mq_ctx.rq_list to make it easier for humans to verify
this code.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c3a148d2 20-Jun-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Initialize .rq_flags in blk_mq_rq_ctx_init()

Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
is called after a value has been assigned to .rq_flags and .rq_flags
is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
Moving the initialization of .rq_flags is fine because all changes
and tests of .rq_flags occur between blk_get_request() and finishing
a request.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cd6ce148 20-Jun-2017 Bart Van Assche <bvanassche@acm.org>

block: Make request operation type argument declarations consistent

Instead of declaring the second argument of blk_*_get_request()
as int and passing it to functions that expect an unsigned int,
declare that second argument as unsigned int. Also because of
consistency, rename that second argument from 'rw' into 'op'.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 07319678 20-Jun-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Reduce blk_mq_hw_ctx size

Since the srcu structure is rather large (184 bytes on an x86-64
system with kernel debugging disabled), only allocate it if needed.

Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 03a07c92 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

block: return on congested block device

A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
currently are not supported because it may block for housekeeping.
For example, an md can have a part of the device suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2055da97 19-Jun-2017 Ingo Molnar <mingo@kernel.org>

sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming

So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry

For example, this code:

rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ac6424b9 19-Jun-2017 Ingo Molnar <mingo@kernel.org>

sched/wait: Rename wait_queue_t => wait_queue_entry_t

Rename:

wait_queue_t => wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 641a9ed6 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

Revert "blk-mq: don't use sync workqueue flushing from drivers"

This patch reverts commit 2719aa217e0d02(blk-mq: don't use
sync workqueue flushing from drivers) because only
blk_mq_quiesce_queue() need the sync flush, and now
we don't need to stop queue any more, so revert it.

Also changes to cancel_delayed_work() in blk_mq_stop_hw_queue().

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 39a70c76 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: clarify dispatch may not be drained/blocked by stopping queue

BLK_MQ_S_STOPPED may not be observed in other concurrent I/O paths,
we can't guarantee that dispatching won't happen after returning
from the APIs of stopping queue.

So clarify the fact and avoid potential misuse.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1d9e9bc6 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: don't stop queue for quiescing

Queue can be started by other blk-mq APIs and can be used in
different cases, this limits uses of blk_mq_quiesce_queue()
if it is based on stopping queue, and make its usage very
difficult, especially users have to use the stop queue APIs
carefully for avoiding to break blk_mq_quiesce_queue().

We have applied the QUIESCED flag for draining and blocking
dispatch, so it isn't necessary to stop queue any more.

After stopping queue is removed, blk_mq_quiesce_queue() can
be used safely and easily, then users won't worry about queue
restarting during quiescing at all.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 69e07c4a 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: update comments on blk_mq_quiesce_queue()

Actually what we want to get from blk_mq_quiesce_queue()
isn't only to wait for completion of all ongoing .queue_rq().

In the typical context of canceling requests, we need to
make sure that the following is done in the dispatch path
before starting to cancel requests:

- failed dispatched request is finished
- busy dispatched request is requeued, and the STARTED
flag is cleared

So update comment to keep code, doc and our expection consistent.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f4560ffe 18-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue

It is required that no dispatch can happen any more once
blk_mq_quiesce_queue() returns, and we don't have such requirement
on APIs of stopping queue.

But blk_mq_quiesce_queue() still may not block/drain dispatch in the
the case of BLK_MQ_S_START_ON_RUN, so use the new introduced flag of
QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical
sections for fixing this issue.

Also blk_mq_quiesce_queue() is implemented via stopping queue, which
limits its uses, and easy to cause race, because any queue restart in
other paths may break blk_mq_quiesce_queue(). With the introduced
flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue
for quiescing any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e4e73913 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce blk_mq_unquiesce_queue

blk_mq_start_stopped_hw_queues() is used implictly
as counterpart of blk_mq_quiesce_queue() for unquiescing queue,
so we introduce blk_mq_unquiesce_queue() and make it
as counterpart of blk_mq_quiesce_queue() explicitly.

This function is for improving the current quiescing mechanism
in the following patches.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# af67c31f 17-Jun-2017 NeilBrown <neilb@suse.com>

blk: remove bio_set arg from blk_queue_split()

blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier GonzĂ¡lez <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier GonzĂ¡lez <javier@cnexlabs.com>
Tested-by: Javier GonzĂ¡lez <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e4cdf1a1 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: remove __blk_mq_alloc_request

Move most code into blk_mq_rq_ctx_init, and the rest into
blk_mq_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5bbf4e5a 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq-sched: unify request prepare methods

This patch makes sure we always allocate requests in the core blk-mq
code and use a common prepare_request method to initialize them for
both mq I/O schedulers. For Kyber and additional limit_depth method
is added that is called before allocating the request.

Also because none of the intializations can really fail the new method
does not return an error - instead the bfq finish method is hardened
to deal with the no-IOC case.

Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
scheduling code as RQF_ELFPRIV is all that is needed now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 44e8c2bf 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: refactor blk_mq_sched_assign_ioc

blk_mq_sched_assign_ioc now only handles the assigned of the ioc if
the schedule needs it (bfq only at the moment). The caller to the
per-request initializer is moved out so that it can be merged with
a similar call for the kyber I/O scheduler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 037cebb8 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: streamline blk_mq_get_request

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6af54051 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: simplify blk_mq_free_request

Merge three functions only tail-called by blk_mq_free_request into
blk_mq_free_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7b9e9361 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq-sched: unify request finished methods

No need to have two different callouts of bfq vs kyber.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea511e3c 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_sched_{get,put}_rq_priv

Having these as separate helpers in a header really does not help
readability, or my chances to refactor this code sanely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d2c0d383 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: move blk_mq_sched_{get,put}_request to blk-mq.c

Having them out of line in blk-mq-sched.c just makes the code flow
unnecessarily complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6e15cf2a 16-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: mark blk_mq_rq_ctx_init static

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f06345ad 12-Jun-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: fixup type of 'ret' in __blk_mq_try_issue_directly()

Should be a blk_status_t type, not an integer.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fc17b653 03-Jun-2017 Christoph Hellwig <hch@lst.de>

blk-mq: switch ->queue_rq return value to blk_status_t

Use the same values for use for request completion errors as the return
value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2a842aca 03-Jun-2017 Christoph Hellwig <hch@lst.de>

block: introduce new block status code type

Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.

For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.

blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d964f04a 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: fix direct issue

If queue is stopped, we shouldn't dispatch request into driver and
hardware, unfortunately the check is removed in bd166ef183c2(blk-mq-sched:
add framework for MQ capable IO schedulers).

This patch fixes the issue by moving the check back into
__blk_mq_try_issue_directly().

This patch fixes request use-after-free[1][2] during canceling requets
of NVMe in nvme_dev_disable(), which can be triggered easily during
NVMe reset & remove test.

[1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
[ 103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[ 103.412980] IP: bio_integrity_advance+0x48/0xf0
[ 103.412981] PGD 275a88067
[ 103.412981] P4D 275a88067
[ 103.412982] PUD 276c43067
[ 103.412983] PMD 0
[ 103.412984]
[ 103.412986] Oops: 0000 [#1] SMP
[ 103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
[ 103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[ 103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
[ 103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
[ 103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
[ 103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
[ 103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
[ 103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
[ 103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
[ 103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
[ 103.413053] FS: 0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
[ 103.413054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
[ 103.413056] Call Trace:
[ 103.413063] bio_advance+0x2a/0xe0
[ 103.413067] blk_update_request+0x76/0x330
[ 103.413072] blk_mq_end_request+0x1a/0x70
[ 103.413074] blk_mq_dispatch_rq_list+0x370/0x410
[ 103.413076] ? blk_mq_flush_busy_ctxs+0x94/0xe0
[ 103.413080] blk_mq_sched_dispatch_requests+0x173/0x1a0
[ 103.413083] __blk_mq_run_hw_queue+0x8e/0xa0
[ 103.413085] __blk_mq_delay_run_hw_queue+0x9d/0xa0
[ 103.413088] blk_mq_start_hw_queue+0x17/0x20
[ 103.413090] blk_mq_start_hw_queues+0x32/0x50
[ 103.413095] nvme_kill_queues+0x54/0x80 [nvme_core]
[ 103.413097] nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
[ 103.413103] process_one_work+0x149/0x360
[ 103.413105] worker_thread+0x4d/0x3c0
[ 103.413109] kthread+0x109/0x140
[ 103.413111] ? rescuer_thread+0x380/0x380
[ 103.413113] ? kthread_park+0x60/0x60
[ 103.413120] ret_from_fork+0x2c/0x40
[ 103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
[ 103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
[ 103.413146] CR2: 000000000000000a
[ 103.413157] ---[ end trace cd6875d16eb5a11e ]---
[ 103.455368] Kernel panic - not syncing: Fatal exception
[ 103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 103.850916] ---[ end Kernel panic - not syncing: Fatal exception
[ 103.857637] sched: Unexpected reschedule of offline CPU#1!
[ 103.863762] ------------[ cut here ]------------

[2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
[ 247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
[ 247.137311] Not tainted 4.12.0-rc2.upstream+ #4
[ 247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.151704] Call Trace:
[ 247.154445] __schedule+0x28a/0x880
[ 247.158341] schedule+0x36/0x80
[ 247.161850] blk_mq_freeze_queue_wait+0x4b/0xb0
[ 247.166913] ? remove_wait_queue+0x60/0x60
[ 247.171485] blk_freeze_queue+0x1a/0x20
[ 247.175770] blk_cleanup_queue+0x7f/0x140
[ 247.180252] nvme_ns_remove+0xa3/0xb0 [nvme_core]
[ 247.185503] nvme_remove_namespaces+0x32/0x50 [nvme_core]
[ 247.191532] nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
[ 247.196977] nvme_remove+0x70/0x110 [nvme]
[ 247.201545] pci_device_remove+0x39/0xc0
[ 247.205927] device_release_driver_internal+0x141/0x200
[ 247.211761] device_release_driver+0x12/0x20
[ 247.216531] pci_stop_bus_device+0x8c/0xa0
[ 247.221104] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 247.227420] remove_store+0x7c/0x90
[ 247.231320] dev_attr_store+0x18/0x30
[ 247.235409] sysfs_kf_write+0x3a/0x50
[ 247.239497] kernfs_fop_write+0xff/0x180
[ 247.243867] __vfs_write+0x37/0x160
[ 247.247757] ? selinux_file_permission+0xe5/0x120
[ 247.253011] ? security_file_permission+0x3b/0xc0
[ 247.258260] vfs_write+0xb2/0x1b0
[ 247.261964] ? syscall_trace_enter+0x1d0/0x2b0
[ 247.266924] SyS_write+0x55/0xc0
[ 247.270540] do_syscall_64+0x67/0x150
[ 247.274636] entry_SYSCALL64_slow_path+0x25/0x25
[ 247.279794] RIP: 0033:0x7f5c96740840
[ 247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
[ 247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
[ 247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
[ 247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
[ 247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[ 370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.

Fixes: 12d70958a2e8(blk-mq: don't fail allocating driver tag for stopped hw queue)
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# dad7a3be 06-Jun-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: pass correct hctx to blk_mq_try_issue_directly

When direct issue is done on request picked up from plug list,
the hctx need to be updated with the actual hw queue, otherwise
wrong hctx is used and may hurt performance, especially when
wrong SRCU readlock is acquired/released

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e4dc2b32 30-May-2017 Keith Busch <kbusch@kernel.org>

blk-mq: Take tagset lock when updating hw queues

The tagset lock needs to be held when iterating the tag_list, so a
lockdep assert was added when updating number of hardware queues. The
drivers calling this API, however, were unaware of the new requirement,
so are failing the assertion.

This patch takes the lock within the blk-mq function so the drivers do
not have to be modified in order to be safe.

Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9bddeb2a 26-May-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: make per-sw-queue bio merge as default .bio_merge

Because what the per-sw-queue bio merge does is basically same with
scheduler's .bio_merge(), this patch makes per-sw-queue bio merge
as the default .bio_merge if no scheduler is used or io scheduler
doesn't provide .bio_merge().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ab42f35d 26-May-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: merge bio into sw queue before plugging

Before blk-mq is introduced, I/O is merged to elevator
before being putted into plug queue, but blk-mq changed the
order and makes merging to sw queue basically impossible.
Then it is observed that throughput of sequential I/O is degraded
about 10%~20% on virtio-blk in the test[1] if mq-deadline isn't used.

This patch moves the bio merging per sw queue before plugging,
like what blk_queue_bio() does, and the performance regression is
fixed under this situation.

[1]. test script:
sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json

RW=read or write

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7254a50a 22-May-2017 Ming Lei <ming.lei@redhat.com>

blk-mq: remove blk_mq_abort_requeue_list()

No one uses it any more, so remove it.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# f36ea50c 10-May-2017 Wen Xiong <wenxiong@linux.vnet.ibm.com>

blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op

When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.

Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error

We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.

1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires..
Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828

With the fix, It showed us both of the first op and the second op have
correct bi and pi address.

1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605
Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
0x00003028

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ebd76857 08-May-2017 Colin Ian King <colin.king@canonical.com>

blk-mq: make __blk_mq_stop_hw_queues static

Making __blk_mq_stop_hw_queues static fixes sparse warning:

block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
declared. Should it be static?

Fixes: 2719aa217e0d0 ("blk-mq: don't use sync workqueue flushing from drivers")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 51d638b1 07-May-2017 Wanpeng Li <wanpeng.li@hotmail.com>

block/mq: fix potential deadlock during cpu hotplug

This can be triggered by hot-unplug one cpu.

======================================================
[ INFO: possible circular locking dependency detected ]
4.11.0+ #17 Not tainted
-------------------------------------------------------
step_after_susp/2640 is trying to acquire lock:
(all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110

but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (cpu_hotplug.lock){+.+.+.}:
lock_acquire+0x11c/0x230
__mutex_lock+0x92/0x990
mutex_lock_nested+0x1b/0x20
get_online_cpus+0x64/0x80
blk_mq_init_allocated_queue+0x3a0/0x4e0
blk_mq_init_queue+0x3a/0x60
loop_add+0xe5/0x280
loop_init+0x124/0x177
do_one_initcall+0x53/0x1c0
kernel_init_freeable+0x1e3/0x27f
kernel_init+0xe/0x100
ret_from_fork+0x31/0x40

-> #0 (all_q_mutex){+.+...}:
__lock_acquire+0x189a/0x18a0
lock_acquire+0x11c/0x230
__mutex_lock+0x92/0x990
mutex_lock_nested+0x1b/0x20
blk_mq_queue_reinit_work+0x18/0x110
blk_mq_queue_reinit_dead+0x1c/0x20
cpuhp_invoke_callback+0x1f2/0x810
cpuhp_down_callbacks+0x42/0x80
_cpu_down+0xb2/0xe0
freeze_secondary_cpus+0xb6/0x390
suspend_devices_and_enter+0x3b3/0xa40
pm_suspend+0x129/0x490
state_store+0x82/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x135/0x1c0
__vfs_write+0x37/0x160
vfs_write+0xcd/0x1d0
SyS_write+0x58/0xc0
do_syscall_64+0x8f/0x710
return_from_SYSCALL_64+0x0/0x7a

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(all_q_mutex);
lock(cpu_hotplug.lock);
lock(all_q_mutex);

*** DEADLOCK ***

8 locks held by step_after_susp/2640:
#0: (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
#1: (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
#2: (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
#3: (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
#4: (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
#5: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
#6: (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
#7: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0

stack backtrace:
CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
Call Trace:
dump_stack+0x99/0xce
print_circular_bug+0x1fa/0x270
__lock_acquire+0x189a/0x18a0
lock_acquire+0x11c/0x230
? lock_acquire+0x11c/0x230
? blk_mq_queue_reinit_work+0x18/0x110
? blk_mq_queue_reinit_work+0x18/0x110
__mutex_lock+0x92/0x990
? blk_mq_queue_reinit_work+0x18/0x110
? kmem_cache_free+0x2cb/0x330
? anon_transport_class_unregister+0x20/0x20
? blk_mq_queue_reinit_work+0x110/0x110
mutex_lock_nested+0x1b/0x20
? mutex_lock_nested+0x1b/0x20
blk_mq_queue_reinit_work+0x18/0x110
blk_mq_queue_reinit_dead+0x1c/0x20
cpuhp_invoke_callback+0x1f2/0x810
? __flow_cache_shrink+0x160/0x160
cpuhp_down_callbacks+0x42/0x80
_cpu_down+0xb2/0xe0
freeze_secondary_cpus+0xb6/0x390
suspend_devices_and_enter+0x3b3/0xa40
? rcu_read_lock_sched_held+0x79/0x80
pm_suspend+0x129/0x490
state_store+0x82/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x135/0x1c0
__vfs_write+0x37/0x160
? rcu_read_lock_sched_held+0x79/0x80
? rcu_sync_lockdep_assert+0x2f/0x60
? __sb_start_write+0xd9/0x1c0
? vfs_write+0x1ad/0x1d0
vfs_write+0xcd/0x1d0
SyS_write+0x58/0xc0
? rcu_read_lock_sched_held+0x79/0x80
do_syscall_64+0x8f/0x710
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL64_slow_path+0x25/0x25

The cpu hotplug path will hold cpu_hotplug.lock and then reinit all exiting
queues for blk mq w/ all_q_mutex, however, blk_mq_init_allocated_queue() will
contend these two locks in the inversion order. This is due to commit eabe06595d62
(blk/mq: Cure cpu hotplug lock inversion), it fixes a cpu hotplug lock inversion
issue because of hotplug rework, however the hotplug rework is still work-in-progress
and lives in a -tip branch and mainline cannot yet trigger that splat. The commit
breaks the linus's tree in the merge window, so this patch reverts the lock order
and avoids to splat linus's tree.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9c1051aa 04-May-2017 Omar Sandoval <osandov@fb.com>

blk-mq: untangle debugfs and sysfs

Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.

The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# eabe0659 04-May-2017 Peter Zijlstra <peterz@infradead.org>

block/mq: Cure cpu hotplug lock inversion

By poking at /debug/sched_features I triggered the following splat:

[] ======================================================
[] WARNING: possible circular locking dependency detected
[] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
[] ------------------------------------------------------
[] bash/2109 is trying to acquire lock:
[] (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
[]
[] but task is already holding lock:
[] (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
[]
[] which lock already depends on the new lock.
[]
[]
[] the existing dependency chain (in reverse order) is:
[]
[] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
[] lock_acquire+0x100/0x210
[] down_write+0x28/0x60
[] start_creating+0x5e/0xf0
[] debugfs_create_dir+0x13/0x110
[] blk_mq_debugfs_register+0x21/0x70
[] blk_mq_register_dev+0x64/0xd0
[] blk_register_queue+0x6a/0x170
[] device_add_disk+0x22d/0x440
[] loop_add+0x1f3/0x280
[] loop_init+0x104/0x142
[] do_one_initcall+0x43/0x180
[] kernel_init_freeable+0x1de/0x266
[] kernel_init+0xe/0x100
[] ret_from_fork+0x31/0x40
[]
[] -> #1 (all_q_mutex){+.+.+.}:
[] lock_acquire+0x100/0x210
[] __mutex_lock+0x6c/0x960
[] mutex_lock_nested+0x1b/0x20
[] blk_mq_init_allocated_queue+0x37c/0x4e0
[] blk_mq_init_queue+0x3a/0x60
[] loop_add+0xe5/0x280
[] loop_init+0x104/0x142
[] do_one_initcall+0x43/0x180
[] kernel_init_freeable+0x1de/0x266
[] kernel_init+0xe/0x100
[] ret_from_fork+0x31/0x40

[] *** DEADLOCK ***
[]
[] 3 locks held by bash/2109:
[] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
[] #1: (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
[] #2: (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
[]
[] stack backtrace:
[] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
[] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[] Call Trace:

[] lock_acquire+0x100/0x210
[] get_online_cpus+0x2a/0x90
[] static_key_slow_dec+0x1b/0x50
[] static_key_disable+0x20/0x30
[] sched_feat_write+0x131/0x170
[] full_proxy_write+0x97/0xd0
[] __vfs_write+0x28/0x120
[] vfs_write+0xb5/0x1a0
[] SyS_write+0x49/0xa0
[] entry_SYSCALL_64_fastpath+0x23/0xc2

This is because of the cpu hotplug lock rework. Break the chain at #1
by reversing the lock acquisition order. This way i_mutex_key#4 no
longer depends on cpu_hotplug_lock and things are good.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2719aa21 03-May-2017 Jens Axboe <axboe@fb.com>

blk-mq: don't use sync workqueue flushing from drivers

A previous commit introduced the sync flush, which we need from
internal callers like blk_mq_quiesce_queue(). However, we also
call the stop helpers from drivers, particularly from ->queue_rq()
when we have to stop processing for a bit. We can't block from
those locations, and we don't have to guarantee that we're
fully flushed.

Fixes: 9f993737906b ("blk-mq: unify hctx delayed_run_work and run_work")
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7a148c2f 01-May-2017 Ming Lei <ming.lei@redhat.com>

block: don't call blk_mq_quiesce_queue() after queue is frozen

After queue is frozen, no request in this queue can be in use at all, so
there can't be any .queue_rq() running on this queue. It isn't
necessary to call blk_mq_quiesce_queue() any more, so remove it in both
elevator_switch_mq() and blk_mq_update_nr_requests().

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Fixed up the description a bit.

Signed-off-by: Jens Axboe <axboe@fb.com>


# d6296d39 01-May-2017 Christoph Hellwig <hch@lst.de>

blk-mq: update ->init_request and ->exit_request prototypes

Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq. Except for a superflous check in
mtip32xx it was unused anyway.

Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow on cleanup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 21c6e939 10-Apr-2017 Jens Axboe <axboe@fb.com>

blk-mq: unify hctx delay_work and run_work

The only difference between ->run_work and ->delay_work, is that
the latter is used to defer running a queue. This is done by
marking the queue stopped, and scheduling ->delay_work to run
sometime in the future. While the queue is stopped, direct runs
or runs through ->run_work will not run the queue.

If we combine the handlers, then we need to handle two things:

1) If a delayed/stopped run is scheduled, then we should not run
the queue before that has been completed.
2) If a queue is delayed/stopped, the handler needs to restart
the queue. Normally a run of a queue with the stopped bit set
would be a no-op.

Case 1 is handled by modifying a currently pending queue run
to the deadline set by the caller of blk_mq_delay_queue().
Subsequent attempts to queue a queue run will find the work
item already pending, and direct runs will see a stopped queue
as before.

Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
that tells the work handler that it should clear a stopped
queue and run the handler.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9f993737 10-Apr-2017 Jens Axboe <axboe@fb.com>

blk-mq: unify hctx delayed_run_work and run_work

They serve the exact same purpose. Get rid of the non-delayed
work variant, and just run it without delay for the normal case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# abc25a69 21-Apr-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Fix preempt count imbalance

Avoid that the following kernel bug gets triggered:

BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:349
in_atomic(): 1, irqs_disabled(): 0, pid: 8019, name: find
CPU: 10 PID: 8019 Comm: find Tainted: G W I 4.11.0-rc4-dbg+ #2
Call Trace:
dump_stack+0x68/0x93
___might_sleep+0x16e/0x230
__might_sleep+0x4a/0x80
__ext4_get_inode_loc+0x1e0/0x4e0
ext4_iget+0x70/0xbc0
ext4_iget_normal+0x2f/0x40
ext4_lookup+0xb6/0x1f0
lookup_slow+0x104/0x1e0
walk_component+0x19a/0x330
path_lookupat+0x4b/0x100
filename_lookup+0x9a/0x110
user_path_at_empty+0x36/0x40
vfs_statx+0x67/0xc0
SYSC_newfstatat+0x20/0x40
SyS_newfstatat+0xe/0x10
entry_SYSCALL_64_fastpath+0x18/0xad

This happens since the big if/else in blk_mq_make_request() doesn't
have final else section that also drops the ctx. Add that.

Fixes: b00c53e8f411 ("blk-mq: fix schedule-while-atomic with scheduler attached")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>

Added a bit more to the commit log.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 99c749a4 21-Apr-2017 Jens Axboe <axboe@fb.com>

blk-stat: kill blk_stat_rq_ddir()

No point in providing and exporting this helper. There's just
one (real) user of it, just use rq_data_dir().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5feeacdd 20-Apr-2017 Jens Axboe <axboe@fb.com>

blk-mq: add might_sleep check to blk_mq_get_driver_tag()

If the caller passes in wait=true, it has to be able to block
for a driver tag. We just had a bug where flush insertion
would block on tag allocation, while we had preempt disabled.
Ensure that we catch cases like that earlier next time.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0206319f 20-Apr-2017 Stephen Bates <sbates@raithlin.com>

blk-mq: Fix poll_stat for new size-based bucketing.

Fixes an issue where the size of the poll_stat array in request_queue
does not match the size expected by the new size based bucketing for
IO completion polling.

Fixes: 720b8ccc4500 ("blk-mq: Add a polling specific stats function")
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b00c53e8 20-Apr-2017 Jens Axboe <axboe@fb.com>

blk-mq: fix schedule-while-atomic with scheduler attached

We must have dropped the ctx before we call
blk_mq_sched_insert_request() with can_block=true, otherwise we risk
that a flush request can block on insertion if we are currently out of
tags.

[ 47.667190] BUG: scheduling while atomic: jbd2/sda2-8/2089/0x00000002
[ 47.674493] Modules linked in: x86_pkg_temp_thermal btrfs xor zlib_deflate raid6_pq sr_mod cdre
[ 47.690572] Preemption disabled at:
[ 47.690584] [<ffffffff81326c7c>] blk_mq_sched_get_request+0x6c/0x280
[ 47.701764] CPU: 1 PID: 2089 Comm: jbd2/sda2-8 Not tainted 4.11.0-rc7+ #271
[ 47.709630] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 47.718081] Call Trace:
[ 47.720903] dump_stack+0x4f/0x73
[ 47.724694] ? blk_mq_sched_get_request+0x6c/0x280
[ 47.730137] __schedule_bug+0x6c/0xc0
[ 47.734314] __schedule+0x559/0x780
[ 47.738302] schedule+0x3b/0x90
[ 47.741899] io_schedule+0x11/0x40
[ 47.745788] blk_mq_get_tag+0x167/0x2a0
[ 47.750162] ? remove_wait_queue+0x70/0x70
[ 47.754901] blk_mq_get_driver_tag+0x92/0xf0
[ 47.759758] blk_mq_sched_insert_request+0x134/0x170
[ 47.765398] ? blk_account_io_start+0xd0/0x270
[ 47.770679] blk_mq_make_request+0x1b2/0x850
[ 47.775766] generic_make_request+0xf7/0x2d0
[ 47.780860] submit_bio+0x5f/0x120
[ 47.784979] ? submit_bio+0x5f/0x120
[ 47.789631] submit_bh_wbc.isra.46+0x10d/0x130
[ 47.794902] submit_bh+0xb/0x10
[ 47.798719] journal_submit_commit_record+0x190/0x210
[ 47.804686] ? _raw_spin_unlock+0x13/0x30
[ 47.809480] jbd2_journal_commit_transaction+0x180a/0x1d00
[ 47.815925] kjournald2+0xb6/0x250
[ 47.820022] ? kjournald2+0xb6/0x250
[ 47.824328] ? remove_wait_queue+0x70/0x70
[ 47.829223] kthread+0x10e/0x140
[ 47.833147] ? commit_timeout+0x10/0x10
[ 47.837742] ? kthread_create_on_node+0x40/0x40
[ 47.843122] ret_from_fork+0x29/0x40

Fixes: a4d907b6a33b ("blk-mq: streamline blk_mq_make_request")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 720b8ccc 07-Apr-2017 Stephen Bates <sbates@raithlin.com>

blk-mq: Add a polling specific stats function

Rather than bucketing IO statisics based on direction only we also
bucket based on the IO size. This leads to improved polling
performance. Update the bucket callback function and use it in the
polling latency estimation.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3a07bb1d 20-Apr-2017 Jens Axboe <axboe@fb.com>

blk-mq: fix potential oops with polling and blk-mq scheduler

If we have a scheduler attached, blk_mq_tag_to_rq() on the
scheduled tags will return NULL if a request is no longer
in flight. This is different than using the normal tags,
where it will always return the fixed request. Check for
this condition for polling, in case we happen to enter
polling for a completed request.

The request address remains valid, so this check and return
should be perfectly safe.

Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Tested-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# caf7df12 20-Apr-2017 Christoph Hellwig <hch@lst.de>

block: remove the errors field from struct request

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 453f8341 20-Apr-2017 Christoph Hellwig <hch@lst.de>

blk-mq: simplify __blk_mq_complete_request

Merge blk_mq_ipi_complete_request and blk_mq_stat_add into their only
caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 08e0029a 20-Apr-2017 Christoph Hellwig <hch@lst.de>

blk-mq: remove the error argument to blk_mq_complete_request

Now that all drivers that call blk_mq_complete_requests have a
->complete callback we can remove the direct call to blk_mq_end_request,
as well as the error argument to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# da8d7f07 19-Apr-2017 Bart Van Assche <bvanassche@acm.org>

block: Export blk_init_request_from_bio()

Export this function such that it becomes available to block
drivers.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c05f8525 14-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq-sched: make completed_request() callback more useful

Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5b727272 14-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq: export helpers

blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 710c785f 07-Apr-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Clarify comments in blk_mq_dispatch_rq_list()

The blk_mq_dispatch_rq_list() implementation got modified several
times but the comments in that function were not updated every
time. Since it is nontrivial what is going on, update the comments
in blk_mq_dispatch_rq_list().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 705cda97 07-Apr-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list

Since the next patch in this series will use RCU to iterate over
tag_list, make this safe. Add lockdep_assert_held() statements
in functions that iterate over tag_list to make clear that using
list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d945a365 05-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq: use true instead of 1 for blk_mq_queue_data.last

Trivial cleanup.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 807b1041 05-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq: make driver tag failure path easier to follow

Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6d8c6c0f 07-Apr-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Restart a single queue if tag sets are shared

To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7587a5ae 07-Apr-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Introduce blk_mq_delay_run_hw_queue()

Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().

This function will be used in the next patch in this series.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ebe8bddb 07-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq: remap queues when adding/removing hardware queues

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 54d5329d 07-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq-sched: fix crash in switch error path

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 93252632 05-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq-sched: set up scheduler tags when bringing up new queues

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 81380ca1 07-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq: use the right hctx when getting a driver tag fails

While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f2fbc9dd 05-Apr-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Remove blk_mq_queue_data.list

The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bf4907c0 30-Mar-2017 Jens Axboe <axboe@fb.com>

blk-mq: fix schedule-under-preempt for blocking drivers

Commit a4d907b6a33b unified the single and multi queue request handlers,
but in the process, it also screwed up the locking balance and calls
blk_mq_try_issue_directly() with the ctx preempt lock held. This is a
problem for drivers that have set BLK_MQ_F_BLOCKING, since now they
can't reliably sleep.

While in there, protect against similar issues in the future, by adding
a might_sleep() trigger in the BLOCKING path for direct issue or queue
run.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Josef Bacik <josef@toxicpanda.com>
Fixes: a4d907b6a33b ("blk-mq: streamline blk_mq_make_request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ac77a0c4 29-Mar-2017 Minchan Kim <minchan@kernel.org>

block: do not put mq context in blk_mq_alloc_request_hctx

In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get sw context so we don't need to put the context with
blk_mq_put_ctx. Unless, we will see preempt counter underflow.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3e06eb3d 29-Mar-2017 Minchan Kim <minchan@kernel.org>

block: do not put mq context in blk_mq_alloc_request_hctx

In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get sw context so we don't need to put the context with
blk_mq_put_ctx. Unless, we will see preempt counter underflow.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3e8a7069 24-Mar-2017 Jens Axboe <axboe@fb.com>

blk-mq: include errors in did_work calculation

Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not the we made
progress. Progress includes if we got an error return. If we don't,
this can lead to a hang in blk_mq_sched_dispatch_requests() when a
driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
manually ending the IO in error and return BLK_MQ_QUEUE_OK.

Tested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b58e1769 28-Mar-2017 Josef Bacik <josef@toxicpanda.com>

block-mq: don't re-queue if we get a queue error

When try to issue a request directly and we fail we will requeue the
request, but call blk_mq_end_request() as well. This leads to the
completed request being on a queuelist and getting ended twice, which
causes list corruption in schedulers and other shenanigans.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 48b99c9d 29-Mar-2017 Jens Axboe <axboe@fb.com>

blk-mq: fix a typo and a spelling mistake

Signed-off-by: Jens Axboe <axboe@fb.com>


# 0c9539a4 28-Mar-2017 Omar Sandoval <osandov@fb.com>

blk-mq: fix leak of q->stats

blk_alloc_queue_node() already allocates q->stats, so
blk_mq_init_allocated_queue() is overwriting it with a new allocation.

Fixes: a83b576c9c25 ("block: fix stacked driver stats init and free")
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1671d522 27-Mar-2017 Ming Lei <tom.leiming@gmail.com>

block: rename blk_mq_freeze_queue_start()

As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().

So rename it and we can use this function in both
paths.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d9d149a3 27-Mar-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: comment on races related with timeout handler

This patch adds comment on two races related with
timeout handler:

- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout

Both the races themselves and current solution aren't
explicit enough, so add comments on them.

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a4ef8e56 21-Mar-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: don't complete un-started request in timeout handler

When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.

In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[ 581.789754] ------------[ cut here ]------------
[ 581.789758] kernel BUG at block/blk-mq.c:374!
[ 581.789760] invalid opcode: 0000 [#1] SMP
[ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 581.789804] Workqueue: kblockd blk_mq_timeout_work
[ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[ 581.789820] Call Trace:
[ 581.789825] blk_mq_check_expired+0x76/0x80
[ 581.789828] bt_iter+0x45/0x50
[ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789840] ? __switch_to+0x140/0x450
[ 581.789841] blk_mq_timeout_work+0x88/0x170
[ 581.789845] process_one_work+0x165/0x410
[ 581.789847] worker_thread+0x137/0x4c0
[ 581.789851] kthread+0x101/0x140
[ 581.789853] ? rescuer_thread+0x3b0/0x3b0
[ 581.789855] ? kthread_park+0x90/0x90
[ 581.789860] ret_from_fork+0x2c/0x40
[ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451] ? mempool_free+0x2b/0x80
[ 6984.857455] ? bio_free+0x4e/0x60
[ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462] blk_mq_process_rq_list+0x133/0x170
[ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467] blk_mq_run_work_fn+0x12/0x20
[ 6984.857473] process_one_work+0x165/0x410
[ 6984.857475] worker_thread+0x137/0x4c0
[ 6984.857478] kthread+0x101/0x140
[ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481] ? kthread_park+0x90/0x90
[ 6984.857489] ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 88eeca49 27-Mar-2017 Shaohua Li <shli@fb.com>

block: track request size in blk_issue_stat

Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 93efe981 24-Mar-2017 Jens Axboe <axboe@fb.com>

blk-mq: include errors in did_work calculation

Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not the we made
progress. Progress includes if we got an error return. If we don't,
this can lead to a hang in blk_mq_sched_dispatch_requests() when a
driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
manually ending the IO in error and return BLK_MQ_QUEUE_OK.

Tested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a4d907b6 22-Mar-2017 Christoph Hellwig <hch@lst.de>

blk-mq: streamline blk_mq_make_request

Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos. Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2299722c 22-Mar-2017 Christoph Hellwig <hch@lst.de>

blk-mq: split the plug and sync cases in blk_mq_make_request

Now that we have a nice direct issue heper this helps simplifying
the code a bit, and also gets rid of the old_rq variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5eb6126e 22-Mar-2017 Christoph Hellwig <hch@lst.de>

blk-mq: improve blk_mq_try_issue_directly

Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boileplate code in the caller which would get duplicated with new callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 254d259d 22-Mar-2017 Christoph Hellwig <hch@lst.de>

blk-mq: merge mq and sq make_request instances

They are mostly the same code anyway - this just one small conditional
for the plug case that is different for both variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7642747d 22-Mar-2017 Christoph Hellwig <hch@lst.de>

blk-mq: remove BLK_MQ_F_DEFER_ISSUE

This flag was never used since it was introduced.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 95a49603 21-Mar-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: don't complete un-started request in timeout handler

When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.

In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[ 581.789754] ------------[ cut here ]------------
[ 581.789758] kernel BUG at block/blk-mq.c:374!
[ 581.789760] invalid opcode: 0000 [#1] SMP
[ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 581.789804] Workqueue: kblockd blk_mq_timeout_work
[ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[ 581.789820] Call Trace:
[ 581.789825] blk_mq_check_expired+0x76/0x80
[ 581.789828] bt_iter+0x45/0x50
[ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789840] ? __switch_to+0x140/0x450
[ 581.789841] blk_mq_timeout_work+0x88/0x170
[ 581.789845] process_one_work+0x165/0x410
[ 581.789847] worker_thread+0x137/0x4c0
[ 581.789851] kthread+0x101/0x140
[ 581.789853] ? rescuer_thread+0x3b0/0x3b0
[ 581.789855] ? kthread_park+0x90/0x90
[ 581.789860] ret_from_fork+0x2c/0x40
[ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451] ? mempool_free+0x2b/0x80
[ 6984.857455] ? bio_free+0x4e/0x60
[ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462] blk_mq_process_rq_list+0x133/0x170
[ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467] blk_mq_run_work_fn+0x12/0x20
[ 6984.857473] process_one_work+0x165/0x410
[ 6984.857475] worker_thread+0x137/0x4c0
[ 6984.857478] kthread+0x101/0x140
[ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481] ? kthread_park+0x90/0x90
[ 6984.857489] ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 34dbad5d 21-Mar-2017 Omar Sandoval <osandov@fb.com>

blk-stat: convert to callback-based statistics reporting

Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:

1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.

This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.

The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.

For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.

Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# fa2e39cb 21-Mar-2017 Omar Sandoval <osandov@fb.com>

blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE}

The stats buckets will become generic soon, so make the existing users
use the common READ and WRITE definitions instead of one internal to
blk-stat.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0315b159 21-Mar-2017 Omar Sandoval <osandov@fb.com>

block: remove extra calls to wbt_exit()

We always call wbt_exit() from blk_release_queue(), so these are
unnecessary.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9c621104 14-Mar-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()

If we have scheduling enabled, we jump directly to insert-and-run.
That's fine, but we run the queue async and we don't pass in information
on whether we can block from this context or not. Fixup both these
cases.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 01388df3 22-Feb-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: free hctx->cpumask in release handler of hctx's kobject

It is obviously that hctx->cpumask is per hctx, and both
share same lifetime, so this patch moves freeing of hctx->cpumask
into release handler of hctx's kobject.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6c8b232e 22-Feb-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: make lifetime consistent between hctx and its kobject

This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
and trys to keep lifetime consistent between hctx and hctx's kobject.

Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
totally symmetrical, and kobject's refcounter drops to zero just
when the hctx is freed.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7ea5fe31 22-Feb-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: make lifetime consitent between q/ctx and its kobject

Currently from kobject view, both q->mq_kobj and ctx->kobj can
be released during one cycle of blk_mq_register_dev() and
blk_mq_unregister_dev(). Actually, sw queue's lifetime is
same with its request queue's, which is covered by request_queue->kobj.

So we don't need to call kobject_put() for the two kinds of
kobject in __blk_mq_unregister_dev(), instead we do that
in release handler of request queue.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 737f98cf 22-Feb-2017 Ming Lei <tom.leiming@gmail.com>

blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()

Both q->mq_kobj and sw queues' kobjects should have been initialized
once, instead of doing that each add_disk context.

Also this patch removes clearing of ctx in blk_mq_init_cpu_queues()
because percpu allocator fills zero to allocated variable.

This patch fixes one issue[1] reported from Omar.

[1] kernel wearning when doing unbind/bind on one scsi-mq device

[ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
[ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
[ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[ 19.350920] Workqueue: events_unbound async_run_entry_fn
[ 19.350920] Call Trace:
[ 19.350920] dump_stack+0x63/0x83
[ 19.350920] kobject_init+0x77/0x90
[ 19.350920] blk_mq_register_dev+0x40/0x130
[ 19.350920] blk_register_queue+0xb6/0x190
[ 19.350920] device_add_disk+0x1ec/0x4b0
[ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
[ 19.350920] async_run_entry_fn+0x48/0x150
[ 19.350920] process_one_work+0x1d0/0x480
[ 19.350920] worker_thread+0x48/0x4e0
[ 19.350920] kthread+0x101/0x140
[ 19.350920] ? process_one_work+0x480/0x480
[ 19.350920] ? kthread_create_on_node+0x60/0x60
[ 19.350920] ret_from_fork+0x2c/0x40

Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 113285b4 02-Mar-2017 Jens Axboe <axboe@fb.com>

blk-mq: ensure that bd->last is always set correctly

When drivers are called with a request in blk-mq, blk-mq flags the
state such that the driver knows if this is the last request in
this call chain or not. The driver can then use that information
to defer kicking off IO until bd->last is true. However, with blk-mq
and scheduling, we need to allocate a driver tag for a request before
it can be issued. If we fail to allocate such a tag, we could end up
in the situation where the last request issued did not have
bd->last == true set. This can then cause a driver hang.

This fixes a hang with virtio-blk, which uses bd->last as a hint
on whether to kick the queue or not.

Reported-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f91328c4 01-Mar-2017 Keith Busch <kbusch@kernel.org>

blk-mq: Provide freeze queue timeout

A driver may wish to take corrective action if queued requests do not
complete within a set time.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6bae363e 01-Mar-2017 Keith Busch <kbusch@kernel.org>

blk-mq: Export blk_mq_freeze_queue_wait

Drivers can start a freeze, so this provides a way to wait for frozen.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 562bef42 27-Feb-2017 Omar Sandoval <osandov@fb.com>

blk-mq: move update of tags->rqs to __blk_mq_alloc_request()

No functional difference, it just makes a little more sense to update
the tag map where we actually allocate the tag.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>


# 6d2809d5 27-Feb-2017 Omar Sandoval <osandov@fb.com>

blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request

blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
its blk_mq_alloc_request() counterpart. It also crashes because it
doesn't update the tags->rqs map.

Fix it by making it allocate a scheduler request.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>


# 415b806d 27-Feb-2017 Sagi Grimberg <sagi@grimberg.me>

blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Modified by me to also check at driver tag allocation time if the
original request was reserved, so we can be sure to allocate a
properly reserved tag at that point in time, too.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 59f082e4 01-Feb-2017 Shaohua Li <shli@fb.com>

blk-mq: allocate blk_mq_tags and requests in correct node

blk_mq_tags/requests of specific hardware queue are mostly used in
specific cpus, which might not be in the same numa node as disk. For
example, a nvme card is in node 0. half hardware queue will be used by
node 0, the other node 1.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 105ab3d8 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>

We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# da55f2cc 22-Feb-2017 Omar Sandoval <osandov@fb.com>

blk-mq: use sbq wait queues instead of restart for driver tags

Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.

One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().

However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0c2a6fe4 17-Feb-2017 Jens Axboe <axboe@fb.com>

blk-mq: don't special case flush inserts for blk-mq-sched

The current request insertion machinery works just fine for
directly inserting flushes, so no need to special case
this anymore.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 2aa0f21d 17-Feb-2017 Jens Axboe <axboe@fb.com>

blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not

Currently we're almost there, but if we dispatch nothing, then we
still return success.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# f6f94300 10-Feb-2017 Josef Bacik <jbacik@fb.com>

block: set make_request_fn manually in blk_mq_update_nr_hw_queues

Calling blk_queue_make_request resets a bunch of settings on the
request_queue, but all we really want is to update the make_request_fn,
so do this directly so we don't lose things like the logical and
physical block sizes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1e739730 08-Feb-2017 Christoph Hellwig <hch@lst.de>

block: optionally merge discontiguous discard bios into a single request

Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code. I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.

Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 34fe7c05 08-Feb-2017 Christoph Hellwig <hch@lst.de>

block: enumify ELEVATOR_*_MERGE

Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 18fbda91 31-Jan-2017 Omar Sandoval <osandov@fb.com>

block: use same block debugfs directory for blk-mq and blktrace

When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 12d70958 31-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: don't fail allocating driver tag for stopped hw queue

We rely on blk_mq_get_driver_tag() not failing if 'wait' is true,
but it currently fails in that case if the queue happens to be
stopped at the time of the call.

We don't need to check for stopped here, it's just assigning
the tag. If the queue is stopped, we'll handle it when
attempting to run the queue.

This fixes a stall/crash on flush intensive workloads, where
we proceed to process a flush that doesn't have a valid tag
assigned.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 82ed4db4 27-Jan-2017 Christoph Hellwig <hch@lst.de>

block: split scsi_request out of struct request

And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data. To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bd6737f1 27-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()

Instead of letting the caller check this and handle the details
of inserting a flush request, put the logic in the scheduler
insertion function. This fixes direct flush insertion outside
of the usual make_request_fn calls, like from dm via
blk_insert_cloned_request().

Signed-off-by: Jens Axboe <axboe@fb.com>


# f73f44eb 27-Jan-2017 Christoph Hellwig <hch@lst.de>

block: add a op_is_flush helper

This centralizes the checks for bios that needs to be go into the flush
state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c13660a0 26-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()

When we invoke dispatch_requests(), the scheduler empties everything
into the passed in list. This isn't always a good thing, since it
means that we remove items that we could have potentially merged
with.

Change the function to dispatch single requests at the time. If
we do that, we can backoff exactly at the point where the device
can't consume more IO, and leave the rest with the scheduler for
better merging and future dispatch decision making.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>


# 50e1dab8 26-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: fix starvation for multiple hardware queues and shared tags

If we have both multiple hardware queues and shared tag map between
devices, we need to ensure that we propagate the hardware queue
restart bit higher up. This is because we can get into a situation
where we don't have any IO pending on a hardware queue, yet we fail
getting a tag to start new IO. If that happens, it's not enough to
mark the hardware queue as needing a restart, we need to bubble
that up to the higher level queue as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>


# 99cf1dc5 26-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: release driver tag on a requeue event

We don't want to hold on to this resource when we have a scheduler
attached.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>


# 3c782d67 26-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: fix potential race in queue restart and driver tag allocation

Once we mark the queue as needing a restart, re-check if we can
get a driver tag. This fixes a theoretical issue where the needed
IO completes _after_ blk_mq_get_driver_tag() fails, but before we
manage to set the restart bit.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>


# 0abad774 26-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: improve scheduler queue sync/async running

We'll use the same criteria for whether we need to run the queue sync
or async when we have a scheduler, as we do without one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>


# 07e4fead 25-Jan-2017 Omar Sandoval <osandov@fb.com>

blk-mq: create debugfs directory tree

In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

# tree -d /sys/kernel/debug/block
/sys/kernel/debug/block
|-- nvme0n1
| `-- mq
| |-- 0
| | `-- cpu0
| |-- 1
| | `-- cpu1
| |-- 2
| | `-- cpu2
| `-- 3
| `-- cpu3
`-- vda
`-- mq
`-- 0
|-- cpu0
|-- cpu1
|-- cpu2
`-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5a797e00 26-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: don't lose flags passed in to blk_mq_alloc_request()

If we come in from blk_mq_alloc_requst() with NOWAIT set in flags,
we must ensure that we don't later overwrite that in
blk_mq_sched_get_request(). Initialize alloc_data->flags before
passing it in.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 200e86b3 25-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: only apply active queue tag throttling for driver tags

If we have a scheduler attached, we have two sets of tags. We don't
want to apply our active queue throttling for the scheduler side
of tags, that only applies to driver tags since that's the resource
we need to dispatch an IO.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 70f36b60 19-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: allow resize of scheduler requests

Add support for growing the tags associated with a hardware queue, for
the scheduler tags. Currently we only support resizing within the
limits of the original depth, change that so we can grow it as well by
allocating and replacing the existing scheduler tag set.

This is similar to how we could increase the software queue depth with
the legacy IO stack and schedulers.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 7e79dadc 19-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: stop hardware queue in blk_mq_delay_queue()

The run handler we register for the delayed work requires that the
queue be stopped, yet we leave that up to the caller. Let's move
it into blk_mq_delay_queue() itself, so that the API is sane.

This fixes a stall with SCSI, where it calls blk_mq_delay_queue()
without having stopped the queue. Hence the queue is never run.

Reported-by: Hannes Reinecke <hare@suse.com>
Fixes: 70f4db639c5b ("blk-mq: add blk_mq_delay_queue")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 88a75033 18-Jan-2017 Keith Busch <kbusch@kernel.org>

blk-mq: Remove unused variable

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d3484991 13-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: allow setting of default IO scheduler

Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# bd166ef1 17-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: add framework for MQ capable IO schedulers

This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 2af8cbe3 13-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: split tag ->rqs[] into two

This is in preparation for having two sets of tags available. For
that we need a static index, and a dynamically assignable one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# fd2d3326 12-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: add support for carrying internal tag information in blk_qc_t

No functional change in this patch, just in preparation for having
two types of tags available to the block layer for a single request.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# cc71a6f4 11-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: abstract out helpers for allocating/freeing tag maps

Prep patch for adding an extra tag map for scheduler requests.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 4941115b 13-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-tag: cleanup the normal/reserved tag allocation

This is in preparation for having another tag set available. Cleanup
the parameters, and allow passing in of tags for blk_mq_put_tag().

Signed-off-by: Jens Axboe <axboe@fb.com>
[hch: even more cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 2c3ad667 14-Dec-2016 Jens Axboe <axboe@fb.com>

blk-mq: export some helpers we need to the scheduling framework

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 16a3c2a7 15-Dec-2016 Jens Axboe <axboe@fb.com>

blk-mq: un-export blk_mq_free_hctx_request()

It's only used in blk-mq, kill it from the main exported header
and kill the symbol export as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# f8a5b122 13-Dec-2016 Jens Axboe <axboe@fb.com>

blk-mq: make mq_ops a const pointer

We never change it, make that clear.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>


# 8b0e1953 24-Dec-2016 Thomas Gleixner <tglx@linutronix.de>

ktime: Cleanup ktime_set() usage

ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# d1b1cea1 14-Dec-2016 Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

blk-mq: Fix failed allocation path when mapping queues

In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped. Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any income IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep it's tags. There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path. The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
lr: c000000000575328: __bt_get+0x48/0xd0
sp: c000000fe99eb390
msr: 900000010280b033
dar: 28
dsisr: 40000000
current = 0xc000000fe9966800
paca = 0xc000000007e80300 softe: 0 irq_happened: 0x01
pid = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
[c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
[c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
[c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
[c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
[c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
[c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
[c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
[c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
[c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
[c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
[c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
[c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
[c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
[c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
[c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
[c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
[c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
[c000000fe99ebe30] c000000000009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 36e1f3d1 06-Dec-2016 Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

blk-mq: Avoid memory reclaim when remapping queues

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system. The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO. With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
- Use GFP_NOIO instead of GFP_NOWAIT.

Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>


# f04c3df3 07-Dec-2016 Jens Axboe <axboe@fb.com>

blk-mq: abstract out blk_mq_dispatch_rq_list() helper

Takes a list of requests, and dispatches it. Moves any residual
requests to the dispatch list.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>


# ae911c5e 08-Dec-2016 Jens Axboe <axboe@fb.com>

blk-mq: add blk_mq_start_stopped_hw_queue()

We have a variant for all hardware queues, but not one for a single
hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>


# 6e85eaf3 02-Dec-2016 Jens Axboe <axboe@fb.com>

blk-mq: blk_account_io_start() takes a bool

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>


# 415d3dab 28-Nov-2016 Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

blk-mq: Drop explicit timeout sync in hotplug

After commit 287922eb0b18 ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 64f1c21e 14-Nov-2016 Jens Axboe <axboe@fb.com>

blk-mq: make the polling code adaptive

The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1 Never enter hybrid sleep mode
0 Use half of the completion mean for the sleep delay
>0 Use this specific value as the sleep delay

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>


# 06426adf 14-Nov-2016 Jens Axboe <axboe@fb.com>

blk-mq: implement hybrid poll mode for sync O_DIRECT

This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>


# 0a6219a9 16-Nov-2016 Ming Lei <ming.lei@canonical.com>

block: deal with stale req count of plug list

In both legacy and mq path, req count of plug list is computed
before allocating request, so the number can be stale when falling
back to slept allocation, also the new introduced wbt can sleep
too.

This patch deals with the case by checking if plug list becomes
empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds'
which is introduced by Shaohua's patches of dispatching big request.

Fixes: 600271d900002(blk-mq: immediately dispatch big size request)
Fixes: 50d24c34403c6(block: immediately dispatch big size request)
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bbd7bb70 04-Nov-2016 Jens Axboe <axboe@fb.com>

block: move poll code to blk-mq

The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 066a4a73 11-Nov-2016 Jens Axboe <axboe@fb.com>

blk-mq: blk_mq_try_issue_directly() should lookup hardware queue

A previous commit changed this to pass in the hardware queue, but
it was using the wrong hardware queue. Hence a request that was
allocated on one hardware queue ended up being issued on another
one, and that caused IO timeouts and oopses on some drivers. Since
the request holds hardware queue private resources, like a tag,
we can't just issue it on a different hardware queue.

Fixes: 2253efc850c4 ("blk-mq: Move more code into blk_mq_direct_issue_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 87760e5e 09-Nov-2016 Jens Axboe <axboe@fb.com>

block: hook up writeback throttling

Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>


# cf43e6be 07-Nov-2016 Jens Axboe <axboe@fb.com>

block: add scalable completion tracking of requests

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.

Signed-off-by: Jens Axboe <axboe@fb.com>


# c02ebfdd 27-Sep-2016 Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

blk-mq: Always schedule hctx->next_cpu

Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu. This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.

Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 600271d9 03-Nov-2016 Shaohua Li <shli@fb.com>

blk-mq: immediately dispatch big size request

This is corresponding part for blk-mq. Disk with multiple hardware
queues doesn't need this as we only hold 1 request at most.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2b053aca 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()

Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6a83e74d 02-Nov-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Introduce blk_mq_quiesce_queue()

blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have finished (this means invocation of request.end_io()).
The algorithm used by blk_mq_quiesce_queue() is as follows:
* Hold either an RCU read lock or an SRCU read lock around
.queue_rq() calls. The former is used if .queue_rq() does not
block and the latter if .queue_rq() may block.
* blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
followed by synchronize_srcu() or synchronize_rcu(). The latter
call waits for .queue_rq() invocations that started before
blk_mq_quiesce_queue() was called.
* The blk_mq_hctx_stopped() calls that control whether or not
.queue_rq() will be called are called with the (S)RCU read lock
held. This is necessary to avoid race conditions against
blk_mq_quiesce_queue().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9b7dd572 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Remove blk_mq_cancel_requeue_work()

Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 52d7f1b5 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Avoid that requeueing starts stopped queues

Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
if __dm_suspend() or __dm_resume() is called and not if the
requeue list is processed.
* In the SCSI core queue stopping and starting should only be
performed by the scsi_internal_device_block() and
scsi_internal_device_unblock() functions but not by any other
function. Although the blk_mq_stop_hw_queue() call in
scsi_queue_rq() may help to reduce CPU load if a LLD queue is
full, figuring out whether or not a queue should be restarted
when requeueing a command would require to introduce additional
locking in scsi_mq_requeue_cmd() to avoid a race with
scsi_internal_device_block(). Avoid this complexity by removing
the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
blk_mq_start_stopped_hw_queues() explicitly should start stopped
queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
xen-blkfront driver in its blkif_recover() function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2253efc8 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Move more code into blk_mq_direct_issue_request()

Move the "hctx stopped" test and the insert request calls into
blk_mq_direct_issue_request(). Rename that function into
blk_mq_try_issue_directly() to reflect its new semantics. Pass
the hctx pointer to that function instead of looking it up a
second time. These changes avoid that code has to be duplicated
in the next patch.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# fd001443 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Introduce blk_mq_queue_stopped()

The function blk_queue_stopped() allows to test whether or not a
traditional request queue has been stopped. Introduce a helper
function that allows block drivers to query easily whether or not
one or more hardware contexts of a blk-mq queue have been stopped.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5d1b25c1 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Introduce blk_mq_hctx_stopped()

Multiple functions test the BLK_MQ_S_STOPPED bit so introduce
a helper function that performs this test.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bc27c01b 28-Oct-2016 Bart Van Assche <bvanassche@acm.org>

blk-mq: Do not invoke .queue_rq() for a stopped queue

The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ef295ecf 28-Oct-2016 Christoph Hellwig <hch@lst.de>

block: better op and flags encoding

Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e8064021 20-Oct-2016 Christoph Hellwig <hch@lst.de>

block: split out request-only flags into a new namespace

A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2552e3f8 27-Oct-2016 Jens Axboe <axboe@fb.com>

blk-mq: get rid of confusing blk_map_ctx structure

We can just use struct blk_mq_alloc_data - it has a few more
members, but we allocate it further down the stack anyway. So
this cleans up the code, and reduces the stack overhead a bit.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 7dd2fb68 27-Oct-2016 Jens Axboe <axboe@fb.com>

blk-mq: update hardware and software queues for sleeping alloc

If we end up sleeping due to running out of requests, we should
update the hardware and software queues in the map ctx structure.
Otherwise we could end up having rq->mq_ctx point to the pre-sleep
context, and risk corrupting ctx->rq_list since we'll be
grabbing the wrong lock when inserting the request.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7fe31130 27-Oct-2016 Jens Axboe <axboe@fb.com>

blk-mq: update hardware and software queues for sleeping alloc

If we end up sleeping due to running out of requests, we should
update the hardware and software queues in the map ctx structure.
Otherwise we could end up having rq->mq_ctx point to the pre-sleep
context, and risk corrupting ctx->rq_list since we'll be
grabbing the wrong lock when inserting the request.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request")
Signed-off-by: Jens Axboe <axboe@fb.com>


# c8712c6a 23-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: skip unmapped queues in blk_mq_alloc_request_hctx

This provides the caller a feedback that a given hctx is not mapped and thus
no command can be sent on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 97a32864 23-Sep-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: fixup "Convert to new hotplug state machine"

The "blk_mq_queue_reinit_dead()" just cleared the cpumask instead doing
a copy. Since we might never had an online callback we could end up with
a ZERO mask which in turn leads to crash as test robot demonstarted.

Fixes: 65d5291eee66 ("blk-mq: Convert to new hotplug state machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1b792f2f 21-Sep-2016 Jens Axboe <axboe@fb.com>

blk-mq: add flag for drivers wanting blocking ->queue_rq()

If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that we always
calls it from a safe context.

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Josef Bacik <jbacik@fb.com>


# 63581af3 22-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: remove non-blocking pass in blk_mq_map_request

bt_get already does a non-blocking pass as well as running the queue
when scheduling internally, no need to duplicate it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 841bac2c 21-Sep-2016 Jens Axboe <axboe@fb.com>

blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()

Two cases:

1) blk_mq_alloc_request() needlessly re-runs the queue, after
calling into the tag allocation without NOWAIT set. We don't
need to do that.

2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
the async flag set to false.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 65d5291e 22-Sep-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Convert to new hotplug state machine

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwing <hch@lst.de>
Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9467f859 22-Sep-2016 Thomas Gleixner <tglx@linutronix.de>

blk-mq/cpu-notif: Convert to new hotplug state machine

Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwing <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 40aabb67 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: push per-cpu last_tag into sbitmap_queue

Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 88459642 17-Sep-2016 Omar Sandoval <osandov@fb.com>

blk-mq: abstract tag allocation out into sbitmap library

This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 703fd1c0 16-Sep-2016 Jens Axboe <axboe@fb.com>

blk-mq: account higher order dispatch

We currently account a '0' dispatch, and anything above that still falls
below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
we don't account it.

Change the last bucket to be inclusive of anything above the range we
track, and have the sysfs file reflect that by including a '+' in the
output:

$ cat /sys/block/nvme0n1/mq/0/dispatched
0 1006
1 20229
2 1
4 0
8 0
16 0
32+ 0

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 9151bcb4 15-Sep-2016 Jens Axboe <axboe@fb.com>

blk-mq: kill unused blk_mq_create_mq_map()

Fixes 1b157939f92a ("blk-mq: get rid of the cpumask in struct blk_mq_tags")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1b157939 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: get rid of the cpumask in struct blk_mq_tags

Unused now that NVMe sets up irq affinity before calling into blk-mq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# da695ba2 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: allow the driver to pass in a queue mapping

This allows drivers specify their own queue mapping by overriding the
setup-time function that builds the mq_map. This can be used for
example to build the map based on the MSI-X vector mapping provided
by the core interrupt layer for PCI devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7d7e0f90 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: remove ->map_queue

All drivers use the default, so provide an inline version of it. If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.

This provides better code generation, and better debugability as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bdd17e75 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: only allocate a single mq_map per tag_set

The mapping is identical for all queues in a tag_set, so stop wasting
memory for building multiple. Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering too much from the additional pointer chasing.
The same would apply to the mq_ops pointer as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4e68a011 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: don't redistribute hardware queues on a CPU hotplug event

Currently blk-mq will totally remap hardware context when a CPU hotplug
even happened, which causes major havoc for drivers, as they are never
told about this remapping. E.g. any carefully sorted out CPU affinity
will just be completely messed up.

The rebuild also doesn't really help for the common case of cpu
hotplug, which is soft onlining / offlining of cpus - in this case we
should just leave the queue and irq mapping as is. If it actually
worked it would have helped in the case of physical cpu hotplug,
although for that we'd need a way to actually notify the driver.
Note that drivers may already be able to accommodate such a topology
change on their own, e.g. using the reset_controller sysfs file in NVMe
will cause the driver to get things right for this case.

With the rebuild removed we will simplify retain the queue mapping for
a soft offlined CPU that will work when it comes back online, and will
map any newly onlined CPU to queue 0 until the driver initiates
a rebuild of the queue map.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2849450a 14-Sep-2016 Mike Snitzer <snitzer@redhat.com>

blk-mq: introduce blk_mq_delay_kick_requeue_list()

blk_mq_delay_kick_requeue_list() provides the ability to kick the
q->requeue_list after a specified time. To do this the request_queue's
'requeue_work' member was changed to a delayed_work.

blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
requests while it doesn't make sense to immediately requeue them
(e.g. when all paths in a DM multipath have failed).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 88c7b2b7 25-Aug-2016 Jens Axboe <axboe@fb.com>

blk-mq: prefetch request in blk_mq_tag_to_rq()

When drivers or the core calls this function, they usually
dereference the request shortly there after. Prefetch the first
cache line.

Profiling IO workloads shows that this is the most common cache
miss on the block side of things.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 27489a3c 24-Aug-2016 Jens Axboe <axboe@fb.com>

blk-mq: turn hctx->run_work into a regular work struct

We don't need the larger delayed work struct, since we always run it
immediately.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 0e87e58b 24-Aug-2016 Jens Axboe <axboe@fb.com>

blk-mq: improve warning for running a queue on the wrong CPU

__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online. If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.

Signed-off-by: Jens Axboe <axboe@fb.com>


# e57690fe 24-Aug-2016 Jens Axboe <axboe@fb.com>

blk-mq: don't overwrite rq->mq_ctx

We do this in a few places, if the CPU is offline. This isn't allowed,
though, since on multi queue hardware, we can't just move a request
from one software queue to another, if they map to different hardware
queues. The request and tag isn't valid on another hardware queue.

This can happen if plugging races with CPU offlining. But it does
no harm, since it can only happen in the window where we are
currently busy freezing the queue and flushing IO, in preparation
for redoing the software <-> hardware queue mappings.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 1eff9d32 05-Aug-2016 Jens Axboe <axboe@fb.com>

block: rename bio bi_rw to bi_opf

Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 71f79fb3 01-Aug-2016 Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

blk-mq: Allow timeouts to run while queue is freezing

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread. This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion. This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive. This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process. I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis. I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0c4de0f3 19-Jul-2016 Christoph Hellwig <hch@lst.de>

block: ensure bios return from blk_get_request are properly initialized

blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API. Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1f5bd336 13-Jun-2016 Ming Lin <ming.l@ssi.samsung.com>

blk-mq: add blk_mq_alloc_request_hctx

For some protocols like NVMe over Fabrics we need to be able to send
initialization commands to a specific queue.

Based on an earlier patch from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
[hch: disallow sleeping allocation, req_op fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 52b9c330 08-Jun-2016 Omar Sandoval <osandov@fb.com>

blk-mq: actually hook up defer list when running requests

If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
over the rest of the loop body. However, dptr is assigned later in the
loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
want it for.

NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
in-tree driver, but if the code's going to be there, it might as well
work.

Fixes: 74c450521dd8 ("blk-mq: add a 'list' parameter to ->queue_rq()")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 28a8f0d3 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d9d8c5c4 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block: convert is_sync helpers to use REQ_OPs.

This patch converts the is_sync helpers to use separate variables
for the operation and flags.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# cc6e3b10 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block: prepare mq request creation to use REQ_OPs

This patch modifies the blk mq request creation code to use
separate variables for the operation and flags, because in the
the next patches the struct request users will be converted like
was done for bios.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 87c279e6 01-Jun-2016 Omar Sandoval <osandov@fb.com>

blk-mq: really fix plug list flushing for nomerge queues

Commit 0809e3ac6231 ("block: fix plug list flushing for nomerge queues")
updated blk_mq_make_request() to set request_count even when
blk_queue_nomerges() returns true. However, blk_mq_make_request() only
does limited plugging and doesn't use request_count;
blk_sq_make_request() is the one that should have been fixed. Do that
and get rid of the unnecessary work in the mq version.

Fixes: 0809e3ac6231 ("block: fix plug list flushing for nomerge queues")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c7de5726 26-May-2016 Ming Lin <ming.l@samsung.com>

blk-mq: clear q->mq_ops if init fail

blk_mq_init_queue() calls blk_mq_init_allocated_queue(), but q->mq_ops
was not cleared when blk_mq_init_allocated_queue() fails.
Then blk_cleanup_queue() calls blk_mq_free_queue() which will crash because:
- q->all_q_node is not added to all_q_list yet
- q->tag_set is NULL
- hctx was not setup yet or already freed

Fixed it by clearing q->mq_ops on error path.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b3a834b1 16-May-2016 Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

blk-mq: fix undefined behaviour in order_to_size()

When this_order variable in blk_mq_init_rq_map() becomes zero
the code incorrectly decrements the variable and passes the result
to order_to_size() helper causing undefined behaviour:

UBSAN: Undefined behaviour in block/blk-mq.c:1459:27
shift exponent 4294967295 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc6-00072-g33656a1 #22

Fix the code by checking this_order variable for not having the zero
value first.

Reported-by: Meelis Roos <mroos@linux.ee>
Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism")
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a21f2a3e 03-May-2016 Michael Callahan <michaelcallahan@fb.com>

block: Minor blk_account_io_start usage cleanup

blk_account_io_start does not need to be wrapped with blk_do_io_stat
ais it already checks for that condition.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 897bb0c7 19-Mar-2016 Thomas Gleixner <tglx@linutronix.de>

blk-mq: Use proper cpumask iterator

queue_for_each_ctx() iterates over per_cpu variables under the assumption that
the possible cpu mask cannot have holes. That's wrong as all cpumasks can have
holes. In case there are holes the iteration ends up accessing uninitialized
memory and crashing as a result.

Replace the macro by a proper for_each_possible_cpu() loop and drop the unused
macro blk_ctx_sum() which references queue_for_each_ctx().

Reported-by: Xiong Zhou <jencce.kernel@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4ee86bab 15-Mar-2016 Hannes Reinecke <hare@suse.de>

blk-mq: add bounds check on tag-to-rq conversion

We need to check for a valid index before accessing the array
element to avoid accessing invalid memory regions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Modified by Jens to drop the unlikely(), and make the fall through
path be having a valid tag.

Signed-off-by: Jens Axboe <axboe@fb.com>


# e9137d4b 18-Feb-2016 Keith Busch <kbusch@kernel.org>

blk-mq: Fix NULL pointer updating nr_requests

A h/w context's tags are freed if it was not assigned a CPU. Check if
the context has tags before updating the depth.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 66841672 12-Feb-2016 Ming Lei <tom.leiming@gmail.com>

blk-mq: mark request queue as mq asap

Currently q->mq_ops is used widely to decide if the queue
is mq or not, so we should set the 'flag' asap so that both
block core and drivers can get the correct mq info.

For example, commit 868f2f0b720(blk-mq: dynamic h/w context count)
moves the hctx's initialization before setting q->mq_ops in
blk_mq_init_allocated_queue(), then cause blk_alloc_flush_queue()
to think the queue is non-mq and don't allocate command size
for the per-hctx flush rq.

This patches should fix the problem reported by Sasha.

Cc: Keith Busch <keith.busch@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count")
Signed-off-by: Jens Axboe <axboe@fb.com>


# a59e0f57 11-Feb-2016 Keith Busch <kbusch@kernel.org>

blk-mq: End unstarted requests on dying queue

Go directly to ending a request if it wasn't started. Previously, completing a
request may invoke a driver callback for a request it didn't initialize.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 868f2f0b 17-Dec-2015 Keith Busch <kbusch@kernel.org>

blk-mq: dynamic h/w context count

The hardware's provided queue count may change at runtime with resource
provisioning. This patch allows a block driver to alter the number of
h/w queues available when its resource count changes.

The main part is a new blk-mq API to request a new number of h/w queues
for a given live tag set. The new API freezes all queues using that set,
then adjusts the allocated count prior to remapping these to CPUs.

The bulk of the rest just shifts where h/w contexts and all their
artifacts are allocated and freed.

The number of max h/w contexts is capped to the number of possible cpus
since there is no use for more than that. As such, all pre-allocated
memory for pointers need to account for the max possible rather than
the initial number of queues.

A side effect of this is that the blk-mq will proceed successfully as
long as it can allocate at least one h/w context. Previously it would
fail request queue initialization if less than the requested number
was allocated.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bbc758ec 07-Nov-2015 Christoph Hellwig <hch@lst.de>

block: remove REQ_NO_TIMEOUT flag

This was added for the 'magic' AEN requests in the NVMe driver that never
return. We now handle them purely inside the driver and don't need this
core hack any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 287922eb 30-Oct-2015 Christoph Hellwig <hch@lst.de>

block: defer timeouts to a workqueue

Timer context is not very useful for drivers to perform any meaningful abort
action from. So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.

Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals. But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)

Contains a major update from Keith Bush:

"This patch removes synchronizing the timeout work so that the timer can
start a freeze on its own queue. The timer enters the queue, so timer
context can only start a freeze, but not wait for frozen."

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bffed457 02-Dec-2015 Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

blk-mq: Avoid memoryless numa node encoded in hctx numa_node

In architecture like powerpc, we can have cpus without any local memory
attached to it (a.k.a memoryless nodes). In such cases cpu to node mapping
can result in memory allocation hints for block hctx->numa_node populated
with node values which does not have real memory.

Instead use local_memory_node(), which is guaranteed to have memory.
local_memory_node is a noop in other architectures that does not support
memoryless nodes.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e0e827b9 02-Dec-2015 Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

blk-mq: Reuse hardware context cpumask for tags

hctx->cpumask is already populated and let the tag cpumask follow that
instead of going through a new for loop.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6f3b0e8b 26-Nov-2015 Christoph Hellwig <hch@lst.de>

blk-mq: add a flags parameter to blk_mq_alloc_request

We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t. Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b094f89c 20-Nov-2015 Jens Axboe <axboe@fb.com>

blk-mq: fix calling unplug callbacks with preempt disabled

Liu reported that running certain parts of xfstests threw the
following error:

BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
#0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
[<ffffffff8130191b>] dump_stack+0x4f/0x74
[<ffffffff8108ed95>] ___might_sleep+0x185/0x240
[<ffffffff8108eea2>] __might_sleep+0x52/0x90
[<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
[<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
[<ffffffff8109a6d1>] ? local_clock+0x21/0x40
[<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
[<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
[<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
[<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
[<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
[<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
[<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
[<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
[<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
[<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
[<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
[<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
[<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
[<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
[<ffffffff812d0ca0>] submit_bio+0x70/0x140
[<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
[<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
[<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
[<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
[<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
[<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
[<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
[<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
[<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
[<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
[<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
[<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
[<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
[<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
[<ffffffff81184df3>] do_writepages+0x23/0x40
[<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
[<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
[<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
[<ffffffff811e6805>] ? trylock_super+0x25/0x60
[<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
[<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812130b5>] wb_writeback+0x2b5/0x500
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
[<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
[<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
[<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
[<ffffffff81213842>] wb_workfn+0x92/0x330
[<ffffffff8107f133>] process_one_work+0x223/0x730
[<ffffffff8107f083>] ? process_one_work+0x173/0x730
[<ffffffff8108035f>] ? worker_thread+0x18f/0x430
[<ffffffff810802ed>] worker_thread+0x11d/0x430
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810858df>] kthread+0xef/0x110
[<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816673bf>] ret_from_fork+0x3f/0x70
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70

The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid has such callbacks.

Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.

This only affects blk-mq driven devices, and only those that
register a single queue.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1fa8cc52 05-Nov-2015 Jens Axboe <axboe@fb.com>

blk-mq: mark __blk_mq_complete_request() static

It's no longer used outside of blk-mq core.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 7b371636 05-Nov-2015 Jens Axboe <axboe@fb.com>

blk-mq: return tag/queue combo in the make_request_fn handlers

Return a cookie, blk_qc_t, from the blk-mq make request functions, that
allows a later caller to uniquely identify a specific IO. The cookie
doesn't mean anything to the caller, but the caller can use it to later
pass back to the block layer. The block layer can then identify the
hardware queue and request from that cookie.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>


# dece1635 05-Nov-2015 Jens Axboe <axboe@fb.com>

block: change ->make_request_fn() and users to return a queue cookie

No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>


# 71baba4b 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM

__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0164adc 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2404e607 03-Nov-2015 Jeff Moyer <jmoyer@redhat.com>

blk-mq: avoid excessive boot delays with large lun counts

Hi,

Zhangqing Luo reported long boot times on a system with thousands of
LUNs when scsi-mq was enabled. He narrowed the problem down to
blk_mq_add_queue_tag_set, where every queue is frozen in order to set
the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
added before it in sequence, which involves waiting for an RCU grace
period for each one. We don't need to do this. After the second queue
is added, only new queues need to be initialized with the shared tag.
We can do that by percolating the flag up to the blk_mq_tag_set, and
updating the newly added queue's hctxs if the flag is set.

This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
support for shared tags maps).

Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# cfd0c552 20-Oct-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: mark ctx as pending at batch in flush plug path

Most of times, flush plug should be the hottest I/O path,
so mark ctx as pending after all requests in the list are
inserted.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 676d0607 20-Oct-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: fix for trace_block_plug()

The trace point is for tracing plug event of each request
queue instead of each task, so we should check the request
count in the plug list from current queue instead of
current task.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e18378a6 20-Oct-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: check bio_mergeable() early before merging

It isn't necessary to try to merge the bio which is marked
as NOMERGE.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0809e3ac 20-Oct-2015 Jeff Moyer <jmoyer@redhat.com>

block: fix plug list flushing for nomerge queues

Request queues with merging disabled will not flush the plug list after
BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
on blk_attempt_plug_merge to compute the request_count. Fix this by
computing the number of queued requests even for nomerge queues.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3ef28e83 21-Oct-2015 Dan Williams <dan.j.williams@intel.com>

block: generic request_queue reference counting

Allow pmem, and other synchronous/bio-based block drivers, to fallback
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios. This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request(). The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

BUG: unable to handle kernel paging request at ffff880140000000
[..]
Call Trace:
[<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
[<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
[<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
[<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
[<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
[<ffffffff8141f966>] submit_bio+0x76/0x170
[<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
[<ffffffff81286e62>] submit_bh+0x12/0x20
[<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
[<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
[<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
[<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
[<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
[<ffffffff81303494>] ext4_put_super+0x64/0x360
[<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f42d79ab 13-Oct-2015 Junichi Nomura <j-nomura@ce.jp.nec.com>

blk-mq: fix use-after-free in blk_mq_free_tag_set()

tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().

Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0bf6cd5b 27-Sep-2015 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a helper to iterate all tags for a request_queue

And replace the blk_mq_tag_busy_iter with it - the driver use has been
replaced with a new helper a while ago, and internal to the block we
only need the new version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f4829a9b 27-Sep-2015 Christoph Hellwig <hch@lst.de>

blk-mq: fix racy updates of rq->errors

blk_mq_complete_request may be a no-op if the request has already
been completed by others means (e.g. a timeout or cancellation), but
currently drivers have to set rq->errors before calling
blk_mq_complete_request, which might leave us with the wrong error value.

Add an error parameter to blk_mq_complete_request so that we can
defer setting rq->errors until we known we won the race to complete the
request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 60de074b 26-Sep-2015 Akinobu Mita <akinobu.mita@gmail.com>

blk-mq: fix deadlock when reading cpu_list

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
entries by blk_mq_sysfs_unregister(). Removing sysfs entry needs to
be blocked until the active reference of the kernfs_node to be zero.

On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
/sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
blk_mq_hw_sysfs_cpus_show().

If these happen at the same time, a deadlock can happen. Because one
can wait for the active reference to be zero with holding all_q_mutex,
and the other tries to acquire all_q_mutex with holding the active
reference.

The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
is to avoid reading an imcomplete hctx->cpumask. Since reading sysfs
entry for blk-mq needs to acquire q->sysfs_lock, we can avoid deadlock
and reading an imcomplete hctx->cpumask by protecting q->sysfs_lock
while hctx->cpumask is being updated.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5778322e 26-Sep-2015 Akinobu Mita <akinobu.mita@gmail.com>

blk-mq: avoid inserting requests before establishing new mapping

Notifier callbacks for CPU_ONLINE action can be run on the other CPU
than the CPU which was just onlined. So it is possible for the
process running on the just onlined CPU to insert request and run
hw queue before establishing new mapping which is done by
blk_mq_queue_reinit_notify().

This can cause a problem when the CPU has just been onlined first time
since the request queue was initialized. At this time ctx->index_hw
for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
zero before blk_mq_queue_reinit_notify() is called by notifier
callbacks for CPU_ONLINE action.

For example, there is a single hw queue (hctx) and two CPU queues
(ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
a request is inserted into ctx1->rq_list and set bit0 in pending
bitmap as ctx1->index_hw is still zero.

And then while running hw queue, flush_busy_ctxs() finds bit0 is set
in pending bitmap and tries to retrieve requests in
hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0, so the
request in ctx1->rq_list is ignored.

Fix it by ensuring that new mapping is established before onlined cpu
starts running.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0e626368 26-Sep-2015 Akinobu Mita <akinobu.mita@gmail.com>

blk-mq: fix q->mq_usage_counter access race

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
q->mq_usage_counter while freezing all request queues in all_q_list.
On the other hand, q->mq_usage_counter is deinitialized in
blk_mq_free_queue() before deleting the queue from all_q_list.

So if CPU hotplug event occurs in the window, percpu_ref_kill() is
called with q->mq_usage_counter which has already been marked dead,
and it triggers warning. Fix it by deleting the queue from all_q_list
earlier than destroying q->mq_usage_counter.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a723bab3 26-Sep-2015 Akinobu Mita <akinobu.mita@gmail.com>

blk-mq: Fix use after of free q->mq_map

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
q->mq_map by blk_mq_update_queue_map() for all request queues in
all_q_list. On the other hand, q->mq_map is released before deleting
the queue from all_q_list.

So if CPU hotplug event occurs in the window, invalid memory access
can happen. Fix it by releasing q->mq_map in blk_mq_release() to make
it happen latter than removal from all_q_list.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4593fdbe 26-Sep-2015 Akinobu Mita <akinobu.mita@gmail.com>

blk-mq: fix sysfs registration/unregistration race

There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

null_add_dev
--> blk_mq_init_queue
--> blk_mq_init_allocated_queue
--> add to 'all_q_list' (*)
--> add_disk
--> blk_register_queue
--> blk_mq_register_disk (++)

null_del_dev
--> del_gendisk
--> blk_unregister_queue
--> blk_mq_unregister_disk (--)
--> blk_cleanup_queue
--> blk_mq_free_queue
--> del from 'all_q_list' (*)

blk_mq_queue_reinit
--> blk_mq_sysfs_unregister (-)
--> blk_mq_sysfs_register (+)

While the request queue is added to 'all_q_list' (*),
blk_mq_queue_reinit() can be called for the queue anytime by CPU
hotplug callback. But blk_mq_sysfs_unregister (-) and
blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
is finished. Because '/sys/block/*/mq/' is not exists.

There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
be used to track these sysfs stuff, but it is only fixing this issue
partially.

In order to fix it completely, we just need per-queue flag instead of
per-hctx flag with appropriate locking. So this introduces
q->mq_sysfs_init_done which is properly protected with all_q_mutex.

Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex is held. Since hctx->nr_ctx is reset temporarily and
updated in blk_mq_map_swqueue(), so we should avoid
blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
in CPU hotplug handling or adding/deleting gendisk .

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1356aae0 26-Sep-2015 Akinobu Mita <akinobu.mita@gmail.com>

blk-mq: avoid setting hctx->tags->cpumask before allocation

When unmapped hw queue is remapped after CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is setup in
blk_mq_map_swqueue(), otherwise it causes null pointer dereference.

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f75782e4 14-Sep-2015 Catalin Marinas <catalin.marinas@arm.com>

block: kmemleak: Track the page allocations for struct request

The pages allocated for struct request contain pointers to other slab
allocations (via ops->init_request). Since kmemleak does not track/scan
page allocations, the slab objects will be reported as leaks (false
positives). This patch adds kmemleak callbacks to allow tracking of such
pages.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche<bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0048b483 09-Aug-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: fix race between timeout and freeing request

Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This way is obviously
wrong because the request can be freed any time and some
fiedds of the request can't be trusted, then kernel oops
might be triggered[1].

Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.

Also blk_mq_tag_to_rq() gets much simplified with this patch.

Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].

[1] kernel oops log
[ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
[ 439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
[ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[ 439.700653] Dumping ftrace buffer:^M
[ 439.700653] (ftrace buffer empty)^M
[ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
[ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
[ 439.730500] RIP: 0010:[<ffffffff812d89ba>] [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283^M
[ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
[ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
[ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
[ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
[ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
[ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
[ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
[ 439.730500] Stack:^M
[ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
[ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
[ 439.755663] Call Trace:^M
[ 439.755663] <IRQ> ^M
[ 439.755663] [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
[ 439.755663] [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[ 439.755663] [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[ 439.755663] [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
[ 439.755663] [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[ 439.755663] [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
[ 439.755663] [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
[ 439.755663] [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
[ 439.755663] [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[ 439.755663] [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
[ 439.755663] [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
[ 439.755663] [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
[ 439.755663] [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
[ 439.755663] [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
[ 439.755663] <EOI> ^M
[ 439.755663] [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
[ 439.755663] [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
[ 439.755663] [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
[ 439.755663] [<ffffffff81550066>] __schedule+0x469/0x6cd^M
[ 439.755663] [<ffffffff8155039b>] schedule+0x82/0x9a^M
[ 439.789267] [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
[ 439.790911] [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
[ 439.790911] [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
[ 439.790911] [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
[ 439.790911] [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
[ 439.790911] [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
[ 439.790911] [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
[ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
^M
[ 439.790911] RIP [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[ 439.790911] RSP <ffff880819203da0>^M
[ 439.790911] CR2: 0000000000000158^M
[ 439.790911] ---[ end trace d40af58949325661 ]---^M

Cc: <stable@vger.kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 54efd50b 23-Apr-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: make generic_make_request handle arbitrarily sized bios

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4246a0b6 20-Jul-2015 Christoph Hellwig <hch@lst.de>

block: add a bi_error field to struct bio

Currently we have two different ways to signal an I/O error on a BIO:

(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e56f698b 16-Jul-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: set default timeout as 30 seconds

It is reasonable to set default timeout of request as 30 seconds instead of
30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64
based systems may choose 100 HZ.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()"
Signed-off-by: Jens Axboe <axboe@fb.com>


# c3b4afca 04-Jun-2015 Ming Lei <tom.leiming@gmail.com>

blk-mq: free hctx->ctxs in queue's release handler

Now blk_cleanup_queue() can be called before calling
del_gendisk()[1], inside which hctx->ctxs is touched
from blk_mq_unregister_hctx(), but the variable has
been freed by blk_cleanup_queue() at that time.

So this patch moves freeing of hctx->ctxs into queue's
release handler for fixing the oops reported by Stefan.

[1], 6cd18e711dd8075 (block: destroy bdi before blockdev is
unregistered)

Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f26cdc85 01-Jun-2015 Keith Busch <kbusch@kernel.org>

blk-mq: Shared tag enhancements

Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4ecd4fef 07-May-2015 Christoph Hellwig <hch@lst.de>

block: use an atomic_t for mq_freeze_depth

lockdep gets unhappy about the not disabling irqs when using the queue_lock
around it. Instead of trying to fix that up just switch to an atomic_t
and get rid of the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5b3f341f 08-May-2015 Shaohua Li <shli@fb.com>

blk-mq: make plug work for mutiple disks and queues

Last patch makes plug work for multiple queue case. However it only
works for single disk case, because it assumes only one request in the
plug list. If a task is accessing multiple disks, eg MD/DM, the
assumption is wrong. Let blk_attempt_plug_merge() record request from
the same queue.

V2: use NULL parameter in !mq case. Fix a bug. Add comments in
blk_attempt_plug_merge to make it less (hopefully) confusion.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f984df1f 08-May-2015 Shaohua Li <shli@fb.com>

blk-mq: do limited block plug for multiple queue case

plug is still helpful for workload with IO merge, but it can be harmful
otherwise especially with multiple hardware queues, as there is
(supposed) no lock contention in this case and plug can introduce
latency. For multiple queues, we do limited plug, eg plug only if there
is request merge. If a request doesn't have merge with following
request, the requet will be dispatched immediately.

V2: check blk_queue_nomerges() as suggested by Jeff.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 239ad215 08-May-2015 Shaohua Li <shli@fb.com>

blk-mq: avoid re-initialize request which is failed in direct dispatch

If we directly issue a request and it fails, we use
blk_mq_merge_queue_io(). But we already assigned bio to a request in
blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run
blk_mq_bio_to_request again.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e6c4438b 08-May-2015 Jeff Moyer <jmoyer@redhat.com>

blk-mq: fix plugging in blk_sq_make_request

The following appears in blk_sq_make_request:

/*
* If we have multiple hardware queues, just go directly to
* one of those for sync IO.
*/

We clearly don't have multiple hardware queues, here! This comment was
introduced with this commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue):

We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.

The old code had this:

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

use_plug = !is_flush_fua && !is_sync;

which is not equivalent. For the single queue case, that second half of
the && expression is always true. So, what I think was actually inteded
follows (and this more closely matches what is done in blk_queue_bio).

V2: delete the 'likely', which should not be a big deal

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9ba52e58 04-May-2015 Shaohua Li <shli@fb.com>

blk-mq: don't lose requests if a stopped queue restarts

Normally if driver is busy to dispatch a request the logic is like below:
block layer: driver:
__blk_mq_run_hw_queue
a. blk_mq_stop_hw_queue
b. rq add to ctx->dispatch

later:
1. blk_mq_start_hw_queue
2. __blk_mq_run_hw_queue

But it's possible step 1-2 runs between a and b. And since rq isn't in
ctx->dispatch yet, step 2 will not run rq. The rq might get lost if
there are no subsequent requests kick in.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2a34c087 20-Apr-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: fix CPU hotplug handling

hctx->tags has to be set as NULL in case that it is to be unmapped
no matter if set->tags[hctx->queue_num] is NULL or not in blk_mq_map_swqueue()
because shared tags can be freed already from another request queue.

The same situation has to be considered during handling CPU online too.
Unmapped hw queue can be remapped after CPU topo is changed, so we need
to allocate tags for the hw queue in blk_mq_map_swqueue(). Then tags
allocation for hw queue can be removed in hctx cpu online notifier, and it
is reasonable to do that after mapping is updated.

Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f054b56c 20-Apr-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: fix race between timeout and CPU hotplug

Firstly during CPU hotplug, even queue is freezed, timeout
handler still may come and access hctx->tags, which may cause
use after free, so this patch deactivates timeout handler
inside CPU hotplug notifier.

Secondly, tags can be shared by more than one queues, so we
have to check if the hctx has been unmapped, otherwise
still use-after-free on tags can be triggered.

Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 569fd0ce 17-Apr-2015 Jens Axboe <axboe@fb.com>

blk-mq: fix iteration of busy bitmap

Commit 889fa31f00b2 was a bit too eager in reducing the loop count,
so we ended up missing queues in some configurations. Ensure that
our division rounds up, so that's not the case.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 889fa31f00b2 ("blk-mq: reduce unnecessary software queue looping")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 889fa31f 15-Apr-2015 Chong Yuan <chong.yuan@memblaze.com>

blk-mq: reduce unnecessary software queue looping

In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
ctxs assigned to one hctx, they will all loop hctx->ctx_map.map_size
times. Here hctx->ctx_map.map_size is a const ALIGN(nr_cpu_ids, 8) / 8.
Especially, flush_busy_ctxs() is in hot code path. And it's unnecessary.
Change ->map_size to contain the actually mapped software queues, so we
only loop for as many iterations as we have to.

And remove cpumask setting and nr_ctx count in blk_mq_init_cpu_queues()
since they are all re-done in blk_mq_map_swqueue().
blk_mq_map_swqueue().

Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>

Updated by me for formatting and commenting.

Signed-off-by: Jens Axboe <axboe@fb.com>


# ac211175 09-Apr-2015 Linus Torvalds <torvalds@linux-foundation.org>

blk-mq: initialize 'struct request' and associated data to zero

Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
pointer in scsi_init_cmd_errh() with the blk-mq code.

The sense_buffer pointer should have been initialized by the call to
scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
some non-repeatable memory corruptor.

This patch makes sure we initialize the whole struct request allocation
(and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
using __GFP_ZERO in the allocation. The old code initialized a couple
of individual fields, leaving the rest undefined (although many of them
are then initialized in later phases, like blk_mq_rq_ctx_init() etc.

It's not entirely clear why this matters, but it's the rigth thing to do
regardless, and with 4.0 imminent this is the defensive "let's just make
sure everything is initialized properly" patch.

Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c76cbbcf4 30-Mar-2015 Wei Fang <fangwei1@huawei.com>

blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()

Don't assign ->rq_timeout twice.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f9018ac9 29-Mar-2015 Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>

block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()

At the beginning of blk_mq_alloc_tag_set(), we have already checked whether
'set->nr_hw_queues' is zero, so here remove this redundant check.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bfd343aa 11-Mar-2015 Keith Busch <kbusch@kernel.org>

blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set

Return -EBUSY if we're unable to enter a queue immediately when
allocating a blk-mq request without __GFP_WAIT.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b94ec296 11-Mar-2015 Mike Snitzer <snitzer@redhat.com>

blk-mq: export blk_mq_run_hw_queues

Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
and export it.

DM's suspend support must be able to run the queue without starting
stopped hw queues.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b62c21b7 12-Mar-2015 Mike Snitzer <snitzer@redhat.com>

blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk

Add a variant of blk_mq_init_queue that allows a previously allocated
queue to be initialized. blk_mq_init_allocated_queue models
blk_init_allocated_queue -- which was also created for DM's use.

DM's approach to device creation requires a placeholder request_queue be
allocated for use with alloc_dev() but the decision about what type of
request_queue will be ultimately created is deferred until all component
devices referenced in the DM table are processed to determine the table
type (request-based, blk-mq request-based, or bio-based).

Also, because of DM's late finalization of the request_queue type
the call to blk_mq_register_disk() doesn't happen during alloc_dev().
Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir
once the blk-mq queue is fully allocated.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9a30b096 12-Mar-2015 Mike Snitzer <snitzer@redhat.com>

blk-mq: fix use of incorrect goto label in blk_mq_init_queue error path

If percpu_ref_init() fails the allocated q and hctxs must get cleaned
up; using 'err_map' doesn't allow that to happen.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>


# 201f201c 10-Feb-2015 Jens Axboe <axboe@fb.com>

blk-mq: make blk_mq_run_queues() static

We no longer use it outside of blk-mq.c, so we can make it static
and stop exporting it. Additionally, kill the 'async' argument, as
there's only one used of it.

Signed-off-by: Jens Axboe <axboe@fb.com>


# e09aae7e 29-Jan-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: release mq's kobjects in blk_release_queue()

The kobject memory inside blk-mq hctx/ctx shouldn't have been freed
before the kobject is released because driver core can access it freely
before its release.

We can't do that in all ctx/hctx/mq_kobj's release handler because
it can be run before blk_cleanup_queue().

Given mq_kobj shouldn't have been introduced, this patch simply moves
mq's release into blk_release_queue().

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 74170118 29-Jan-2015 Ming Lei <ming.lei@canonical.com>

Revert "blk-mq: fix hctx/ctx kobject use-after-free"

This reverts commit 76d697d10769048e5721510100bf3a9413a56385.

The commit 76d697d10769048 causes general protection fault
reported from Bart Van Assche:

https://lkml.org/lkml/2015/1/28/334

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 24391c0d 23-Jan-2015 Shaohua Li <shli@fb.com>

blk-mq: add tag allocation policy

This is the blk-mq part to support tag allocation policy. The default
allocation policy isn't changed (though it's not a strict FIFO). The new
policy is round-robin for libata. But it's a try-best implementation. If
multiple tasks are competing, the tags returned will be mixed (which is
unavoidable even with !mq, as requests from different tasks can be
mixed in queue)

Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 76d697d1 19-Jan-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: fix hctx/ctx kobject use-after-free

The kobject memory shouldn't have been freed before the kobject
is released because driver core can access it freely before its
release.

This patch frees hctx in its release callback. For ctx, they
share one single per-cpu variable which is associated with
the request queue, so free ctx in q->mq_kobj's release handler.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(fix ctx kobjects)
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# eb130dbf 08-Jan-2015 Keith Busch <kbusch@kernel.org>

blk-mq: End unstarted requests on a dying queue

Requests that haven't been started prior to a queue dying can be ended
in error without waiting for them to start and time out.

Signed-off-by: Keith Busch <keith.busch@intel.com>

Added code comment to explain why this is done.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 5b3f25fc 07-Jan-2015 Keith Busch <kbusch@kernel.org>

blk-mq: Allow requests to never expire

Some types of requests may be started that are not gauranteed to ever
complete. This adds a request flag that a driver can use so mark the
request as such.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1885b24d 07-Jan-2015 Jens Axboe <axboe@fb.com>

blk-mq: Add helper to abort requeued requests

Adds a helper function a driver can use to abort requeued requests in
case any are pending when h/w queues are being removed.

Signed-off-by: Jens Axboe <axboe@fb.com>


# c68ed59f 07-Jan-2015 Keith Busch <kbusch@kernel.org>

blk-mq: Let drivers cancel requeue_work

Kicking requeued requests will start h/w queues in a work_queue, which
may alter the driver's requested state to temporarily stop them. This
patch exports a method to cancel the q->requeue_work so a driver can be
assured stopped h/w queues won't be started up before it is ready.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 973c0191 07-Jan-2015 Keith Busch <kbusch@kernel.org>

blk-mq: Export if requests were started

Drivers can iterate over all allocated request tags, but their callback
needs a way to know if the driver started the request in the first place.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3fd5940c 08-Jan-2015 Keith Busch <kbusch@kernel.org>

blk-mq: Wake tasks entering queue on dying

When the queue is set to dying, wake up tasks that are waiting on frozen
queue so they realize it is dying and abandon their request.

Signed-off-by: Keith Busch <keith.busch@intel.com>

Modified by me to add a code comment on the need for the wakeup.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 17ded320 07-Jan-2015 Jens Axboe <axboe@fb.com>

blk-mq: get rid of ->cmd_size in the hardware queue

We store it in the tag set, we don't need it in the hardware queue.
While removing cmd_size, place ->queue_num further down to avoid
a hole on 64-bit archs. It's not used in any fast paths, so we
can safely move it.

Signed-off-by: Jens Axboe <axboe@fb.com>


# c761d96b 02-Jan-2015 Jens Axboe <axboe@fb.com>

blk-mq: export blk_mq_freeze_queue()

Commit b4c6a028774b exported the start and unfreeze, but we need
the regular blk_mq_freeze_queue() for the loop conversion.

Signed-off-by: Jens Axboe <axboe@fb.com>


# aed3ea94 22-Dec-2014 Jens Axboe <axboe@fb.com>

block: wake up waiters when a queue is marked dying

If it's dying, we can't expect new request to complete and come
in an wake up other tasks waiting for requests. So after we
have marked it as dying, wake up everybody currently waiting
for a request. Once they wake, they will retry their allocation
and fail appropriately due to the state of the queue.

Tested-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b4c6a028 19-Dec-2014 Keith Busch <kbusch@kernel.org>

blk-mq: Export freeze/unfreeze functions

Let drivers prevent entering a queue that isn't available.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c76541a9 19-Dec-2014 Keith Busch <kbusch@kernel.org>

blk-mq: Exit queue on alloc failure

Fixes usage counter when a request could not be allocated.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 19c66e59 03-Dec-2014 Ming Lei <ming.lei@canonical.com>

blk-mq: prevent unmapped hw queue from being scheduled

When one hardware queue has no mapped software queues, it
shouldn't have been scheduled. Otherwise WARNING or OOPS
can triggered.

blk_mq_hw_queue_mapped() helper is introduce for fixing
the problem.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6637fadf 30-Nov-2014 Shaohua Li <shli@kernel.org>

blk-mq: move the kdump check to blk_mq_alloc_tag_set

We call blk_mq_alloc_tag_set() first then blk_mq_init_queue(). The requests are
allocated in the former function. So the kdump check should be moved to there
to really save memory.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b657d7e6 24-Nov-2014 Christoph Hellwig <hch@lst.de>

blk-mq: handle the single queue case in blk_mq_hctx_next_cpu

Don't duplicate the code to handle the not cpu bounce case in the
caller, do it inside blk_mq_hctx_next_cpu instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7c7f2f2b 17-Nov-2014 Jens Axboe <axboe@fb.com>

blk-mq: add blk_mq_free_hctx_request()

It's silly to use blk_mq_free_request() which in turn maps the
request to the hardware queue, for places where we already know
what the hardware queue is. This saves us an extra mapping of a
hardware queue on request completion, if the caller knows this
information already.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 1a3b595a 17-Nov-2014 Jens Axboe <axboe@fb.com>

blk-mq: export blk_mq_free_request()

Drivers that know they are blk-mq should just use this function
instead of calling through blk_put_request().

Signed-off-by: Jens Axboe <axboe@fb.com>


# 205fb5f5 30-Oct-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: add blk_mq_unique_tag()

The queuecommand() callback functions in SCSI low-level drivers
need to know which hardware context has been selected by the
block layer. Since this information is not available in the
request structure, and since passing the hctx pointer directly to
the queuecommand callback function would require modification of
all SCSI LLDs, add a function to the block layer that allows to
query the hardware context index.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 2a90d4aa 07-Nov-2014 Paolo Bonzini <pbonzini@redhat.com>

blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable

blk-mq is using preempt_disable/enable in order to ensure that the
queue runners are placed on the right CPU. This does not work with
the RT patches, because __blk_mq_run_hw_queue takes a non-raw
spinlock with the preemption-disabled region. If there is contention
on the lock, this violates the rules for preemption-disabled regions.

While this should be easily fixable within the RT patches just by doing
migrate_disable/enable, we can do better and document _why_ this
particular region runs with disabled preemption. After the previous
patch, it is trivial to switch it to get/put_cpu; the RT patches then
can change it to get_cpu_light, which lets virtio-blk run under RT
kernels.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 398205b8 07-Nov-2014 Paolo Bonzini <pbonzini@redhat.com>

blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if needed

preempt_disable/enable surrounds every call to blk_mq_run_hw_queue,
except the one in blk-flush.c. In fact that one is always asynchronous,
and it does not need smp_processor_id().

We can do the same for all other calls, avoiding preempt_disable when
async is true. This avoids peppering blk-mq.c with preemption-disabled
regions.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f3af020b 04-Nov-2014 Tejun Heo <tj@kernel.org>

blk-mq: make mq_queue_reinit_notify() freeze queues in parallel

q->mq_usage_counter is a percpu_ref which is killed and drained when
the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit()
which involves freezing the queue is invoked on all existing queues.
Because percpu_ref killing and draining involve a RCU grace period,
doing the above on one queue after another may take a long time if
there are many queues on the system.

This patch splits out initiation of freezing and waiting for its
completion, and updates blk_mq_queue_reinit_notify() so that the
queues are frozen in parallel instead of one after another. Note that
freezing and unfreezing are moved from blk_mq_queue_reinit() to
blk_mq_queue_reinit_notify().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e167dfb5 29-Oct-2014 Jens Axboe <axboe@fb.com>

blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag

Drivers can now tell blk-mq if they take advantage of the deferred
issue through 'last' or not. If they do, don't do queue-direct
for sync IO. This is a preparation patch for the nvme conversion.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 74c45052 29-Oct-2014 Jens Axboe <axboe@fb.com>

blk-mq: add a 'list' parameter to ->queue_rq()

Since we have the notion of a 'last' request in a chain, we can use
this to have the hardware optimize the issuing of requests. Add
a list_head parameter to queue_rq that the driver can use to
temporarily store hw commands for issue when 'last' is true. If we
are doing a chain of requests, pass in a NULL list for the first
request to force issue of that immediately, then batch the remainder
for deferred issue until the last request has been sent.

Instead of adding yet another argument to the hot ->queue_rq path,
encapsulate the passed arguments in a blk_mq_queue_data structure.
This is passed as a constant, and has been tested as faster than
passing 4 (or even 3) args through ->queue_rq. Update drivers for
the new ->queue_rq() prototype. There are no functional changes
in this patch for drivers - if they don't use the passed in list,
then they will just queue requests individually like before.

Signed-off-by: Jens Axboe <axboe@fb.com>


# a86073e4 13-Oct-2014 Jens Axboe <axboe@fb.com>

blk-mq: allocate cpumask on the home node

All other allocs are done on the specific node, somehow the
cpumask for hw queue runs was missed. Fix that by using
zalloc_cpumask_var_node() in blk_mq_init_queue().

Signed-off-by: Jens Axboe <axboe@fb.com>


# f70ced09 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

blk-mq: support per-distpatch_queue flush machinery

This patch supports to run one single flush machinery for
each blk-mq dispatch queue, so that:

- current init_request and exit_request callbacks can
cover flush request too, then the buggy copying way of
initializing flush request's pdu can be fixed

- flushing performance gets improved in case of multi hw-queue

In fio sync write test over virtio-blk(4 hw queues, ioengine=sync,
iodepth=64, numjobs=4, bs=4K), it is observed that througput gets
increased a lot over my test environment:
- throughput: +70% in case of virtio-blk over null_blk
- throughput: +30% in case of virtio-blk over SSD image

The multi virtqueue feature isn't merged to QEMU yet, and patches for
the feature can be found in below tree:

git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

And simply passing 'num_queues=4 vectors=5' should be enough to
enable multi queue(quad queue) feature for QEMU virtio-blk.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e97c293c 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

block: introduce 'blk_mq_ctx' parameter to blk_get_flush_queue

This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(),
so that this function can find the corresponding blk_flush_queue
bound with current mq context since the flush queue will become
per hw-queue.

For legacy queue, the parameter can be simply 'NULL'.

For multiqueue case, the parameter should be set as the context
from which the related request is originated. With this context
info, the hw queue and related flush queue can be found easily.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ba483388 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

block: remove blk_init_flush() and its pair

Now mission of the two helpers is over, and just call
blk_alloc_flush_queue() and blk_free_flush_queue() directly.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7c94e1c1 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

block: introduce blk_flush_queue to drive flush machinery

This patch introduces 'struct blk_flush_queue' and puts all
flush machinery related fields into this structure, so that

- flush implementation details aren't exposed to driver
- it is easy to convert to per dispatch-queue flush machinery

This patch is basically a mechanical replacement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f3552655 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

block: introduce blk_init_flush and its pair

These two temporary functions are introduced for holding flush
initialization and de-initialization, so that we can
introduce 'flush queue' easier in the following patch. And
once 'flush queue' and its allocation/free functions are ready,
they will be removed for sake of code readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1bcb1ead 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

blk-mq: allocate flush_rq in blk_mq_init_flush()

It is reasonable to allocate flush req in blk_mq_init_flush().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 08e98fc6 25-Sep-2014 Ming Lei <ming.lei@canonical.com>

blk-mq: handle failure path for initializing hctx

Failure of initializing one hctx isn't handled, so this patch
introduces blk_mq_init_hctx() and its pair to handle it explicitly.
Also this patch makes code cleaner.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 17497acb 24-Sep-2014 Tejun Heo <tj@kernel.org>

blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take above ten seconds when
scsi-mq is used.

[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

<stall>

[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.

blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration. percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose. This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.

Note that this issue was previously worked around by 0a30288da1ae
("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
probe") for v3.17. The temp fix was reverted in preparation of adding
persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq,
percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
This patch and the prerequisite percpu_ref changes will be merged
during v3.18 devel cycle.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>


# 2aad2a86 24-Sep-2014 Tejun Heo <tj@kernel.org>

percpu_ref: add PERCPU_REF_INIT_* flags

With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>


# 9eca8046 24-Sep-2014 Tejun Heo <tj@kernel.org>

Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"

This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which
was a temporary fix for SCSI blk-mq stall issue. The following
patches will fix the issue properly by introducing atomic mode to
percpu_ref.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>


# 0a30288d 23-Sep-2014 Tejun Heo <tj@kernel.org>

blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take more than ten seconds
when scsi-mq is used.

This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment). As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it. This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 90415837 22-Sep-2014 Christoph Hellwig <hch@lst.de>

block: fix blk_abort_request on blk-mq

Signed-off-by: Christoph Hellwig <hch@lst.de>

Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header.

Signed-off-by: Jens Axboe <axboe@fb.com>


# aedcd72f 17-Sep-2014 Jens Axboe <axboe@fb.com>

blk-mq: limit memory consumption if a crash dump is active

It's not uncommon for crash dump kernels to be limited to 128MB or
something low in that area. This is normally not a problem for
devices as we don't use that much memory, but for some shared SCSI
setups with huge queue depths, it can potentially fill most of
memory with tons of request allocations. blk-mq does scale back
when it fails to allocate memory, but it scales back just enough
so that blk-mq succeeds. This could still leave the system with
not enough memory to make any real progress.

Check if we are in a kdump environment and limit the hardware
queues and tag depth.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 2edd2c74 17-Sep-2014 Ming Lei <ming.lei@canoical.com>

blk-mq: remove unnecessary blk_clear_rq_complete()

This patch removes two unnecessary blk_clear_rq_complete(),
the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(),
so:

- The blk_clear_rq_complete() in blk_flush_restore_request()
needn't because the request will be freed later, and clearing
it here may open a small race window with timeout.

- The blk_clear_rq_complete() in blk_mq_requeue_request() isn't
necessary too, even though REQ_ATOM_STARTED is cleared in
__blk_mq_requeue_request(), in theory it still may cause a small
race window with timeout since the two clear_bit() may be
reordered.

Signed-off-by: Ming Lei <ming.lei@canoical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0152fb6b 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: pass a reserved argument to the timeout handler

Allow blk-mq to pass an argument to the timeout handler to indicate
if we're timing out a reserved or regular command. For many drivers
those need to be handled different.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 46f92d42 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: unshared timeout handler

Duplicate the (small) timeout handler in blk-mq so that we can pass
arguments more easily to the driver timeout handler. This enables
the next patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 81481eb4 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: fix and simplify tag iteration for the timeout handler

Don't do a kmalloc from timer to handle timeouts, chances are we could be
under heavy load or similar and thus just miss out on the timeouts.
Fortunately it is very easy to just iterate over all in use tags, and doing
this properly actually cleans up the blk_mq_busy_iter API as well, and
prepares us for the next patch by passing a reserved argument to the
iterator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c8a446ad 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: rename blk_mq_end_io to blk_mq_end_request

Now that we've changed the driver API on the submission side use the
opportunity to fix up the name on the completion side to fit into the
general scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e2490073 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: call blk_mq_start_request from ->queue_rq

When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.

Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bf572297 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: remove REQ_END

Pass an explicit parameter for the last request in a batch to ->queue_rq
instead of using a request flag. Besides being a cleaner and non-stateful
interface this is also required for the next patch, which fixes the blk-mq
I/O submission code to not start a time too early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8b957415 19-Sep-2014 Jens Axboe <axboe@fb.com>

blk-mq: use blk_mq_start_hw_queues() when running requeue work

When requests are retried due to hw or sw resource shortages,
we often stop the associated hardware queue. So ensure that we
restart the queues when running the requeue work, otherwise the
queue run will be a no-op.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 6b55e1f2 19-Sep-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix potential oops on out-of-memory in __blk_mq_alloc_rq_maps()

__blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale
back the queue depth if we are low on memory. So don't clear
set->tags when we fail, this is handled directly in
the parent function, blk_mq_alloc_tag_set().

Reported-by: Robert Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a57a178a 16-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: avoid infinite recursion with the FUA flag

We should not insert requests into the flush state machine from
blk_mq_insert_request. All incoming flush requests come through
blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
should only be called for BLOCK_PC requests. All other callers
deal with requests that already went through the flush statemchine
and shouldn't be reinserted into it.

Reported-by: Robert Elliott <Elliott@hp.com>
Debugged-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 683d0e12 18-Sep-2014 David Hildenbrand <dahi@linux.vnet.ibm.com>

blk-mq: Avoid race condition with uninitialized requests

This patch should fix the bug reported in
https://lkml.org/lkml/2014/9/11/249.

We have to initialize at least the atomic_flags and the cmd_flags when
allocating storage for the requests.

Otherwise blk_mq_timeout_check() might dereference uninitialized
pointers when racing with the creation of a request.

Also move the reset of cmd_flags for the initializing code to the point
where a request is freed. So we will never end up with pending flush
request indicators that might trigger dereferences of invalid pointers
in blk_mq_timeout_check().

Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reported-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com>
Tested-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 538b7534 16-Sep-2014 Jens Axboe <axboe@fb.com>

blk-mq: request deadline must be visible before marking rq as started

When we start the request, we set the deadline and flip the bits
marking the request as started and non-complete. However, it's
important that the deadline store is ordered before flipping the
bits, otherwise we could have a small window where the request is
marked started but with an invalid deadline. This can confuse the
timeout handling.

Suggested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a5164405 10-Sep-2014 Jens Axboe <axboe@fb.com>

blk-mq: scale depth and rq map appropriate if low on memory

If we are running in a kdump environment, resources are scarce.
For some SCSI setups with a huge set of shared tags, we run out
of memory allocating what the drivers is asking for. So implement
a scale back logic to reduce the tag depth for those cases, allowing
the driver to successfully load.

We should extend this to detect low memory situations, and implement
a sane fallback for those (1 queue, 64 tags, or something like that).

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a34375ef 07-Sep-2014 Tejun Heo <tj@kernel.org>

percpu-refcount: add @gfp to percpu_ref_init()

Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.

This patch doesn't make any functional difference.

v2: blk-mq conversion was missing. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>


# 5676e7b6 02-Sep-2014 Robert Elliott <elliott@hp.com>

blk-mq: cleanup after blk_mq_init_rq_map failures

In blk-mq.c blk_mq_alloc_tag_set, if:
set->tags = kmalloc_node()
succeeds, but one of the blk_mq_init_rq_map() calls fails,
goto out_unwind;
needs to free set->tags so the caller is not obligated
to do so. None of the current callers (null_blk,
virtio_blk, virtio_blk, or the forthcoming scsi-mq)
do so.

set->tags needs to be set to NULL after doing so,
so other tag cleanup logic doesn't try to free
a stale pointer later. Also set it to NULL
in blk_mq_free_tag_set.

Tested with error injection on the forthcoming
scsi-mq + hpsa combination.

Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a492f075 28-Aug-2014 Joe Lawrence <joe.lawrence@stratus.com>

block,scsi: fixup blk_get_request dead queue scenarios

The blk_get_request function may fail in low-memory conditions or during
device removal (even if __GFP_WAIT is set). To distinguish between these
errors, modify the blk_get_request call stack to return the appropriate
ERR_PTR. Verify that all callers check the return status and consider
IS_ERR instead of a simple NULL pointer check.

For consistency, make a similar change to the blk_mq_alloc_request leg
of blk_get_request. It may fail if the queue is dead, or the caller was
unwilling to wait.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd]
Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6f4a1626 22-Aug-2014 Tony Battersby <tonyb@cybernetics.com>

scsi-mq: fix requests that use a separate CDB buffer

This patch fixes code such as the following with scsi-mq enabled:

rq = blk_get_request(...);
blk_rq_set_block_pc(rq);

rq->cmd = my_cmd_buffer; /* separate CDB buffer */

blk_execute_rq_nowait(...);

Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for
large CDBs only). Without this patch, scsi_mq_prep_fn() will set
rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# cddd5d17 16-Aug-2014 Tejun Heo <tj@kernel.org>

blk-mq: blk_mq_freeze_queue() should allow nesting

While converting to percpu_ref for freezing, add703fda981 ("blk-mq:
use percpu_ref for mq usage count") incorrectly made
blk_mq_freeze_queue() misbehave when freezing is nested due to
percpu_ref_kill() being invoked on an already killed ref.

Fix it by making blk_mq_freeze_queue() kill and kick the queue only
for the outermost freeze attempt. All the nested ones can simply wait
for the ref to reach zero.

While at it, remove unnecessary @wake initialization from
blk_mq_unfreeze_queue().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a68aafa5 15-Aug-2014 Jens Axboe <axboe@fb.com>

blk-mq: correct a few wrong/bad comments

Just grammar or spelling errors, nothing major.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 274a5843 15-Aug-2014 Jens Axboe <axboe@fb.com>

blk-mq: don't allow merges if turned off for the queue

blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time,
to determine whether it should merge IO or not. However, this could
also be disabled by the admin, if merging is switched off through
sysfs. So check the general queue state as well before attempting
to merge IO.

Reported-by: Rob Elliott <Elliott@hp.com>
Tested-by: Rob Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# dd840087 15-Aug-2014 Ming Lei <ming.lei@canonical.com>

blk-mq: fix WARNING "percpu_ref_kill() called more than once!"

Before doing queue release, the queue has been freezed already
by blk_cleanup_queue(), so needn't to freeze queue for deleting
tag set.

This patch fixes the WARNING of "percpu_ref_kill() called more than once!"
which is triggered during unloading block driver.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# add703fd 01-Jul-2014 Tejun Heo <tj@kernel.org>

blk-mq: use percpu_ref for mq usage count

Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight. The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.

This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq. percpu_ref can serve the same purpose after the recent
changes. This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.

blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put. blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 72d6f02a 01-Jul-2014 Tejun Heo <tj@kernel.org>

blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()

Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything and it's gonna be further simplified. Let's flatten it into
its caller.

This patch doesn't make any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 780db207 01-Jul-2014 Tejun Heo <tj@kernel.org>

blk-mq: decouble blk-mq freezing from generic bypassing

blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order. This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.

However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones. This causes problems as bypassing assumes that request
processing is online. blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.

Another weirdity is that except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones. This shouldn't break
anything but again highlights that something isn't quite right here.

The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms. The only intersecting purpose
that they serve is during queue cleanup. Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.

* request_queue->mq_freeze_depth is added and
blk_mq_[un]freeze_queue() now operate on this counter instead of
->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
but the counter is tested directly. This will be further updated by
later changes.

* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
blk_mq_freeze_queue(). Queue cleanup path now calls
blk_mq_freeze_queue() directly.

* blk_queue_enter()'s fast path condition is simplified to simply
check @q->mq_freeze_depth. Previously, the condition was

!blk_queue_dying(q) &&
(!blk_queue_bypass(q) || !blk_queue_init_done(q))

mq_freeze_depth is incremented right after dying is set and
blk_queue_init_done() exception isn't necessary as blk-mq doesn't
start frozen, which only leaves the blk_queue_bypass() test which
can be replaced by @q->mq_freeze_depth test.

This change simplifies the code and reduces confusion in the area.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 776687bc 01-Jul-2014 Tejun Heo <tj@kernel.org>

block, blk-mq: draining can't be skipped even if bypass_depth was non-zero

Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero. The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining. The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.

Fix it by draining regardless of bypass_depth. We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot. INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip inbetween.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 531ed626 18-Jun-2014 Tejun Heo <tj@kernel.org>

blk-mq: fix a memory ordering bug in blk_mq_queue_enter()

blk-mq uses a percpu_counter to keep track of how many usages are in
flight. The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.

blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
if freezing happens inbetween the caller can slip through and freezing
can be complete while there are active users.

Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0ffbce80 25-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()

Currently it calls __blk_mq_run_hw_queue(), which depends on the
CPU placement being correct. This means it's not possible to call
blk_mq_start_hw_queues(q) from a context that is correct for all
queues, leading to triggering the

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

in __blk_mq_run_hw_queue().

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 95ed0681 13-Jun-2014 Christoph Hellwig <hch@lst.de>

blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8f5280f4 13-Jun-2014 Christoph Hellwig <hch@lst.de>

blk-mq: properly drain stopped queues

If we need to drain a queue we need to run all queues, even if they
are marked stopped to make sure the driver has a chance to error out
on all queued requests.

This fixes surprise removal with scsi-mq.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>

Signed-off-by: Jens Axboe <axboe@fb.com>


# 2b8393b4 09-Jun-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: add timer in blk_mq_start_request

This way will become consistent with non-mq case, also
avoid to update rq->deadline twice for mq.

The comment said: "We do this early, to ensure we are on
the right CPU.", but no percpu stuff is used in blk_add_timer(),
so it isn't necessary. Even when inserting from plug list, there
is no such guarantee at all.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3ee32372 09-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: always initialize request->start_time

The blk-mq core only initializes this if io stats are enabled, since
blk-mq only reads the field in that case. But drivers could
potentially use it internally, so ensure that we always set it to
the current time when the request is allocated.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f6be4fb4 06-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()

It'll be used in blk_mq_start_request() to set a potential timeout
for the request, so clear it to zero at alloc time to ensure that
we know if someone has set it or not.

Fixes random early timeouts on NVMe testing.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 3b632cf0 06-Jun-2014 Keith Busch <kbusch@kernel.org>

blk-mq: don't allow queue entering for a dying queue

If the queue is going away, don't let new allocs or queueing
happen on it. Go through the normal wait process, and exit with
ENODEV in that case.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a4391c64 05-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: bump max tag depth to 10K tags

For some scsi-mq cases, the tag map can be huge. So increase the
max number of tags we support.

Additionally, don't fail with EINVAL if a user requests too many
tags. Warn that the tag depth has been adjusted down, and store
the new value inside the tag_set passed in.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 0e62f51f 04-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameter

We currently pass in the hardware queue, and get the tags from there.
But from scsi-mq, with a shared tag space, it's a lot more convenient
to pass in the blk_mq_tags instead as the hardware queue isn't always
directly available. So instead of having to re-map to a given
hardware queue from rq->mq_ctx, just pass in the tags structure.

Signed-off-by: Jens Axboe <axboe@fb.com>


# f899fed4 04-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix regression from commit 624dbe475416

When the code was collapsed to avoid duplication, the recent patch
for ensuring that a queue is idled before free was dropped, which was
added by commit 19c5d84f14d2.

Add back the blk_mq_tag_idle(), to ensure we don't leak a reference
to an active queue when it is freed.

Signed-off-by: Jens Axboe <axboe@fb.com>


# ff87bcec 03-Jun-2014 Jens Axboe <axboe@fb.com>

blk-mq: handle NULL req return from blk_map_request in single queue mode

blk_mq_map_request() can return NULL if we fail entering the queue
(dying, or removed), in which case it has already ended IO on the
bio. So nothing more to do, except just return.

Signed-off-by: Jens Axboe <axboe@fb.com>


# e6cdb092 02-Jun-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix sparse warning on missed __percpu annotation

'struct blk_mq_ctx' is __percpu, so add the annotation
and fix the sparse warning reported from Fengguang:

[block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect
type in initializer (different address spaces)

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# cb96a42c 31-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix schedule from atomic context

blk_mq_put_ctx() has to be called before io_schedule() in
bt_get().

This patch fixes the problem by taking similar approach from
percpu_ida allocation for the situation.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1aecfe48 31-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: move blk_mq_get_ctx/blk_mq_put_ctx to mq private header

The blk-mq tag code need these helpers.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ed851860 30-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: push IPI or local end_io decision to __blk_mq_complete_request()

We have callers outside of the blk-mq proper (like timeouts) that
want to call __blk_mq_complete_request(), so rename the function
and put the decision code for whether to use ->softirq_done_fn
or blk_mq_endio() into __blk_mq_complete_request().

This also makes the interface more logical again.
blk_mq_complete_request() attempts to atomically mark the request
completed, and calls __blk_mq_complete_request() if successful.
__blk_mq_complete_request() then just ends the request.

Signed-off-by: Jens Axboe <axboe@fb.com>


# feff6894 30-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: remember to start timeout handler for direct queue

Commit 07068d5b8e added a direct-to-hw-queue mode, but this mode
needs to remember to add the request timeout handler as well.
Without it, we don't track timeouts for these requests.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 67aec14c 30-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: make the sysfs mq/ layout reflect current mappings

Currently blk-mq registers all the hardware queues in sysfs,
regardless of whether it uses them (e.g. they have CPU mappings)
or not. The unused hardware queues lack the cpux/ directories,
and the other sysfs entries (like active, pending, etc) are all
zeroes.

Change this so that sysfs correctly reflects the current mappings
of the hardware queues.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 22302375 30-May-2014 Shaohua Li <shli@kernel.org>

blk-mq: blk_mq_tag_to_rq should handle flush request

flush request is special, which borrows the tag from the parent
request. Hence blk_mq_tag_to_rq needs special handling to return
the flush request from the tag.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4b570521 29-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: request initialization optimizations

We currently clear a lot more than we need to, so make that a bit
more clever. Make some of the init dependent on features, like
only setting start_time if we are going to use it.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 05f1dd53 29-May-2014 Jens Axboe <axboe@fb.com>

block: add queue flag for disabling SG merging

If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.

Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.

Signed-off-by: Jens Axboe <axboe@fb.com>


# cdef54dd 28-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: remove alloc_hctx and free_hctx methods

There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 75bb4625 28-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: add file comments and update copyright notices

None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.

Signed-off-by: Jens Axboe <axboe@fb.com>


# d852564f 27-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_alloc_request_pinned

We now only have one caller left and can open code it there in a cleaner
way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 793597a6 27-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request

We already do a non-blocking allocation in blk_mq_map_request, no need
to repeat it. Just call __blk_mq_alloc_request to wait directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a3bd7756 27-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_wait_for_tags

The current logic for blocking tag allocation is rather confusing, as we
first allocated and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.

Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5dee8577 27-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: initialize request in __blk_mq_alloc_request

Both callers if __blk_mq_alloc_request want to initialize the request, so
lift it into the common path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4ce01dd1 27-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request

Instead of having two almost identical copies of the same code just let
the callers pass in the reserved flag directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6fca6a61 28-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: add helper to insert requests from irq context

Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes. Replace them with a per-request_queue list of requests to
be requeued.

Based on an earlier test by Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 95f09684 27-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: allow non-softirq completions

Right now we export two ways of completing a request:

1) blk_mq_complete_request(). This uses an IPI (if needed) and
completes through q->softirq_done_fn(). It also works with
timeouts.

2) blk_mq_end_io(). This completes inline, and ignores any timeout
state of the request.

Let blk_mq_complete_request() handle non-softirq_done_fn completions
as well, by just completing inline. If a driver has enough completion
ports to place completions correctly, it need not define a
mq_ops->complete() and we can avoid an indirect function call by
doing the completion inline.

Signed-off-by: Jens Axboe <axboe@fb.com>


# f14bbe77 27-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: pass in suggested NUMA node to ->alloc_hctx()

Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.

So just pass in the suggested node directly to the alloc
function.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 3d2936f4 27-May-2014 Ming Lei <tom.leiming@gmail.com>

block: only allocate/free mq_usage_counter in blk-mq

The percpu counter is only used for blk-mq, so move
its allocation and free inside blk-mq, and don't
allocate it for legacy queue device.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 624dbe47 27-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: avoid code duplication

blk_mq_exit_hw_queues() and blk_mq_free_hw_queues()
are introduced to avoid code duplication.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1f9f07e9 27-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix leak of hctx->ctx_map

hctx->ctx_map should have been freed inside blk_mq_free_queue().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 19c5d84f 26-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: idle all hardware contexts before freeing a queue

Without this we can leak the active_queues reference if a command is
freed while it is considered active.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c22d9d8a 23-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: allow setting of per-request timeouts

Currently blk-mq uses the queue timeout for all requests. But
for some commands, drivers may want to set a specific timeout
for special requests. Allow this to be passed in through
request->timeout, and use it if set.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 07068d5b 22-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: split make request handler for multi and single queue

We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.

Split blk_mq_make_request() into a blk_sq_make_request() for single
queue devices, and retain blk_mq_make_request() for multi queue
devices. Then we don't need multiple checks for q->nr_hw_queues
in the request mapping.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 484b4061 21-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: save memory by freeing requests on unused hardware queues

Depending on the topology of the machine and the number of queues
exposed by a device, we can end up in a situation where some of
the hardware queues are unused (as in, they don't map to any
software queues). For this case, free up the memory used by the
request map, as we will not use it. This can be a substantial
amount of memory, depending on the number of queues vs CPUs and
the queue depth of the device.

Signed-off-by: Jens Axboe <axboe@fb.com>


# e814e71b 21-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: allow the hctx cpu hotplug notifier to return errors

Prepare this for the next patch which adds more smarts in the
plugging logic, so that we can save some memory.

Signed-off-by: Jens Axboe <axboe@fb.com>


# da41a589 20-May-2014 Robert Elliott <elliott@hp.com>

blk-mq: Micro-optimize blk_queue_nomerges() check

In blk_mq_make_request(), do the blk_queue_nomerges() check
outside the call to blk_attempt_plug_merge() to eliminate
function call overhead when nomerges=2 (disabled)

Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# eba71768 20-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: initialize q->nr_requests after calling blk_queue_make_request()

blk_queue_make_requests() overwrites our set value for q->nr_requests,
turning it into the default of 128. Set this appropriately after
initializing queue values in blk_queue_make_request().

Signed-off-by: Jens Axboe <axboe@fb.com>


# e3a2b3f9 20-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: allow changing of queue depth through sysfs

For request_fn based devices, the block layer exports a 'nr_requests'
file through sysfs to allow adjusting of queue depth on the fly.
Currently this returns -EINVAL for blk-mq, since it's not wired up.
Wire this up for blk-mq, so that it now also always dynamic
adjustments of the allowed queue depth for any given block device
managed by blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 1429d7c9 19-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: switch ctx pending map to the sparser blk_align_bitmap

Each hardware queue has a bitmap of software queues with pending
requests. When new IO is queued on a software queue, the bit is
set, and when IO is pruned on a hardware queue run, the bit is
cleared. This causes a lot of traffic. Switch this from the regular
BITS_PER_LONG bitmap to a sparser layout, similarly to what was
done for blk-mq tagging.

20% performance increase was observed for single threaded IO, and
about 15% performanc increase on multiple threads driving the
same device.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 0d2602ca 13-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: improve support for shared tags maps

This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.

If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.

The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.

Signed-off-by: Jens Axboe <axboe@fb.com>


# cf4b50af 09-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix race in IO start accounting

Commit c6d600c6 opened up a small race where we could attempt to
account IO completion on a request, racing with IO start accounting.
Fix this up by ensuring that we've accounted for IO start before
inserting the request.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 4bb659b1 09-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: implement new and more efficient tagging scheme

blk-mq currently uses percpu_ida for tag allocation. But that only
works well if the ratio between tag space and number of CPUs is
sufficiently high. For most devices and systems, that is not the
case. The end result if that we either only utilize the tag space
partially, or we end up attempting to fully exhaust it and run
into lots of lock contention with stealing between CPUs. This is
not optimal.

This new tagging scheme is a hybrid bitmap allocator. It uses
two tricks to both be SMP friendly and allow full exhaustion
of the space:

1) We cache the last allocated (or freed) tag on a per blk-mq
software context basis. This allows us to limit the space
we have to search. The key element here is not caching it
in the shared tag structure, otherwise we end up dirtying
more shared cache lines on each allocate/free operation.

2) The tag space is split into cache line sized groups, and
each context will start off randomly in that space. Even up
to full utilization of the space, this divides the tag users
efficiently into cache line groups, avoiding dirtying the same
one both between allocators and between allocator and freeer.

This scheme shows drastically better behaviour, both on small
tag spaces but on large ones as well. It has been tested extensively
to show better performance for all the cases blk-mq cares about.

Signed-off-by: Jens Axboe <axboe@fb.com>


# af76e555 05-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: initialize struct request fields individually

This allows us to avoid a non-atomic memset over ->atomic_flags as well
as killing lots of duplicate initializations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9fccfed8 08-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: update a hotplug comment for grammar

Signed-off-by: Jens Axboe <axboe@fb.com>


# 506e931f 07-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: add basic round-robin of what CPU to queue workqueue work on

Right now we just pick the first CPU in the mask, but that can
easily overload that one. Add some basic batching and round-robin
all the entries in the mask instead.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 74814b1c 02-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: remove extra requeue trace

We already issue a blktrace requeue event in
__blk_mq_requeue_request(), don't do it from the original caller
as well.

Signed-off-by: Jens Axboe <axboe@fb.com>


# c6d600c6 30-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: refactor request insertion/merging

Refactor the logic around adding a new bio to a software queue,
so we nest the ctx->lock where we really need it (merge and
insertion) and don't hold it when we don't (init and IO start
accounting).

Signed-off-by: Jens Axboe <axboe@fb.com>


# 98bc1f27 30-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq remove debug BUG_ON() when draining software queues

It's never been of any use, lets get rid of it.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 5810d903 29-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix waiting for reserved tags

blk_mq_wait_for_tags() is only able to wait for "normal" tags,
not reserved tags. Pass in which one we should attempt to get
a tag for, so that waiting for reserved tags will work.

Reserved tags are used for internal commands, which are usually
serialized. Hence no waiting generally takes place, but we should
ensure that it actually works if users need that functionality.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 38535201 25-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: respect rq_affinity

The blk-mq code is using it's own version of the I/O completion affinity
tunables, which causes a few issues:

- the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
still is present, thus breaking existing tuning setups.
- the rq_affinity = 1 mode, which is the defauly for legacy request based
drivers isn't implemented at all.
- blk-mq drivers don't implement any completion affinity with the default
flag settings.

This patches removes the blk-mq ipi_redirect flag and sysfs file, as well
as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that
respects the queue-wide rq_affinity flags and also implements the
rq_affinity = 1 mode.

This means I/O completion affinity can now only be tuned block-queue wide
instead of per context, which seems more sensible to me anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 87ee7b11 24-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix race with timeouts and requeue events

If a requeue event races with a timeout, we can get into the
situation where we attempt to complete a request from the
timeout handler when it's not start anymore. This causes a crash.
So have the timeout handler check that REQ_ATOM_STARTED is still
set on the request - if not, we ignore the event. If this happens,
the request has now been marked as complete. As a consequence, we
need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
as to maintain proper request state.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 70ab0b2d 24-Apr-2014 Jens Axboe <axboe@fb.com>

Revert "blk-mq: initialize req->q in allocation"

This reverts commit 6a3c8a3ac0e68dcfc2a01f4aa1ca0edd1a1701eb.

We need selective clearing of the request to make the init-at-free
time completely safe. Otherwise we end up stomping on
rq->atomic_flags, which we don't want to do.


# 981bd189 23-Apr-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix leak of set->tags

set->tags should be freed in blk_mq_free_tag_set().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6a3c8a3a 19-Apr-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: initialize req->q in allocation

The patch basically reverts the patch of(blk-mq:
initialize request on allocation) in Jens's tree(already
in -next), and only initialize req->q in allocation
for two reasons:

- presumed cache hotness on completion
- blk_rq_tagged(rq) depends on reset of req->mq_ctx

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4ca08500 19-Apr-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: user (1 << order) to implement order_to_size()

Cc: Jörg-Volker Peetz <jvpeetz@web.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 48479005 19-Apr-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix allocation of set->tags

type of set->tags is struct blk_mq_tags **.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 11471e0d 19-Apr-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: free hctx->ctx_map when init failed

Avoid memory leak in the failure path.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ed0791b2 16-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: add blk_mq_requeue_request

This allows to requeue a request that has been accepted by ->queue_rq
earlier. This is needed by the SCSI layer in various error conditions.

The existing internal blk_mq_requeue_request is renamed to
__blk_mq_requeue_request as it is a lower level building block for this
funtionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2f268556 16-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: add blk_mq_start_hw_queues

Add a helper to unconditionally kick contexts of a queue. This will
be needed by the SCSI layer to provide fair queueing between multiple
devices on a single host.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 70f4db63 16-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: add blk_mq_delay_queue

Add a blk-mq equivalent to blk_delay_queue so that the scsi layer can ask
to be kicked again after a delay.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified by me to kill the unnecessary preempt disable/enable
in the delayed workqueue handler.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 1b4a3258 16-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: add async parameter to blk_mq_start_stopped_hw_queues

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 91b63639 16-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: bidi support

Add two unlinkely branches to make sure the resid is initialized correctly
for bidi request pairs, and the second request gets properly freed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 63151a44 16-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: allow drivers to hook into I/O completion

Split out the bottom half of blk_mq_end_io so that drivers can perform
work when they know a request has been completed, but before it has been
freed. This also obsoletes blk_mq_end_io_partial as drivers can now
pass any value to blk_update_request directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6700a678 16-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: kill preempt disable/enable in blk_mq_work_fn()

blk_mq_work_fn() is always invoked off the bounded workqueues,
so it can happily preempt among the queues in that set without
causing any issues for blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>


# fd1270d5 16-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: don't use preempt_count() to check for right CPU

UP or CONFIG_PREEMPT_NONE will return 0, and what we really
want to check is whether or not we are on the right CPU.
So don't make PREEMPT part of this, just test the CPU in
the mask directly.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 24d2f903 15-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: split out tag initialization, support shared tags

Add a new blk_mq_tag_set structure that gets set up before we initialize
the queue. A single blk_mq_tag_set structure can be shared by multiple
queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modular export of blk_mq_{alloc,free}_tagset added by me.

Signed-off-by: Jens Axboe <axboe@fb.com>


# ed44832d 14-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: initialize request on allocation

If we want to share tag and request allocation between queues we cannot
initialize the request at init/free time, but need to initialize it
at allocation time as it might get used for different queues over its
lifetime.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e9b267d9 15-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: add ->init_request and ->exit_request methods

The current blk_mq_init_commands/blk_mq_free_commands interface has a
two problems:

1) Because only the constructor is passed to blk_mq_init_commands there
is no easy way to clean up when a comman initialization failed. The
current code simply leaks the allocations done in the constructor.

2) There is no good place to call blk_mq_free_commands: before
blk_cleanup_queue there is no guarantee that all outstanding
commands have completed, so we can't free them yet. After
blk_cleanup_queue the queue has usually been freed. This can be
worked around by grabbing an unconditional reference before calling
blk_cleanup_queue and dropping it after blk_mq_free_commands is
done, although that's not exatly pretty and driver writers are
guaranteed to get it wrong sooner or later.

Both issues are easily fixed by making the request constructor and
destructor normal blk_mq_ops methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8727af4b 14-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: make ->flush_rq fully transparent to drivers

Drivers shouldn't have to care about the block layer setting aside a
request to implement the flush state machine. We already override the
mq context and tag to make it more transparent, but so far haven't deal
with the driver private data in the request. Make sure to override this
as well, and while we're at it add a proper helper sitting in blk-mq.c
that implements the full impersonation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9d74e257 14-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: do not initialize req->special

Drivers can reach their private data easily using the blk_mq_rq_to_pdu
helper and don't need req->special. By not initializing it code can
be simplified nicely, and we also shave off a few more instructions from
the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 742ee69b 14-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: initialize resid_len

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e4043dcf 09-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: ensure that hardware queues are always run on the mapped CPUs

Instead of providing soft mappings with no guarantees on hardware
queues always being run on the right CPU, switch to a hard mapping
guarantee that ensure that we always run the hardware queue on
(one of, if more) the mapped CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 59c3d45e 08-Apr-2014 Jens Axboe <axboe@fb.com>

block: remove 'q' parameter from kblockd_schedule_*_work()

The queue parameter is never used, just get rid of it.

Signed-off-by: Jens Axboe <axboe@fb.com>


# bccb5f7c 04-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix potential stall during CPU unplug with IO pending

When a CPU is unplugged, we move the blk_mq_ctx request entries
to the current queue. The current code forgets to remap the
blk_mq_hw_ctx before marking the software context pending,
which breaks if old-cpu and new-cpu don't map to the same
hardware queue.

Additionally, if we mark entries as pending in the new
hardware queue, then make sure we schedule it for running.
Otherwise request could be sitting there until someone else
queues IO for that hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 27fbf4e8 19-Feb-2014 Shaohua Li <shli@kernel.org>

blk-mq: add REQ_SYNC early

Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init
is set correctly.

Signed-off-by: Shaohua Li<shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7237c740 20-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: support partial I/O completions

Add a new blk_mq_end_io_partial function to partially complete requests
as needed by the SCSI layer. We do this by reusing blk_update_request
to advance the bio instead of having a simplified version of it in
the blk-mq code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# eeabc850 21-Mar-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: merge blk_mq_insert_request and blk_mq_run_request

It's almost identical to blk_mq_insert_request, so fold the two into one
slightly more generic function by making the flush special case a bit
smarted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 081241e5 20-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: remove blk_mq_alloc_rq

There's only one caller, which is a straight wrapper and fits the naming
scheme of the related functions a lot better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 676141e4 20-Mar-2014 Jens Axboe <axboe@fb.com>

blk-mq: don't dump CPU -> hw queue map on driver load

Now that we are out of initial debug/bringup mode, remove
the verbose dump of the mapping table.

Provide the mapping table in sysfs, under the hardware queue
directory, in the cpu_list file.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 5d12f905 19-Mar-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix wrong usage of hctx->state vs hctx->flags

BLK_MQ_F_* flags are for hctx->flags, and are non-atomic and
set at registration time. BLK_MQ_S_* flags are dynamic and
atomic, and are accessed through hctx->state.

Some of the BLK_MQ_S_STOPPED uses were wrong. Additionally,
the header file should not use a bit shift for the _S_ flags,
as they are done through the set/test_bit functions.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 95363efd 14-Mar-2014 Jens Axboe <axboe@fb.com>

blk-mq: allow blk_mq_init_commands() to return failure

If drivers do dynamic allocation in the hardware command init
path, then we need to be able to handle and return failures.

And if they do allocations or mappings in the init command path,
then we need a cleanup function to free up that space at exit
time. So add blk_mq_free_commands() as the cleanup function.

This is required for the mtip32xx driver conversion to blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 739c3eea 19-Feb-2014 Shaohua Li <shli@kernel.org>

blk-mq: add REQ_SYNC early

Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init
is set correctly.

Signed-off-by: Shaohua Li<shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# af5040da 04-Mar-2014 Roman Pen <r.peniaev@gmail.com>

blktrace: fix accounting of partially completed requests

trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:

C R 232 + 240 [0]
C R 240 + 232 [0]
C R 248 + 224 [0]
C R 256 + 216 [0]

but should be:

C R 232 + 8 [0]
C R 240 + 8 [0]
C R 248 + 8 [0]
C R 256 + 8 [0]

Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.

This patch takes into account real completion size of the request and
fixes wrong completion accounting.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>


# c46fff2a 24-Feb-2014 Frederic Weisbecker <fweisbec@gmail.com>

smp: Rename __smp_call_function_single() to smp_call_function_single_async()

The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().

The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. This is actually a requirement that result from the
purpose of this function which is to raise an IPI asynchronously.

As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller who then needs to serialize the
IPIs on this csd.

Lets rename the function and enhance the comments so that they reflect
these properties.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# fce8ad15 24-Feb-2014 Frederic Weisbecker <fweisbec@gmail.com>

smp: Remove wait argument from __smp_call_function_single()

The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.

This flexibility comes at the expense of the caller who then needs to
synchronize the csd lifecycle by himself and make sure that IPIs on a
single csd are serialized.

This is how __smp_call_function_single() works when wait = 0 and this
usecase is relevant.

Now there don't seem to be any usecase with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Lets look
at the two possible scenario:

1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
in an object. It looks like a nice and convenient pattern at the first
sight because we can then retrieve the object from the IPI handler easily.

But actually it is a waste of memory space in the object since the csd
can be allocated from the stack by smp_call_function_single(wait = 1)
and the object can be passed an the IPI argument.

Besides that, embedding the csd in an object is more error prone
because the caller must take care of the serialization of the IPIs
for this csd.

2) The user calls __smp_call_function_single(wait = 1) on a csd that
is allocated on the stack. It's ok but smp_call_function_single()
can do it as well and it already takes care of the allocation on the
stack. Again it's more simple and less error prone.

Therefore, using the underscore prepend API version with wait = 1
is a bad pattern and a sign that the caller can do safer and more
simple.

There was a single user of that which has just been converted.
So lets remove this option to discourage further users.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d6a25b31 20-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: support partial I/O completions

Add a new blk_mq_end_io_partial function to partially complete requests
as needed by the SCSI layer. We do this by reusing blk_update_request
to advance the bio instead of having a simplified version of it in
the blk-mq code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# feb71dae 20-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: merge blk_mq_insert_request and blk_mq_run_request

It's almost identical to blk_mq_insert_request, so fold the two into one
slightly more generic function by making the flush special case a bit
smarted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# fd694131 20-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: remove blk_mq_alloc_rq

There's only one caller, which is a straight wrapper and fits the naming
scheme of the related functions a lot better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 49f5baa5 11-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: pair blk_mq_start_request / blk_mq_requeue_request

Make sure we have a proper pairing between starting and requeueing
requests. Move the dma drain and REQ_END setup into blk_mq_start_request,
and make sure blk_mq_requeue_request properly undoes them, giving us
a pair of function to prepare and unprepare a request without leaving
side effects.

Together this ensures we always clean up properly after
BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1e93b8c2 11-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq

rq->errors never has been part of the communication protocol between drivers
and the block stack and most drivers will not have initialized it.

Return -EIO to upper layers when the driver returns BLK_MQ_RQ_QUEUE_ERROR
unconditionally. If a driver want to return a different error it can easily
do so by returning success after calling blk_mq_end_io itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 18741986 10-Feb-2014 Christoph Hellwig <hch@lst.de>

blk-mq: rework flush sequencing logic

Witch to using a preallocated flush_rq for blk-mq similar to what's done
with the old request path. This allows us to set up the request properly
with a tag from the actually allowed range and ->rq_disk as needed by
some drivers. To make life easier we also switch to dynamic allocation
of ->flush_rq for the old path.

This effectively reverts most of

"blk-mq: fix for flush deadlock"

and

"blk-mq: Don't reserve a tag for flush request"

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 30a91cb4 10-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: rework I/O completions

Rework I/O completions to work more like the old code path. blk_mq_end_io
now stays out of the business of deferring completions to others CPUs
and calling blk_mark_rq_complete. The latter is very important to allow
completing requests that have timed out and thus are already marked completed,
the former allows using the IPI callout even for driver specific completions
instead of having to reimplement them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 14ec77f3 07-Feb-2014 Nicholas Bellinger <nab@linux-iscsi.org>

blk-mq: Add bio_integrity setup to blk_mq_make_request

This patch adds the missing bio_integrity_enabled() +
bio_integrity_prep() setup into blk_mq_make_request()
in order to use DIF protection with scsi-mq.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1be036e9 07-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: initialize sg_reserved_size

To behave the same way as the old request path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4f7f418c 07-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: handle dma_drain_size

Make blk-mq handle the dma_drain_size field the same way as the old request
path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 72a0a36e 07-Feb-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: support at_head inserations for blk_execute_rq

This is neede for proper SG_IO operation as well as various uses of
blk_execute_rq from the SCSI midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f0276924 30-Dec-2013 Shaohua Li <shli@kernel.org>

blk-mq: Don't reserve a tag for flush request

Reserving a tag (request) for flush to avoid dead lock is a overkill. A
tag is valuable resource. We can track the number of flush requests and
disallow having too many pending flush requests allocated. With this
patch, blk_mq_alloc_request_pinned() could do a busy nop (but not a dead
loop) if too many pending requests are allocated and new flush request
is allocated. But this should not be a problem, too many pending flush
requests are very rare case.

I verified this can fix the deadlock caused by too many pending flush
requests.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6753471c 08-Jan-2014 Dave Hansen <dave.hansen@linux.intel.com>

blk-mq: uses page->list incorrectly

'struct page' has two list_head fields: 'lru' and 'list'. Conveniently,
they are unioned together. This means that code can use them
interchangably, which gets horribly confusing.

The blk-mq made the logical decision to try to use page->list. But, that
field was actually introduced just for the slub code. ->lru is the right
field to use outside of slab/slub.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3d6efbf6 08-Jan-2014 Christoph Hellwig <hch@infradead.org>

blk-mq: use __smp_call_function_single directly

__smp_call_function_single already avoids multiple IPIs by internally
queing up the items, and now also is available for non-SMP builds as
a trivially correct stub, so there is no need to wrap it. If the
additional lock roundtrip cause problems my patch to convert the
generic IPI code to llists is waiting to get merged will fix it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0fec08b4 03-Jan-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix initializing request's start time

blk_rq_init() is called in req's complete handler to initialize
the request, so the members of start_time and start_time_ns might
become inaccurate when it is allocated in future.

The patch initializes the two members in blk_mq_rq_ctx_init() to
fix the problem.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3edcc0ce 26-Dec-2013 Ming Lei <tom.leiming@gmail.com>

block: blk-mq: don't export blk_mq_free_queue()

blk_mq_free_queue() is called from release handler of
queue kobject, so it needn't be called from drivers.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f04c1fe7 26-Dec-2013 Ming Lei <tom.leiming@gmail.com>

block: blk-mq: make blk_sync_queue support mq

This patch moves synchronization on mq->delay_work
from blk_mq_free_queue() to blk_sync_queue(), so that
blk_sync_queue can work on mq.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 43a5e4e2 26-Dec-2013 Ming Lei <tom.leiming@gmail.com>

block: blk-mq: support draining mq queue

blk_mq_drain_queue() is introduced so that we can drain
mq queue inside blk_cleanup_queue().

Also don't accept new requests any more if queue is marked
as dying.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0d11e6ac 05-Dec-2013 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix use-after-free of request

If accounting is on, we will do the IO completion accounting after
we have freed the request. Fix that by moving it sooner instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 959a35f1 03-Dec-2013 Jeff Moyer <jmoyer@redhat.com>

blk-mq: fix dereference of rq->mq_ctx if allocation fails

If __GFP_WAIT isn't set and we fail allocating, when we go
to drop the reference on the ctx, we will attempt to dereference
the NULL rq. Fix that.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f024f37 11-Oct-2013 Kent Overstreet <kmo@daterainc.com>

block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6


# 01b983c9 19-Nov-2013 Jens Axboe <axboe@kernel.dk>

blk-mq: add blktrace insert event trace

We need it to make 'btt' from blktrace happy, otherwise
we are missing one state transition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 94eddfbe 19-Nov-2013 Jens Axboe <axboe@kernel.dk>

blk-mq: ensure that we set REQ_IO_STAT so diskstats work

If disk stats are enabled on the queue, a request needs to
be marked with REQ_IO_STAT for accounting to be active on
that request. This fixes an issue with virtio-blk not
showing up in /proc/diskstats after the conversion to
blk-mq.

Add QUEUE_FLAG_MQ_DEFAULT, setting stats and same cpu-group
completion on by default.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a06ff06 14-Nov-2013 Christoph Hellwig <hch@infradead.org>

kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS

We've switched over every architecture that supports SMP to it, so
remove the new useless config variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f618ef7c 14-Nov-2013 Paul Gortmaker <paul.gortmaker@windriver.com>

blk-mq: remove newly added instances of __cpuinit

The new blk-mq code added new instances of __cpuinit usage.
We removed this a couple versions ago; we now want to remove
the compat no-op stubs. Introducing new users is not what
we want to see at this point in time, as it will break once
the stubs are gone.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 92f399c7 29-Oct-2013 Shaohua Li <shli@fusionio.com>

blk-mq: mq plug list breakage

We switched to plug mq_list for mq, but some code are still using old list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3228f48b 28-Oct-2013 Christoph Hellwig <hch@lst.de>

blk-mq: fix for flush deadlock

The flush state machine takes in a struct request, which then is
submitted multiple times to the underling driver. The old block code
requeses the same request for each of those, so it does not have an
issue with tapping into the request pool. The new one on the other hand
allocates a new request for each of the actualy steps of the flush
sequence. If have already allocated all of the tags for IO, we will
fail allocating the flush request.

Set aside a reserved request just for flushes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 280d45f6 25-Oct-2013 Christoph Hellwig <hch@lst.de>

blk-mq: add blk_mq_stop_hw_queues

Add a helper to iterate over all hw queues and stop them. This is useful
for driver that implement PM suspend functionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to just call blk_mq_stop_hw_queue() by Jens.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 320ae51f 24-Oct-2013 Jens Axboe <axboe@kernel.dk>

blk-mq: new multi-queue block IO queueing mechanism

Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.

- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.

- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>