History log of /linux-master/block/blk-mq-tag.c
Revision Date Author Comments
# 4f1731df 09-Jun-2023 Yu Kuai <yukuai3@huawei.com>

blk-mq: fix potential io hang by wrong 'wake_batch'

In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating
'wake_batch' is not atomic:

t1: t2:
_blk_mq_tag_busy blk_mq_tag_busy
inc active_queues
// assume 1->2
inc active_queues
// 2 -> 3
blk_mq_update_wake_batch
// calculate based on 3
blk_mq_update_wake_batch
/* calculate based on 2, while active_queues is actually 3. */

Fix this problem by protecting them wih 'tags->lock', this is not a hot
path, so performance should not be concerned. And now that all writers
are inside the lock, switch 'actives_queues' from atomic to unsigned
int.

Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3e94d54e 22-May-2023 Tian Lan <tian.lan@twosigma.com>

blk-mq: fix race condition in active queue accounting

If multiple CPUs are sharing the same hardware queue, it can
cause leak in the active queue counter tracking when __blk_mq_tag_busy()
is executed simultaneously.

Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 90110e04 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: include <linux/blk-mq.h> in block/blk-mq.h

block/blk-mq.h needs various definitions from <linux/blk-mq.h>,
include it there instead of relying on the source files to include
both.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bebe84eb 13-Apr-2023 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk-mq-tag.h

blk-mq-tag.h is always included by blk-mq.h, and causes recursive
inclusion hell with further changes. Just merge it into blk-mq.h
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4acb8341 09-Sep-2022 Keith Busch <kbusch@kernel.org>

sbitmap: fix batched wait_cnt accounting

Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bce1b56c 04-Sep-2022 Jens Axboe <axboe@kernel.dk>

Revert "sbitmap: fix batched wait_cnt accounting"

This reverts commit 16ede66973c84f890c03584f79158dd5b2d725f5.

This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.

Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 16ede669 25-Aug-2022 Keith Busch <kbusch@kernel.org>

sbitmap: fix batched wait_cnt accounting

Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4cf6e6c0 06-Jul-2022 John Garry <john.garry@huawei.com>

blk-mq: Drop local variable for reserved tag

The local variable is now only referenced once so drop it.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-7-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2dd6532e 06-Jul-2022 John Garry <john.garry@huawei.com>

blk-mq: Drop 'reserved' arg of busy_tag_iter_fn

We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter
function so it may be dropped.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ee78ec10 25-Jun-2022 Liu Song <liusong@linux.alibaba.com>

blk-mq: blk_mq_tag_busy is no need to return a value

Currently "blk_mq_tag_busy" return value has no effect, so adjust it.
Some code implementations have also been adjusted to enhance
readability.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1656170121-1619-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ff47dbd1 02-Jun-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

block: remove useless BUG_ON() in blk_mq_put_tag()

Since the if condition in blk_mq_put_tag() checks that the tag to put is
not a reserved one, the BUG_ON() check in the else branch checking if
the tag is indeed a reserved one is useless. Remove it.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220602075159.1273366-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e5cc99e 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: manage hctx map via xarray

First code becomes more clean by switching to xarray from plain array.

Second use-after-free on q->queue_hw_ctx can be fixed because
queue_for_each_hw_ctx() may be run when updating nr_hw_queues is
in-progress. With this patch, q->hctx_table is defined as xarray, and
this structure will share same lifetime with request queue, so
queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com
[axboe: fix blk_mq_hw_ctx forward declaration]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f481208 08-Mar-2022 Ming Lei <ming.lei@redhat.com>

blk-mq: prepare for implementing hctx table via xarray

It is inevitable to cause use-after-free on q->queue_hw_ctx between
queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting
to xarray can fix the uaf, meantime code gets cleaner.

Prepare for converting q->queue_hctx_ctx into xarray, one thing is that
xa_for_each() can only accept 'unsigned long' as index, so changes type
of hctx index of queue_for_each_hw_ctx() into 'unsigned long'.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f607293 08-Feb-2022 John Garry <john.garry@huawei.com>

sbitmap: Delete old sbitmap_queue_get_shallow()

Since __sbitmap_queue_get_shallow() was introduced in commit c05e66733788
("sbitmap: add sbitmap_get_shallow() operation"), it has not been used.

Delete __sbitmap_queue_get_shallow() and rename public
__sbitmap_queue_get_shallow() -> sbitmap_queue_get_shallow() as it is odd
to have public __foo but no foo at all.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1644322024-105340-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 180dccb0 12-Jan-2022 Laibin Qiu <qiulaibin@huawei.com>

blk-mq: fix tag_get wait task can't be awakened

In case of shared tags, there might be more than one hctx which
allocates from the same tags, and each hctx is limited to allocate at
most:
hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);

tag idle detection is lazy, and may be delayed for 30sec, so there
could be just one real active hctx(queue) but all others are actually
idle and still accounted as active because of the lazy idle detection.
Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
forever on this real active hctx.

Fix this by recalculating wake_batch when inc or dec active_queues.

Fixes: 0d2602ca30e41 ("blk-mq: improve support for shared tags maps")
Suggested-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fea9f92f 06-Dec-2021 John Garry <john.garry@huawei.com>

blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags

Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees
using megaraid SAS RAID card since moving to shared tags [0].

Previously, when shared tags was shared sbitmap, this function was less
than optimum since we would iter through all tags for all hctx's,
yet only ever match upto tagset depth number of rqs.

Since the change to shared tags, things are even less efficient if we have
parallel callers of blk_mq_queue_tag_busy_iter(). This is because in
bt_iter() -> blk_mq_find_and_get_req() there would be more contention on
accessing each request ref and tags->lock since they are now shared among
all HW queues.

Optimise by having separate calls to bt_for_each() for when we're using
shared tags. In this case no longer pass a hctx, as it is no longer
relevant, and teach bt_iter() about this.

Ming suggested something along the lines of this change, apart from a
different implementation.

[0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reported-and-tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support")
Link: https://lore.kernel.org/r/1638794990-137490-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fc39f8d2 06-Dec-2021 John Garry <john.garry@huawei.com>

blk-mq: Delete busy_iter_fn

Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete
busy_iter_fn to reduce duplication.

It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is
less specific.

However busy_tag_iter_fn is used in many different parts of the tree,
unlike busy_iter_fn which is just use in block/, so just take the
straightforward path now, so that we could rename later treewide.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ab30a33 06-Dec-2021 John Garry <john.garry@huawei.com>

blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument

The only user of blk_mq_hw_ctx blk_mq_hw_ctx argument is
blk_mq_rq_inflight().

Function blk_mq_rq_inflight() uses the hctx to find the associated request
queue to match against the request. However this same check is already
done in caller bt_iter(), so drop this check.

With that change there are no more users of busy_iter_fn blk_mq_hw_ctx
argument, so drop the argument.

Reviewed-by Hannes Reinecke <hare@suse.de>

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a467d0f 14-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: switch to atomic_t for request references

refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.

This borrows that same implementation, which in turn is based on the
mm implementation from Linus.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0994c64e 18-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of #hw queues in
that case. Rather it should just iter once.

Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1634550083-202815-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f794f335 08-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: add support for blk_mq_end_request_batch()

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 349302da 09-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: improve batched tag allocation

Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation, and it is also more efficient than just
doing repeated calls into __sbitmap_queue_get().

This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single
CPU core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 079a2e3e 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Change shared sbitmap naming to shared tags

Now that shared sbitmap support really means shared tags, rename symbols
to match that.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ae0f1a73 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Stop using pointers for blk_mq_tags bitmap tags

Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.

This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").

Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e155b0c2 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Use shared tags for shared sbitmap support

Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.

However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static
rqs.

Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 645db34e 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()

Refactor blk_mq_free_map_and_requests() such that it can be used at many
sites at which the tag map and rqs are freed.

Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the
alloc equivalent.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 63064be1 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Add blk_mq_alloc_map_and_rqs()

Add a function to combine allocating tags and the associated requests,
and factor out common patterns to use this new function.

Some function only call blk_mq_alloc_map_and_rqs() now, but more
functionality will be added later.

Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they
are only used in blk-mq.c, and finally rename some functions for
conciseness and consistency with other function names:
- __blk_mq_alloc_map_and_{request -> rqs}()
- blk_mq_alloc_{map_and_requests -> set_map_and_rqs}()

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a7e7388d 05-Oct-2021 John Garry <john.garry@huawei.com>

blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()

Put the functionality to update the sched shared sbitmap size in a common
function.

Since the same formula is always used to resize, and it can be got from
the request queue argument, so just pass the request queue pointer.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-10-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 67f3b2f8 06-Sep-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid to iterate over stale request

blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.

So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].

scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.

Fixes the issue by not iterating over stale request.

Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d97e594c 13-May-2021 John Garry <john.garry@huawei.com>

blk-mq: Use request queue-wide tags for tagset-wide sbitmap

The tags used for an IO scheduler are currently per hctx.

As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.

This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.

Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.

Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).

For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 56b68085 13-May-2021 John Garry <john.garry@huawei.com>

blk-mq: Some tag allocation code refactoring

The tag allocation code to alloc the sbitmap pairs is common for regular
bitmaps tags and shared sbitmap, so refactor into a common function.

Also remove superfluous "flags" argument from blk_mq_init_shared_sbitmap().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bd63141d 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: clear stale request in tags->rq[] before freeing one request pool

refcount_inc_not_zero() in bt_tags_iter() still may read one freed
request.

Fix the issue by the following approach:

1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero in bt_tags_iter()

2) clearing stale request referred via ->rqs[tag] before freeing
request pool, the per-tags spinlock is held for clearing stale
->rq[tag]

So after we cleared stale requests, bt_tags_iter() won't observe
freed request any more, also the clearing will wait for pending
request reference.

The idea of clearing ->rqs[] is borrowed from John Garry's previous
patch and one recent David's patch.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2e315dc0 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter

Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(), and
this way will prevent the request from being re-used when ->fn is
running. The approach is same as what we do during handling timeout.

Fix request use-after-free(UAF) related with completion race or queue
releasing:

- If one rq is referred before rq->q is frozen, then queue won't be
frozen before the request is released during iteration.

- If one rq is referred after rq->q is frozen, refcount_inc_not_zero()
will return false, and we won't iterate over this request.

However, still one request UAF not covered: refcount_inc_not_zero() may
read one freed request, and it will be handled in next patch.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 39aa56db 11-Mar-2021 Nikolay Borisov <nborisov@suse.com>

blk-mq: Always use blk_mq_is_sbitmap_shared

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20210311081713.2763171-1-nborisov@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9cf1adc6 19-Mar-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

blk-mq: Sentence reconstruct for better readability

Sentence reconstruction for better readability.

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 76cffccd 18-Sep-2020 yangerkun <yangerkun@huawei.com>

block-mq: fix comments in blk_mq_queue_tag_busy_iter

'f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter")' introduce a bug what we may sleep between
rcu lock. Then '530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter
callbacks")' fix it by get request_queue's ref. And 'a9a808084d6a ("block:
Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()")'
remove the synchronize_rcu in __blk_mq_update_nr_hw_queues. We need
update the confused comments in blk_mq_queue_tag_busy_iter.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 28500850 11-Sep-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: always allow reserved allocation in hctx_may_queue

NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.

Tags can be reserved in these tagset. Before error recovery, there is
often lots of in-flight requests which can't be completed, and new
reserved request may be needed in error recovery path. However,
hctx_may_queue() can always return false because there is too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f1b49fdc 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap

For when using a shared sbitmap, no longer should the number of active
request queues per hctx be relied on for when judging how to share the tag
bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgement based on that.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 32bc15af 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Facilitate a shared sbitmap per tagset

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 222a5ae0 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Use pointers for blk_mq_tags bitmap tags

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1c0706a7 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Pass flags for tag init/free

Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.

For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d063237 19-Aug-2020 Hannes Reinecke <hare@suse.de>

blk-mq: Free tags in blk_mq_init_tags() upon error

Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free in that same function upon error, rather than a callee which is to
init the bitmap tags (blk_mq_init_tags()).

[jpg: Split from an earlier patch with a new commit message]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 570e9b73 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move blk_mq_get_driver_tag into blk-mq.c

blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to
stay in blk-mq.c, so move it and preparing for cleanup code of
get/put driver tag.

Meantime hctx_may_queue() is moved to header file and it is fine
since it is defined as inline always.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 42fdc5e4 29-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove the BLK_MQ_REQ_INTERNAL flag

Just check for a non-NULL elevator directly to make the code more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a8a5e383 15-Jun-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-mq: Remove redundant 'return' statement

The blk_mq_all_tag_iter() is a void function, thus remove
the redundant 'return' statement in this function.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 22f614bc 05-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: fix blk_mq_all_tag_iter

blk_mq_all_tag_iter() is added to iterate all requests, so we should
fetch the request from ->static_rqs][] instead of ->rqs[] which is for
holding in-flight request only.

Fix it by adding flag of BT_TAG_ITER_STATIC_RQS.

Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: John Garry <john.garry@huawei.com>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d94ecfc3 05-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: split out a __blk_mq_get_driver_tag helper

Allocation of the driver tag in the case of using a scheduler shares very
little code with the "normal" tag allocation. Split out a new helper to
streamline this path, and untangle it from the complex normal tag
allocation.

This way also avoids to fail driver tag allocation because of inactive hctx
during cpu hotplug, and fixes potential hang risk.

Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bf0beec0 29-May-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: drain I/O when all CPUs in a hctx are offline

Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
up queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

The driver/subsystem has to quiesce the interrupt line and the associated
queue _before_ it gets shutdown in CPU unplug and not fiddle with it
until it's restarted by the core when the CPU is plugged in again."

However, current blk-mq implementation doesn't quiesce hw queue before
the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a
cpuhp state handled after the CPU is down, so there isn't any chance to
quiesce the hctx before shutting down the CPU.

Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
where the last CPU goes away, and wait for completion of in-flight
requests. This guarantees that there is no inflight I/O before shutting
down the managed IRQ.

Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
to wait for completion of in-flight requests from these drivers to avoid
a potential dead-lock. It is safe to do this for stacking drivers as those
do not use interrupts at all and their I/O completions are triggered by
underlying devices I/O completion.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

[hch: different retry mechanism, merged two patches, minor cleanups]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 602380d2 29-May-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: add blk_mq_all_tag_iter

Add a new blk_mq_all_tag_iter function to iterate over all allocated
scheduler tags and driver tags. This is more flexible than the existing
blk_mq_all_tag_busy_iter function as it allows the callers to do whatever
they want on allocated request instead of being limited to started
requests.

It will be used to implement draining allocated requests on specified
hctx in this patchset.

[hch: switch from the two booleans to a more readable flags field and
consolidate the tags iter functions]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 76647368 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: use BLK_MQ_NO_TAG in more places

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 419c3d5e 29-May-2020 Christoph Hellwig <hch@lst.de>

blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG

To prepare for wider use of this constant give it a more applicable name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cae740a0 26-Feb-2020 John Garry <john.garry@huawei.com>

blk-mq: Remove some unused function arguments

The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(),
blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove
it.

Overall obj code size shows a minor reduction, before:
text data bss dec hex filename
27306 1312 0 28618 6fca block/blk-mq.o
4303 272 0 4575 11df block/blk-mq-tag.o

after:
27282 1312 0 28594 6fb2 block/blk-mq.o
4311 272 0 4583 11e7 block/blk-mq-tag.o

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
--
This minor patch had been carried as part of the blk-mq shared tags RFC,
I'd rather not carry it anymore as it required rebasing, so now or never..
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cb711b91 13-Nov-2019 John Garry <john.garry@huawei.com>

blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()

These functions are not referenced, so delete them.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f9934a80 23-Jul-2019 Ming Lei <ming.lei@redhat.com>

blk-mq: introduce blk_mq_tagset_wait_completed_request()

blk-mq may schedule to call queue's complete function on remote CPU via
IPI, but doesn't provide any way to synchronize the request's complete
fn. The current queue freeze interface can't provide the synchonization
because aborted requests stay at blk-mq queues during EH.

In some driver's EH(such as NVMe), hardware queue's resource may be freed &
re-allocated. If the completed request's complete fn is run finally after the
hardware queue's resource is released, kernel crash will be triggered.

Prepare for fixing this kind of issue by introducing
blk_mq_tagset_wait_completed_request().

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c05f4220 01-Jul-2019 Bart Van Assche <bvanassche@acm.org>

blk-mq: remove blk_mq_put_ctx()

No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
on preemption being disabled for its correctness. Since removing the CPU
preemption calls does not measurably affect performance, simplify the
blk-mq code by removing the blk_mq_put_ctx() function and also by not
disabling preemption in blk_mq_get_ctx().

Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3dcf60bc 30-Apr-2019 Christoph Hellwig <hch@lst.de>

block: add SPDX tags to block layer files missing licensing information

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ccdf4a3 24-Jan-2019 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: save queue mapping result into ctx directly

Currently, the queue mapping result is saved in a two-dimensional
array. In the hot path, to get a hctx, we need do following:

q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

This isn't very efficient. We could save the queue mapping result into
ctx directly with different hctx type, like,

ctx->hctxs[type]

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5d2ee712 29-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: optimize wakeup check

Even if we have no waiters on any of the sbitmap_queue wait states, we
still have to loop every entry to check. We do this for every IO, so
the cost adds up.

Shift a bit of the cost to the slow path, when we actually have waiters.
Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
an internal count of how many are currently active. Then we can simply
check this count in sbq_wake_ptr() and not have to loop if we don't
have any sleepers.

Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ab11fe5a 08-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq-tag: document tag iteration helper return value

Document the fact that the strategy function passed in can
control whether to continue iterating or not.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7baa8572 08-Nov-2018 Jens Axboe <axboe@kernel.dk>

blk-mq-tag: change busy_iter_fn to return whether to continue or not

We have this functionality in sbitmap, but we don't export it in
blk-mq for users of the tags busy iteration. This can be useful
for stopping the iteration, if the caller doesn't need to find
more requests.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea4f995e 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: cache request hardware queue mapping

We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call as well in that function. cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f9afca4d 29-Oct-2018 Jens Axboe <axboe@kernel.dk>

blk-mq: pass in request/bio flags to queue mapping

Prep patch for being able to place request based not just on
CPU location, but also on the type of request.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7ca01926 24-Oct-2018 Jens Axboe <axboe@kernel.dk>

block: remove legacy rq tagging

It's now unused, kill it.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 530ca2c9 25-Sep-2018 Keith Busch <kbusch@kernel.org>

blk-mq: Allow blocking queue tag iter callbacks

A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c7b1bf5c 21-Sep-2018 Bart Van Assche <bvanassche@acm.org>

blk-mq: Document the functions that iterate over requests

Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix several minor spelling
and grammer mistakes in comments in these functions.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f5bbbbe4 21-Aug-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter

For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d263ed99 09-Aug-2018 Jianchao Wang <jianchao.w.wang@oracle.com>

blk-mq: count the hctx as active before allocating tag

Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.

To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2d5ba0e2 02-Aug-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fix blk_mq_tagset_busy_iter

Commit d250bf4e776ff09d5("blk-mq: only iterate over inflight requests
in blk_mq_tagset_busy_iter") uses 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT'
to replace 'blk_mq_request_started(req)', this way is wrong, and causes
lots of test system hang during booting.

Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().

Fixes: d250bf4e776ff09d5 ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matt Hart <matthew.hart@linaro.org>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: John Garry <john.garry@huawei.com>
Cc: Hannes Reinecke <hare@suse.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: linux-scsi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 75d6e175 02-Aug-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: fix updating tags depth

The passed 'nr' from userspace represents the total depth, meantime
inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth,
and 'nr_reserved_tags' stores the reserved part.

There are two issues in blk_mq_tag_update_depth() now:

1) for growing tags, we should have used the passed 'nr', and keep the
number of reserved tags not changed.

2) the passed 'nr' should have been used for checking against
'tags->nr_tags', instead of number of the normal part.

This patch fixes the above two cases, and avoids kernel crash caused
by wrong resizing sbitmap queue.

Cc: "Ewan D. Milne" <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Tested by: Marco Patalano <mpatalan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e6c3456a 14-Jun-2018 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_tagset_iter

Unused now that nvme stopped using it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>


# d250bf4e 30-May-2018 Christoph Hellwig <hch@lst.de>

blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter

We already check for started commands in all callbacks, but we should
also protect against already completed commands. Do this by taking
the checks to common code.

Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e6fc4649 24-May-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid starving tag allocation after allocating process migrates

When the allocation process is scheduled back and the mapped hw queue is
changed, fake one extra wake up on previous queue for compensating wake
up miss, so other allocations on the previous queue won't be starved.

This patch fixes one request allocation hang issue, which can be
triggered easily in case of very low nr_request.

The race is as follows:

1) 2 hw queues, nr_requests are 2, and wake_batch is one

2) there are 3 waiters on hw queue 0

3) two in-flight requests in hw queue 0 are completed, and only two
waiters of 3 are waken up because of wake_batch, but both the two
waiters can be scheduled to another CPU and cause to switch to hw
queue 1

4) then the 3rd waiter will wait for ever, since no in-flight request
is in hw queue 0 any more.

5) this patch fixes it by the fake wakeup when waiter is scheduled to
another hw queue

Cc: <stable@vger.kernel.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Modified commit message to make it clearer, and make it apply on
top of the 4.18 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e5dff41 14-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: improve heavily contended tag case

Even with a number of waitqueues, we can get into a situation where we
are heavily contended on the waitqueue lock. I got a report on spc1
where we're spending seconds doing this. Arguably the use case is nasty,
I reproduce it with one device and 1000 threads banging on the device.
But that doesn't mean we shouldn't be handling it better.

What ends up happening is that a thread will fail to get a tag, add
itself to the waitqueue, and subsequently get woken up when a tag is
freed - only to find itself going back to sleep on the waitqueue.

Instead of waking all threads, use an exclusive wait and wake up our
sbitmap batch count instead. This seems to work well for me (massive
improvement for this use case), and it survives basic testing. But I
haven't fully verified it yet.

An additional improvement is running the queue and checking for a new
tag BEFORE needing to add ourselves to the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dab7487b 10-Oct-2017 Sagi Grimberg <sagi@grimberg.me>

block: remove blk_mq_reinit_tagset

No callers left.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 149e10f8 10-Oct-2017 Sagi Grimberg <sagi@grimberg.me>

block: introduce blk_mq_tagset_iter

Iterator helper to apply a function on all the
tags in a given tagset. export it as it will be used
outside the block layer later on.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# d352ae20 17-Aug-2017 Bart Van Assche <bvanassche@acm.org>

blk-mq: Make blk_mq_reinit_tagset() calls easier to read

Since blk_mq_ops.reinit_request is only called from inside
blk_mq_reinit_tagset(), make this function pointer an argument of
blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
This patch does not change any functionality but makes
blk_mq_reinit_tagset() calls easier to read and to analyze.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: James Smart <james.smart@broadcom.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7f5562d5 04-Aug-2017 Jens Axboe <axboe@kernel.dk>

blk-mq-tag: check for NULL rq when iterating tags

Since we introduced blk-mq-sched, the tags->rqs[] array has been
dynamically assigned. So we need to check for NULL when iterating,
since there's a window of time where the bit is set, but we haven't
dynamically assigned the tags->rqs[] array position yet.

This is perfectly safe, since the memory backing of the request is
never going away while the device is alive.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 229a9287 14-Apr-2017 Omar Sandoval <osandov@fb.com>

blk-mq: add shallow depth option for blk_mq_get_tag()

Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0067d4b0 13-Mar-2017 Sagi Grimberg <sagi@grimberg.me>

blk-mq: Fix tagset reinit in the presence of cpu hot-unplug

In case cpu was unplugged, we need to make sure not to assume
that the tags for that cpu are still allocated. so check
for null tags when reinitializing a tagset.

Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 415b806d 27-Feb-2017 Sagi Grimberg <sagi@grimberg.me>

blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Modified by me to also check at driver tag allocation time if the
original request was reserved, so we can be sure to allocate a
properly reserved tag at that point in time, too.

Signed-off-by: Jens Axboe <axboe@fb.com>


# bd6737f1 27-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()

Instead of letting the caller check this and handle the details
of inserting a flush request, put the logic in the scheduler
insertion function. This fixes direct flush insertion outside
of the usual make_request_fn calls, like from dm via
blk_insert_cloned_request().

Signed-off-by: Jens Axboe <axboe@fb.com>


# d96b37c0 25-Jan-2017 Omar Sandoval <osandov@fb.com>

blk-mq: move tags and sched_tags info from sysfs to debugfs

These are very tied to the blk-mq tag implementation, so exposing them
to sysfs isn't a great idea. Move the debugging information to debugfs
and add basic entries for the number of tags and the number of reserved
tags to sysfs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 200e86b3 25-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: only apply active queue tag throttling for driver tags

If we have a scheduler attached, we have two sets of tags. We don't
want to apply our active queue throttling for the scheduler side
of tags, that only applies to driver tags since that's the resource
we need to dispatch an IO.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 70f36b60 19-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: allow resize of scheduler requests

Add support for growing the tags associated with a hardware queue, for
the scheduler tags. Currently we only support resizing within the
limits of the original depth, change that so we can grow it as well by
allocating and replacing the existing scheduler tag set.

This is similar to how we could increase the software queue depth with
the legacy IO stack and schedulers.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 8cecb07d 19-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-tag: remove redundant check for 'data->hctx' being non-NULL

We used to pass in NULL for hctx for reserved tags, but we don't
do that anymore. Hence the check for whether hctx is NULL or not
is now redundant, kill it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: a642a158aec6 ("blk-mq-tag: cleanup the normal/reserved tag allocation")
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2af8cbe3 13-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq: split tag ->rqs[] into two

This is in preparation for having two sets of tags available. For
that we need a static index, and a dynamically assignable one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 4941115b 13-Jan-2017 Jens Axboe <axboe@fb.com>

blk-mq-tag: cleanup the normal/reserved tag allocation

This is in preparation for having another tag set available. Cleanup
the parameters, and allow passing in of tags for blk_mq_put_tag().

Signed-off-by: Jens Axboe <axboe@fb.com>
[hch: even more cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>


# 98d95416 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: randomize initial alloc_hint values

In order to get good cache behavior from a sbitmap, we want each CPU to
stick to its own cacheline(s) as much as possible. This might happen
naturally as the bitmap gets filled up and the alloc_hint values spread
out, but we really want this behavior from the start. blk-mq apparently
intended to do this, but the code to do this was never wired up. Get rid
of the dead code and make it part of the sbitmap library.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f4a644db 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: push alloc policy into sbitmap_queue

Again, there's no point in passing this in every time. Make it part of
struct sbitmap_queue and clean up the API.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 40aabb67 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: push per-cpu last_tag into sbitmap_queue

Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 88459642 17-Sep-2016 Omar Sandoval <osandov@fb.com>

blk-mq: abstract tag allocation out into sbitmap library

This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1b157939 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: get rid of the cpumask in struct blk_mq_tags

Unused now that NVMe sets up irq affinity before calling into blk-mq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7d7e0f90 14-Sep-2016 Christoph Hellwig <hch@lst.de>

blk-mq: remove ->map_queue

All drivers use the default, so provide an inline version of it. If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.

This provides better code generation, and better debugability as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 486cf989 06-Jul-2016 Sagi Grimberg <sagi@grimberg.me>

blk-mq: Introduce blk_mq_reinit_tagset

The new nvme-rdma driver will need to reinitialize all the tags as part of
the error recovery procedure (realloc the tag memory region). Add a helper
in blk-mq for it that can iterate over all requests in a tagset to make
this easier.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Stephen Bates <Stephen.Bates@pmcs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e8f1e163 10-Mar-2016 Sagi Grimberg <sagig@mellanox.com>

blk-mq: Make blk_mq_all_tag_busy_iter static

No caller outside the blk-mq code so we can settle
with it static.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e0489487 10-Mar-2016 Sagi Grimberg <sagig@mellanox.com>

blk-mq: Export tagset iter function

Its useful to iterate on all the active tags in cases
where we will need to fail all the queues IO.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
[hch: carefully check for valid tagsets]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6f3b0e8b 26-Nov-2015 Christoph Hellwig <hch@lst.de>

blk-mq: add a flags parameter to blk_mq_alloc_request

We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t. Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# d0164adc 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f42d79ab 13-Oct-2015 Junichi Nomura <j-nomura@ce.jp.nec.com>

blk-mq: fix use-after-free in blk_mq_free_tag_set()

tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().

Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8ee1b7b9 08-Oct-2015 Kosuke Tatsukawa <tatsu@ab.jp.nec.com>

blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c

blk_mq_tag_update_depth() seems to be missing a memory barrier which
might cause the waker to not notice the waiter and fail to send a
wake_up as in the following figure.

blk_mq_tag_update_depth bt_get
------------------------------------------------------------------------
if (waitqueue_active(&bs->wait))
/* The CPU might reorder the test for
the waitqueue up here, before
prior writes complete */
prepare_to_wait(&bs->wait, &wait,
TASK_UNINTERRUPTIBLE);
tag = __bt_get(hctx, bt, last_tag,
tags);
/* Value set in bt_update_count not
visible yet */
bt_update_count(&tags->bitmap_tags, tdepth);
/* blk_mq_tag_wakeup_all(tags, false); */
bt = &tags->bitmap_tags;
wake_index = atomic_read(&bt->wake_index);
...
io_schedule();
------------------------------------------------------------------------

This patch adds the missing memory barrier.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0bf6cd5b 27-Sep-2015 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a helper to iterate all tags for a request_queue

And replace the blk_mq_tag_busy_iter with it - the driver use has been
replaced with a new helper a while ago, and internal to the block we
only need the new version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0048b483 09-Aug-2015 Ming Lei <ming.lei@canonical.com>

blk-mq: fix race between timeout and freeing request

Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This way is obviously
wrong because the request can be freed any time and some
fiedds of the request can't be trusted, then kernel oops
might be triggered[1].

Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.

Also blk_mq_tag_to_rq() gets much simplified with this patch.

Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].

[1] kernel oops log
[ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
[ 439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
[ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[ 439.700653] Dumping ftrace buffer:^M
[ 439.700653] (ftrace buffer empty)^M
[ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
[ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
[ 439.730500] RIP: 0010:[<ffffffff812d89ba>] [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283^M
[ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
[ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
[ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
[ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
[ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
[ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
[ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
[ 439.730500] Stack:^M
[ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
[ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
[ 439.755663] Call Trace:^M
[ 439.755663] <IRQ> ^M
[ 439.755663] [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
[ 439.755663] [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[ 439.755663] [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[ 439.755663] [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
[ 439.755663] [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[ 439.755663] [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
[ 439.755663] [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
[ 439.755663] [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
[ 439.755663] [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[ 439.755663] [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
[ 439.755663] [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
[ 439.755663] [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
[ 439.755663] [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
[ 439.755663] [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
[ 439.755663] <EOI> ^M
[ 439.755663] [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
[ 439.755663] [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
[ 439.755663] [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
[ 439.755663] [<ffffffff81550066>] __schedule+0x469/0x6cd^M
[ 439.755663] [<ffffffff8155039b>] schedule+0x82/0x9a^M
[ 439.789267] [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
[ 439.790911] [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
[ 439.790911] [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
[ 439.790911] [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
[ 439.790911] [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
[ 439.790911] [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
[ 439.790911] [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
[ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
^M
[ 439.790911] RIP [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[ 439.790911] RSP <ffff880819203da0>^M
[ 439.790911] CR2: 0000000000000158^M
[ 439.790911] ---[ end trace d40af58949325661 ]---^M

Cc: <stable@vger.kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f26cdc85 01-Jun-2015 Keith Busch <kbusch@kernel.org>

blk-mq: Shared tag enhancements

Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bc188d81 18-Mar-2015 Sam Bradshaw <sbradshaw@micron.com>

blkmq: Fix NULL pointer deref when all reserved tags in

When allocating from the reserved tags pool, bt_get() is called with
a NULL hctx. If all tags are in use, the hw queue is kicked to push
out any pending IO, potentially freeing tags, and tag allocation is
retried. The problem is that blk_mq_run_hw_queue() doesn't check for
a NULL hctx. So we avoid it with a simple NULL hctx test.

Tested by hammering mtip32xx with concurrent smartctl/hdparm.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
Cc: stable@kernel.org

Added appropriate comment.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 564e559f 11-Feb-2015 Tony Battersby <tonyb@cybernetics.com>

blk-mq: fix double-free in error path

If the allocation of bt->bs fails, then bt->map can be freed twice, once
in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in
blk_mq_init_bitmap_tags() -> bt_free(). Fix by setting the pointer to
NULL after the first free.

Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 24391c0d 23-Jan-2015 Shaohua Li <shli@fb.com>

blk-mq: add tag allocation policy

This is the blk-mq part to support tag allocation policy. The default
allocation policy isn't changed (though it's not a strict FIFO). The new
policy is round-robin for libata. But it's a try-best implementation. If
multiple tasks are competing, the tags returned will be mixed (which is
unavoidable even with !mq, as requests from different tasks can be
mixed in queue)

Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0bf36498 14-Jan-2015 Jens Axboe <axboe@fb.com>

blk-mq: fix false negative out-of-tags condition

The blk-mq tagging tries to maintain some locality between CPUs and
the tags issued. The tags are split into groups of words, and the
words may not be fully populated. When searching for a new free tag,
blk-mq may look at partial words, hence it passes in an offset/size
to find_next_zero_bit(). However, it does that wrong, the size must
always be the full length of the number of tags in that word,
otherwise we'll potentially miss some near the end.

Another issue is when __bt_get() goes from one word set to the next.
It bumps the index, but not the last_tag associated with the
previous index. Bump that to be in the range of the new word.

Finally, clean up __bt_get() and __bt_get_word() a bit and get
rid of the goto in there, and the unnecessary 'wrap' variable.

Signed-off-by: Jens Axboe <axboe@fb.com>


# aed3ea94 22-Dec-2014 Jens Axboe <axboe@fb.com>

block: wake up waiters when a queue is marked dying

If it's dying, we can't expect new request to complete and come
in an wake up other tasks waiting for requests. So after we
have marked it as dying, wake up everybody currently waiting
for a request. Once they wake, they will retry their allocation
and fail appropriately due to the state of the queue.

Tested-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 35d37c66 15-Dec-2014 Jens Axboe <axboe@fb.com>

Revert "blk-mq: Micro-optimize bt_get()"

This reverts commit 52f7eb945f2ba62b324bb9ae16d945326a961dcf.

The optimization is only really safe for a single queue, otherwise
'bs' and 'bt' can indeed change, and if we don't do a finish_wait()
for each loop, we'll potentially change the wait structure and
corrupt task wait list.

Reported-by: Jan Kara <jack@suse.cz>


# 52f7eb94 09-Dec-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: Micro-optimize bt_get()

Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
calls into a single call.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c38d185d 09-Dec-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: Fix a race between bt_clear_tag() and bt_get()

What we need is the following two guarantees:
* Any thread that observes the effect of the test_and_set_bit() by
__bt_get_word() also observes the preceding addition of 'current'
to the appropriate wait list. This is guaranteed by the semantics
of the spin_unlock() operation performed by prepare_and_wait().
Hence the conversion of test_and_set_bit_lock() into
test_and_set_bit().
* The wait lists are examined by bt_clear() after the tag bit has
been cleared. clear_bit_unlock() guarantees that any thread that
observes that the bit has been cleared also observes the store
operations preceding clear_bit_unlock(). However,
clear_bit_unlock() does not prevent that the wait lists are examined
before that the tag bit is cleared. Hence the addition of a memory
barrier between clear_bit() and the wait list examination.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9e98e9d7 09-Dec-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: Avoid that __bt_get_word() wraps multiple times

If __bt_get_word() is called with last_tag != 0, if the first
find_next_zero_bit() fails, if after wrap-around the
test_and_set_bit() call fails and find_next_zero_bit() succeeds,
if the next test_and_set_bit() call fails and subsequently
find_next_zero_bit() does not find a zero bit, then another
wrap-around will occur. Avoid this by introducing an additional
local variable.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>


# 080ff351 08-Dec-2014 Jens Axboe <axboe@fb.com>

blk-mq: re-check for available tags after running the hardware queue

If we run out of tags and have to sleep, we run the hardware queue
to kick pending IO into gear. During that run, we may have completed
requests, so re-check if we have free tags before going to sleep.

Signed-off-by: Jens Axboe <axboe@fb.com>


# b3223207 08-Dec-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: fix hang in bt_get()

Avoid that if there are fewer hardware queues than CPU threads that
bt_get() can hang. The symptoms of the hang were as follows:

* All tags allocated for a particular hardware queue.
* (nr_tags) pending commands for that hardware queue.
* No pending commands for the software queues associated with that
hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 70114c39 24-Nov-2014 Jens Axboe <axboe@fb.com>

blk-mq: cleanup tag free handling

We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag()
from blk_mq_put_tag(), so just inline the two calls instead of
having them as separate functions.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 205fb5f5 30-Oct-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: add blk_mq_unique_tag()

The queuecommand() callback functions in SCSI low-level drivers
need to know which hardware context has been selected by the
block layer. Since this information is not available in the
request structure, and since passing the hctx pointer directly to
the queuecommand callback function would require modification of
all SCSI LLDs, add a function to the block layer that allows to
query the hardware context index.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 9d8f0bcc 07-Oct-2014 Bart Van Assche <bvanassche@acm.org>

blk-mq: Make bt_clear_tag() easier to read

Eliminate a backwards goto statement from bt_clear_tag().

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# abab13b5 07-Oct-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix potential hang if rolling wakeup depth is too high

We currently divide the queue depth by 4 as our batch wakeup
count, but we split the wakeups over BT_WAIT_QUEUES number of
wait queues. This defaults to 8. If the product of the resulting
batch wake count and BT_WAIT_QUEUES is higher than the device
queue depth, we can get into a situation where a task goes to
sleep waiting for a request, but never gets woken up.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 4bb659b156996
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>


# 81481eb4 13-Sep-2014 Christoph Hellwig <hch@lst.de>

blk-mq: fix and simplify tag iteration for the timeout handler

Don't do a kmalloc from timer to handle timeouts, chances are we could be
under heavy load or similar and thus just miss out on the timeouts.
Fortunately it is very easy to just iterate over all in use tags, and doing
this properly actually cleans up the blk_mq_busy_iter API as well, and
prepares us for the next patch by passing a reserved argument to the
iterator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 86fb5c56 17-Jun-2014 Alexander Gordeev <agordeev@redhat.com>

blk-mq: bitmap tag: fix races in bt_get() function

This update fixes few issues in bt_get() function:

- list_empty(&wait.task_list) check is not protected;

- was_empty check is always true which results in *every* thread
entering the loop resets bt_wait_state::wait_cnt counter rather
than every bt->wake_cnt'th thread;

- 'bt_wait_state::wait_cnt' counter update is redundant, since
it also gets reset in bt_clear_tag() function;

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2971c35f 12-Jun-2014 Alexander Gordeev <agordeev@redhat.com>

blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt

This piece of code in bt_clear_tag() function is racy:

bs = bt_wake_ptr(bt);
if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
atomic_set(&bs->wait_cnt, bt->wake_cnt);
wake_up(&bs->wait);
}

Since nothing prevents bt_wake_ptr() from returning the very
same 'bs' address on multiple CPUs, the following scenario is
possible:

CPU1 CPU2
---- ----

0. bs = bt_wake_ptr(bt); bs = bt_wake_ptr(bt);
1. atomic_dec_and_test(&bs->wait_cnt)
2. atomic_dec_and_test(&bs->wait_cnt)
3. atomic_set(&bs->wait_cnt, bt->wake_cnt);

If the decrement in [1] yields zero then for some amount of time
the decrement in [2] results in a negative/overflow value, which
is not expected. The follow-up assignment in [3] overwrites the
invalid value with the batch value (and likely prevents the issue
from being severe) which is still incorrect and should be a lesser.

Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8537b120 17-Jun-2014 Alexander Gordeev <agordeev@redhat.com>

blk-mq: bitmap tag: fix races on shared ::wake_index fields

Fix racy updates of shared blk_mq_bitmap_tags::wake_index
and blk_mq_hw_ctx::wake_index fields.

Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# cb96a42c 31-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: fix schedule from atomic context

blk_mq_put_ctx() has to be called before io_schedule() in
bt_get().

This patch fixes the problem by taking similar approach from
percpu_ida allocation for the situation.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 75bb4625 28-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: add file comments and update copyright notices

None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.

Signed-off-by: Jens Axboe <axboe@fb.com>


# a3bd7756 27-May-2014 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_wait_for_tags

The current logic for blocking tag allocation is rather confusing, as we
first allocated and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.

Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# edf866b3 23-May-2014 Sam Bradshaw <sbradshaw@micron.com>

blk-mq: export blk_mq_tag_busy_iter

Export the blk-mq in-flight tag iterator for driver consumption.
This is particularly useful in exception paths or SRSI where
in-flight IOs need to be cancelled and/or reissued. The NVMe driver
conversion will use this.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e3a2b3f9 20-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: allow changing of queue depth through sysfs

For request_fn based devices, the block layer exports a 'nr_requests'
file through sysfs to allow adjusting of queue depth on the fly.
Currently this returns -EINVAL for blk-mq, since it's not wired up.
Wire this up for blk-mq, so that it now also always dynamic
adjustments of the allowed queue depth for any given block device
managed by blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>


# e93ecf60 19-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: move the cache friendly bitmap type of out blk-mq-tag

We will use it for the pending list in blk-mq core as well.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 0d2602ca 13-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: improve support for shared tags maps

This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.

If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.

The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 1f236ab2 10-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: bitmap tag: cleanup blk_mq_init_tags

Both nr_cache and nr_tags arn't needed for bitmap tag anymore.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9d3d21ae 10-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: bitmap tag: select random tag betweet 0 and (depth - 1)

The selected tag should be selected at random between 0 and
(depth - 1) with probability 1/depth, instead between 0 and
(depth - 2) with probability 1/(depth - 1).

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 60f2df8a 10-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: bitmap tag: remove barrier in bt_clear_tag()

The barrier isn't necessary because both atomic_dec_and_test()
and wake_up() implicate one barrier.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0289b2e1 10-May-2014 Ming Lei <tom.leiming@gmail.com>

blk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag()

The unlock memory barrier need to order access to req in free
path and clearing tag bit, otherwise either request free path
may see a allocated request, or initialized request in allocate
path might be modified by the ongoing free path.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 59d13bf5 09-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: use sparser tag layout for lower queue depth

For best performance, spreading tags over multiple cachelines
makes the tagging more efficient on multicore systems. But since
we have 8 * sizeof(unsigned long) tags per cacheline, we don't
always get a nice spread.

Attempt to spread the tags over at least 4 cachelines, using fewer
number of bits per unsigned long if we have to. This improves
tagging performance in setups with 32-128 tags. For higher depths,
the spread is the same as before (BITS_PER_LONG tags per cacheline).

Signed-off-by: Jens Axboe <axboe@fb.com>


# 4bb659b1 09-May-2014 Jens Axboe <axboe@fb.com>

blk-mq: implement new and more efficient tagging scheme

blk-mq currently uses percpu_ida for tag allocation. But that only
works well if the ratio between tag space and number of CPUs is
sufficiently high. For most devices and systems, that is not the
case. The end result if that we either only utilize the tag space
partially, or we end up attempting to fully exhaust it and run
into lots of lock contention with stealing between CPUs. This is
not optimal.

This new tagging scheme is a hybrid bitmap allocator. It uses
two tricks to both be SMP friendly and allow full exhaustion
of the space:

1) We cache the last allocated (or freed) tag on a per blk-mq
software context basis. This allows us to limit the space
we have to search. The key element here is not caching it
in the shared tag structure, otherwise we end up dirtying
more shared cache lines on each allocate/free operation.

2) The tag space is split into cache line sized groups, and
each context will start off randomly in that space. Even up
to full utilization of the space, this divides the tag users
efficiently into cache line groups, avoiding dirtying the same
one both between allocators and between allocator and freeer.

This scheme shows drastically better behaviour, both on small
tag spaces but on large ones as well. It has been tested extensively
to show better performance for all the cases blk-mq cares about.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 5810d903 29-Apr-2014 Jens Axboe <axboe@fb.com>

blk-mq: fix waiting for reserved tags

blk_mq_wait_for_tags() is only able to wait for "normal" tags,
not reserved tags. Pass in which one we should attempt to get
a tag for, so that waiting for reserved tags will work.

Reserved tags are used for internal commands, which are usually
serialized. Hence no waiting generally takes place, but we should
ensure that it actually works if users need that functionality.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 24d2f903 15-Apr-2014 Christoph Hellwig <hch@lst.de>

blk-mq: split out tag initialization, support shared tags

Add a new blk_mq_tag_set structure that gets set up before we initialize
the queue. A single blk_mq_tag_set structure can be shared by multiple
queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modular export of blk_mq_{alloc,free}_tagset added by me.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 11c94444 10-Feb-2014 Masanari Iida <standby24x7@gmail.com>

block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show

cppcheck detected following format string mismatch.
[blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires
'unsigned int' but the argument type is 'int'.

Change "cpu" from int to unsigned int, because the cpu
never become minus value.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 6f6b5d1e 19-Jan-2014 Kent Overstreet <kmo@daterainc.com>

percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask

This patch changes percpu_ida_alloc() + callers to accept task state
bitmask for prepare_to_wait() for code like target/iscsi that needs
it for interruptible sleep, that is provided in a subsequent patch.

It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
and is forced to return a negative value when no tags are available.

v2 changes:
- Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
- Drop signal_pending_state() call
v3 changes:
- Only call prepare_to_wait() + finish_wait() when != TASK_RUNNING
(PeterZ)

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: <stable@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>


# 320ae51f 24-Oct-2013 Jens Axboe <axboe@kernel.dk>

blk-mq: new multi-queue block IO queueing mechanism

Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.

- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.

- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>