History log of /linux-master/lib/sbitmap.c
Revision Date Author Comments
# 5c7fa5c8 15-Jan-2024 Kemeng Shi <shikemeng@huaweicloud.com>

sbitmap: remove stale comment in sbq_calc_wake_batch

After commit 106397376c036 ("sbitmap: fix batching wakeup"), we may wake
up more than one queue for each batch. Just remove stale comment that
we wake up only one queue for each batch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240115145626.665562-1-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 10639737 21-Jul-2023 David Jeffery <djeffery@redhat.com>

sbitmap: fix batching wakeup

Current code supposes that it is enough to provide forward progress by
just waking up one wait queue after one completion batch is done.

Unfortunately this way isn't enough, cause waiter can be added to wait
queue just after it is woken up.

Follows one example(64 depth, wake_batch is 8)

1) all 64 tags are active

2) in each wait queue, there is only one single waiter

3) each time one completion batch(8 completions) wakes up just one
waiter in each wait queue, then immediately one new sleeper is added
to this wait queue

4) after 64 completions, 8 waiters are wakeup, and there are still 8
waiters in each wait queue

5) after another 8 active tags are completed, only one waiter can be
wakeup, and the other 7 can't be waken up anymore.

Turns out it isn't easy to fix this problem, so simply wakeup enough
waiters for single batch.

Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230721095715.232728-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b5fcf787 16-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

sbitmap: correct wake_batch recalculation to avoid potential IO hung

Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in case of shared tags, there could be just one real
active hctx(queue) because of lazy detection of tag idle. Then driver tag
allocation may wait forever on this real active hctx(queue) if wake_batch
is > hctx_max_depth where hctx_max_depth is available tags depth for the
actve hctx(queue). However, the condition wake_batch > hctx_max_depth is
not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only
wake up one wait queue for each wake_batch even though there is only one
waiter in the woken wait queue. After this, there is only one tag to free
and wake_batch may not be reached anymore. Commit 180dccb0dba4f ("blk-mq:
fix tag_get wait task can't be awakened") methioned that driver tag
allocation may wait forever. Actually, the inactive hctx(queue) will be
truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
to wake one waiter per wait queue to break the hung. But IO hung for 30
seconds is also not acceptable. Set batch size to small enough that depth
of the shared hctx(queue) is enough to wake up all of the queues like
sbq_calc_wake_batch do to fix this potential IO hung.

Although hctx_max_depth will be clamped to at least 4 while wake_batch
recalculation does not do the clamp, the wake_batch will be always
recalculated to 1 when hctx_max_depth <= 4.

Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 678418c6 16-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow

There are three differences between __sbitmap_get and
__sbitmap_get_shallow when searching free bit:
1. __sbitmap_get_shallow limit number of bit to search per word.
__sbitmap_get has no such limit.
2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set
wrap according to round_robin.
3. __sbitmap_get_shallow always searches from first bit in first word.
__sbitmap_get searches from first bit when round_robin is not set
otherwise searches from SB_NR_TO_BIT(sb, alloc_hint).

Add helper function sbitmap_find_bit function to do common search while
accept "limit depth per word", "wrap flag" and "first bit to
search" from caller to support the need of both __sbitmap_get and
__sbitmap_get_shallow.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 08470a98 16-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code

Rewrite sbitmap_find_bit_in_index as following:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word
2. Accept "struct sbitmap_word *" directly instead of accepting
"struct sbitmap *" and "int index" to get "struct sbitmap_word *".
3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller
to support need of both __sbitmap_get_shallow and __sbitmap_get.

With helper function sbitmap_find_bit_in_word, we can remove repeat
code in __sbitmap_get_shallow to find bit considring deferred clear.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 903e86f3 16-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

sbitmap: remove redundant check in __sbitmap_queue_get_batch

Commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") mentioned that "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
This commit add check to make sure all masked bits in word before
cmpxchg is zero. Then the existing check after cmpxchg to check any
zero bit is existing in masked bits in word is redundant.

Actually, old value of word before cmpxchg is stored in val and we
will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg.
So we will not reuse busy bits methioned in commit fbb564a557809
("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert
new-added check to remove redundant check.

Fixes: fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f1591a8b 16-Jan-2023 Kemeng Shi <shikemeng@huaweicloud.com>

sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow

Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
pointless and equivalent to setting alloc_hint to zero (because
SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So
simplify the logic.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8032bf12 09-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 26edb30d 15-Nov-2022 Gabriel Krisman Bertazi <krisman@suse.de>

sbitmap: Try each queue to wake up at least one waiter

Jan reported the new algorithm as merged might be problematic if the
queue being awaken becomes empty between the waitqueue_active inside
sbq_wake_ptr check and the wake up. If that happens, wake_up_nr will
not wake up any waiter and we loose too many wake ups. In order to
guarantee progress, we need to wake up at least one waiter here, if
there are any. This now requires trying to wake up from every queue.

Instead of walking through all the queues with sbq_wake_ptr, this call
moves the wake up inside that function. In a previous version of the
patch, I found that updating wake_index several times when walking
through queues had a measurable overhead. This ensures we only update
it once, at the end.

Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags")
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221115224553.23594-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 976570b4 15-Nov-2022 Gabriel Krisman Bertazi <krisman@suse.de>

sbitmap: Advance the queue index before waking up a queue

When a queue is awaken, the wake_index written by sbq_wake_ptr currently
keeps pointing to the same queue. On the next wake up, it will thus
retry the same queue, which is unfair to other queues, and can lead to
starvation. This patch, moves the index update to happen before the
queue is returned, such that it will now try a different queue first on
the next wake up, improving fairness.

Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags")
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20221115224553.23594-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f8126bb 05-Nov-2022 Gabriel Krisman Bertazi <krisman@suse.de>

sbitmap: Use single per-bitmap counting to wake up queued tags

sbitmap suffers from code complexity, as demonstrated by recent fixes,
and eventual lost wake ups on nested I/O completion. The later happens,
from what I understand, due to the non-atomic nature of the updates to
wait_cnt, which needs to be subtracted and eventually reset when equal
to zero. This two step process can eventually miss an update when a
nested completion happens to interrupt the CPU in between the wait_cnt
updates. This is very hard to fix, as shown by the recent changes to
this code.

The code complexity arises mostly from the corner cases to avoid missed
wakes in this scenario. In addition, the handling of wake_batch
recalculation plus the synchronization with sbq_queue_wake_up is
non-trivial.

This patchset implements the idea originally proposed by Jan [1], which
removes the need for the two-step updates of wait_cnt. This is done by
tracking the number of completions and wakeups in always increasing,
per-bitmap counters. Instead of having to reset the wait_cnt when it
reaches zero, we simply keep counting, and attempt to wake up N threads
in a single wait queue whenever there is enough space for a batch.
Waking up less than batch_wake shouldn't be a problem, because we
haven't changed the conditions for wake up, and the existing batch
calculation guarantees at least enough remaining completions to wake up
a batch for each queue at any time.

Performance-wise, one should expect very similar performance to the
original algorithm for the case where there is no queueing. In both the
old algorithm and this implementation, the first thing is to check
ws_active, which bails out if there is no queueing to be managed. In the
new code, we took care to avoid accounting completions and wakeups when
there is no queueing, to not pay the cost of atomic operations
unnecessarily, since it doesn't skew the numbers.

For more interesting cases, where there is queueing, we need to take
into account the cross-communication of the atomic operations. I've
been benchmarking by running parallel fio jobs against a single hctx
nullb in different hardware queue depth scenarios, and verifying both
IOPS and queueing.

Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
varying only the hardware queue length per test.

queue size 2 4 8 16 32 64
6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K)
patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K)

The following is a similar experiment, ran against a nullb with a single
bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
parallel fio jobs operating on the same device

queue size 2 4 8 16 32 64
6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K)
patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K)

It has also survived blktests and a 12h-stress run against nullb. I also
ran the code against nvme and a scsi SSD, and I didn't observe
performance regression in those. If there are other tests you think I
should run, please let me know and I will follow up with results.

[1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/

Cc: Hugh Dickins <hughd@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8b3ccbc1 05-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use prandom_u32_max() when possible, part 2

Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done by hand, covering things that coccinelle could not do on its own.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext2, ext4, and sbitmap
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 81895a65 05-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use prandom_u32_max() when possible, part 1

Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}

@drop_var@
type T;
identifier VAR;
@@

{
- T VAR;
... when != VAR
}

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# 30514bd2 29-Sep-2022 Hugh Dickins <hughd@google.com>

sbitmap: fix lockup while swapping

Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.

Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0 . Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):

sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault

See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wake_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.

I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.

The check for wait_cnt being 0 is obviously racy, and ultimately can lead
to lost wakeups: for example, when there is only a single waitqueue with
waiters. However, lost wakeups are unlikely to matter in these cases,
and a proper fix requires redesign (and benchmarking) of the batched
wakeup code: so let's plug the hole with this bandaid for now.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4acb8341 09-Sep-2022 Keith Busch <kbusch@kernel.org>

sbitmap: fix batched wait_cnt accounting

Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c35227d4 08-Sep-2022 Uros Bizjak <ubizjak@gmail.com>

sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch

Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch.
x86 CMPXCHG instruction returns success in ZF flag, so this change
saves a compare after cmpxchg (and related move instruction in front
of cmpxchg).

Also, atomic_long_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications, e.g.
an extra memory read can be avoided in the loop.

No functional change intended.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220908151200.9993-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 48c03331 08-Sep-2022 Jan Kara <jack@suse.cz>

sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()

When __sbq_wake_up() decrements wait_cnt to 0 but races with someone
else waking the waiter on the waitqueue (so the waitqueue becomes
empty), it exits without reseting wait_cnt to wake_batch number. Once
wait_cnt is 0, nobody will ever reset the wait_cnt or wake the new
waiters resulting in possible deadlocks or busyloops. Fix the problem by
making sure we reset wait_cnt even if we didn't wake up anybody in the
end.

Fixes: 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908130937.2795-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bce1b56c 04-Sep-2022 Jens Axboe <axboe@kernel.dk>

Revert "sbitmap: fix batched wait_cnt accounting"

This reverts commit 16ede66973c84f890c03584f79158dd5b2d725f5.

This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.

Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 16ede669 25-Aug-2022 Keith Busch <kbusch@kernel.org>

sbitmap: fix batched wait_cnt accounting

Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ddbfc34f 25-Aug-2022 Liu Song <liusong@linux.alibaba.com>

sbitmap: remove unnecessary code in __sbitmap_queue_get_batch

If "nr + nr_tags <= map_depth", then the value of nr_tags will not be
greater than map_depth, so no additional comparison is required.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1661483653-27326-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 040b83fc 03-Aug-2022 Yu Kuai <yukuai3@huawei.com>

sbitmap: fix possible io hung due to lost wakeup

There are two problems can lead to lost wakeup:

1) invalid wakeup on the wrong waitqueue:

For example, 2 * wake_batch tags are put, while only wake_batch threads
are woken:

__sbq_wake_up
atomic_cmpxchg -> reset wait_cnt
__sbq_wake_up -> decrease wait_cnt
...
__sbq_wake_up -> wait_cnt is decreased to 0 again
atomic_cmpxchg
sbq_index_atomic_inc -> increase wake_index
wake_up_nr -> wake up and waitqueue might be empty
sbq_index_atomic_inc -> increase again, one waitqueue is skipped
wake_up_nr -> invalid wake up because old wakequeue might be empty

To fix the problem, increasing 'wake_index' before resetting 'wait_cnt'.

2) 'wait_cnt' can be decreased while waitqueue is empty

As pointed out by Jan Kara, following race is possible:

CPU1 CPU2
__sbq_wake_up __sbq_wake_up
sbq_wake_ptr() sbq_wake_ptr() -> the same
wait_cnt = atomic_dec_return()
/* decreased to 0 */
sbq_index_atomic_inc()
/* move to next waitqueue */
atomic_set()
/* reset wait_cnt */
wake_up_nr()
/* wake up on the old waitqueue */
wait_cnt = atomic_dec_return()
/*
* decrease wait_cnt in the old
* waitqueue, while it can be
* empty.
*/

Fix the problem by waking up before updating 'wake_index' and
'wait_cnt'.

With this patch, noted that 'wait_cnt' is still decreased in the old
empty waitqueue, however, the wakeup is redirected to a active waitqueue,
and the extra decrement on the old empty waitqueue is not handled.

Fixes: 88459642cba4 ("blk-mq: abstract tag allocation out into sbitmap library")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fbb564a5 05-Jun-2022 wuchi <wuchi.zero@gmail.com>

lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()

1. Getting next index before continue branch.
2. Checking free bits when setting the target bits. Otherwise,
it may reuse the busying bits.

Signed-off-by: wuchi <wuchi.zero@gmail.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20220605145835.26916-1-wuchi.zero@gmail.com
Fixes: 9672b0d43782 ("sbitmap: add __sbitmap_queue_get_batch()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 863a66cd 15-Mar-2022 Ming Lei <ming.lei@redhat.com>

lib/sbitmap: allocate sb->map via kvzalloc_node

sbitmap has been used in scsi for replacing atomic operations on
sdev->device_busy, so IOPS on some fast scsi storage can be improved.

However, sdev->device_busy can be changed in fast path, so we have to
allocate the sb->map statically. sdev->device_busy has been capped to 1024,
but some drivers may configure the default depth as < 8, then
cause each sbitmap word to hold only one bit. Finally 1024 * 128(
sizeof(sbitmap_word)) bytes is needed for sb->map, given it is order 5
allocation, sometimes it may fail.

Avoid the issue by using kvzalloc_node() for allocating sb->map.

Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220316012708.354668-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f607293 08-Feb-2022 John Garry <john.garry@huawei.com>

sbitmap: Delete old sbitmap_queue_get_shallow()

Since __sbitmap_queue_get_shallow() was introduced in commit c05e66733788
("sbitmap: add sbitmap_get_shallow() operation"), it has not been used.

Delete __sbitmap_queue_get_shallow() and rename public
__sbitmap_queue_get_shallow() -> sbitmap_queue_get_shallow() as it is odd
to have public __foo but no foo at all.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1644322024-105340-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3301bc53 10-Jan-2022 Ming Lei <ming.lei@redhat.com>

lib/sbitmap: kill 'depth' from sbitmap_word

Only the last sbitmap_word can have different depth, and all the others
must have same depth of 1U << sb->shift, so not necessary to store it in
sbitmap_word, and it can be retrieved easily and efficiently by adding
one internal helper of __map_depth(sb, index).

Remove 'depth' field from sbitmap_word, then the annotation of
____cacheline_aligned_in_smp for 'word' isn't needed any more.

Not see performance effect when running high parallel IOPS test on
null_blk.

This way saves us one cacheline(usually 64 words) per each sbitmap_word.

Cc: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20220110072945.347535-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 10825410 27-Jan-2022 Laibin Qiu <qiulaibin@huawei.com>

blk-mq: Fix wrong wakeup batch configuration which will cause hang

Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be
awakened") will recalculate wake_batch when incrementing or decrementing
active_queues to avoid wake_batch > hctx_max_depth. At the same time, in
order to not affect performance as much as possible, the minimum wakeup
batch is set to 4. But when the QD is small (such as QD=1), if inc or dec
active_queues increases wakeup batch, that can lead to a hang:

Fix this problem with the following strategies:
QD : >= 32 | < 32
---------------------------------
wakeup batch: 8~4 | 3~1

Fixes: 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
Link: https://lore.kernel.org/linux-block/78cafe94-a787-e006-8851-69906f0c2128@huawei.com/T/#t
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Link: https://lore.kernel.org/r/20220127100047.1763746-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 180dccb0 12-Jan-2022 Laibin Qiu <qiulaibin@huawei.com>

blk-mq: fix tag_get wait task can't be awakened

In case of shared tags, there might be more than one hctx which
allocates from the same tags, and each hctx is limited to allocate at
most:
hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);

tag idle detection is lazy, and may be delayed for 30sec, so there
could be just one real active hctx(queue) but all others are actually
idle and still accounted as active because of the lazy idle detection.
Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
forever on this real active hctx.

Fix this by recalculating wake_batch when inc or dec active_queues.

Fixes: 0d2602ca30e41 ("blk-mq: improve support for shared tags maps")
Suggested-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9f8b93a7 25-Oct-2021 Jens Axboe <axboe@kernel.dk>

sbitmap: silence data race warning

KCSAN complaints about the sbitmap hint update:

==================================================================
BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear

write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1:
sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
blk_mq_put_tag+0x82/0x90
__blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
__blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
blk_complete_reqs block/blk-mq.c:584 [inline]
blk_done_softirq+0x69/0x90 block/blk-mq.c:589
__do_softirq+0x12c/0x26e kernel/softirq.c:558
run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
kthread+0x262/0x280 kernel/kthread.c:319
ret_from_fork+0x1f/0x30

write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0:
sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
blk_mq_put_tag+0x82/0x90
__blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
__blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
blk_complete_reqs block/blk-mq.c:584 [inline]
blk_done_softirq+0x69/0x90 block/blk-mq.c:589
__do_softirq+0x12c/0x26e kernel/softirq.c:558
run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
kthread+0x262/0x280 kernel/kthread.c:319
ret_from_fork+0x1f/0x30

value changed: 0x00000035 -> 0x00000044

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================

which is a data race, but not an important one. This is just updating the
percpu alloc hint, and the reader of that hint doesn't ever require it to
be valid.

Just annotate it with data_race() to silence this one.

Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1aec5e4a 08-Oct-2021 Jens Axboe <axboe@kernel.dk>

sbitmap: add helper to clear a batch of tags

sbitmap currently only supports clearing tags one-by-one, add a helper
that allows the caller to pass in an array of tags to clear.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9672b0d4 09-Oct-2021 Jens Axboe <axboe@kernel.dk>

sbitmap: add __sbitmap_queue_get_batch()

The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.

An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.

A batch is specific to an individual sbitmap_map, hence it cannot be
larger than that. The requested number of tags is automatically reduced
to the max that can be satisfied with a single map.

On failure, 0 is returned. Caller should fall back to single tag
allocation at that point/

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9dbbc3b9 07-Jul-2021 Zhen Lei <thunder.leizhen@huawei.com>

lib: fix spelling mistakes

Fix some spelling mistakes in comments:
permanentely ==> permanently
wont ==> won't
remaning ==> remaining
succed ==> succeed
shouldnt ==> shouldn't
alpha-numeric ==> alphanumeric
storeing ==> storing
funtion ==> function
documenation ==> documentation
Determin ==> Determine
intepreted ==> interpreted
ammount ==> amount
obious ==> obvious
interupts ==> interrupts
occured ==> occurred
asssociated ==> associated
taking into acount ==> taking into account
squence ==> sequence
stil ==> still
contiguos ==> contiguous
matchs ==> matches

Link: https://lkml.kernel.org/r/20210607072555.12416-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d13b1ea 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Add sbitmap_calculate_shift() helper

Move code for calculating default shift into a public helper which can be
used by SCSI.

Link: https://lore.kernel.org/r/20210122023317.687987-7-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# cbb9950b 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Export sbitmap_weight

SCSI's .device_busy will be converted to sbitmap and sbitmap_weight is
needed. Export the helper.

The only existing user of sbitmap_weight() uses it to find out how many
bits are set and not cleared. Align sbitmap_weight() meaning with this
usage model.

Link: https://lore.kernel.org/r/20210122023317.687987-6-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# c548e62b 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Move allocation hint into sbitmap

Allocation hint should have belonged to sbitmap. Also, when sbitmap's depth
is high and there is no need to use mulitple wakeup queues, user can
benefit from percpu allocation hint too.

Move allocation hint into sbitmap, then SCSI device queue can benefit from
allocation hint when converting to plain sbitmap.

Convert vhost/scsi.c to use sbitmap allocation with percpu alloc hint. This
is more efficient than the previous approach.

Link: https://lore.kernel.org/r/20210122023317.687987-5-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# bf2c4282 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Add helpers for updating allocation hint

Add helpers for updating allocation hint so that we can avoid duplicate
code.

Prepare for moving allocation hint into sbitmap.

Link: https://lore.kernel.org/r/20210122023317.687987-4-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# efe1f3a1 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Maintain allocation round_robin in sbitmap

Currently the allocation round_robin info is maintained by sbitmap_queue.

However, bit allocation really belongs to sbitmap. Move it there.

Link: https://lore.kernel.org/r/20210122023317.687987-3-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# 0eff1f1a 22-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

sbitmap: simplify wrap check

__sbitmap_get_word() doesn't warp if it's starting from the beginning
(i.e. initial hint is 0). Instead of stashing the original hint just set
@wrap accordingly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c3250c8d 22-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

sbitmap: replace CAS with atomic and

sbitmap_deferred_clear() does CAS loop to propagate cleared bits,
replace it with equivalent atomic bitwise and. That's slightly faster
and makes wait-free instead of lock-free as before.

The atomic can be relaxed (i.e. barrier-less) because following
sbitmap_get*() deal with synchronisation, see comments in
sbitmap_queue_clear().

It's ok to cast to atomic_long_t, that's what bitops/lock.h does.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 661d4f55 22-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

sbitmap: remove swap_lock

map->swap_lock protects map->cleared from concurrent modification,
however sbitmap_deferred_clear() is already atomically drains it, so
it's guaranteed to not loose bits on concurrent
sbitmap_deferred_clear().

A one threaded tag heavy test on top of nullbk showed ~1.5% t-put
increase, and 3% -> 1% cycle reduction of sbitmap_get() according to perf.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b78beea0 22-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

sbitmap: optimise sbitmap_deferred_clear()

Because of spinlocks and atomics sbitmap_deferred_clear() have to reload
&sb->map[index] on each access even though the map address won't change.
Pass in sbitmap_word instead of {sb, index}, so it's cached in a
variable. It also improves code generation of
sbitmap_find_bit_in_index().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6bf0eb55 01-Jul-2020 John Garry <john.garry@huawei.com>

sbitmap: Consider cleared bits in sbitmap_bitmap_show()

sbitmap works by maintaining separate bitmaps of set and cleared bits.
The set bits are cleared in a batch, to save the burden of continuously
locking the "word" map to unset.

sbitmap_bitmap_show() only shows the set bits (in "word"), which is not
too much use, so mask out the cleared bits.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# df034c93 17-Dec-2019 David Jeffery <djeffery@redhat.com>

sbitmap: only queue kyber's wait callback if not already active

Under heavy loads where the kyber I/O scheduler hits the token limits for
its scheduling domains, kyber can become stuck. When active requests
complete, kyber may not be woken up leaving the I/O requests in kyber
stuck.

This stuck state is due to a race condition with kyber and the sbitmap
functions it uses to run a callback when enough requests have completed.
The running of a sbt_wait callback can race with the attempt to insert the
sbt_wait. Since sbitmap_del_wait_queue removes the sbt_wait from the list
first then sets the sbq field to NULL, kyber can see the item as not on a
list but the call to sbitmap_add_wait_queue will see sbq as non-NULL. This
results in the sbt_wait being inserted onto the wait list but ws_active
doesn't get incremented. So the sbitmap queue does not know there is a
waiter on a wait list.

Since sbitmap doesn't think there is a waiter, kyber may never be
informed that there are domain tokens available and the I/O never advances.
With the sbt_wait on a wait list, kyber believes it has an active waiter
so cannot insert a new waiter when reaching the domain's full state.

This race can be fixed by only adding the sbt_wait to the queue if the
sbq field is NULL. If sbq is not NULL, there is already an action active
which will trigger the re-running of kyber. Let it run and add the
sbt_wait to the wait list if still needing to wait.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reported-by: John Pittman <jpittman@redhat.com>
Tested-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 708edafa 13-Nov-2019 John Garry <john.garry@huawei.com>

sbitmap: Delete sbitmap_any_bit_clear()

Since the only caller of this function has been deleted, delete this one
also.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 41723288 23-May-2019 Pavel Begunkov <asml.silence@gmail.com>

sbitmap: Replace cmpxchg with xchg

cmpxchg() with an immediate value could be replaced with less expensive
xchg(). The same true if new value don't _depend_ on the old one.

In the second block, atomic_cmpxchg() return value isn't checked, so
after atomic_cmpxchg() -> atomic_xchg() conversion it could be replaced
with atomic_set(). Comparison with atomic_read() in the second chunk was
left as an optimisation (if that was the initial intention).

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0fc479b1 29-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 328

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license v2 as published
by the free software foundation this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not see https www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 2 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.923873561@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a0934fd2 20-May-2019 Andrea Parri <andrea.parri@amarulasolutions.com>

sbitmap: fix improper use of smp_mb__before_atomic()

This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic_set() primitive.

Replace the barrier with an smp_mb().

Fixes: 6c0ca7ae292ad ("sbitmap: fix wakeup hang after sbq resize")
Cc: stable@vger.kernel.org
Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e6d1fa58 21-Mar-2019 Ming Lei <ming.lei@redhat.com>

sbitmap: order READ/WRITE freed instance and setting clear bit

Inside sbitmap_queue_clear(), once the clear bit is set, it will be
visiable to allocation path immediately. Meantime READ/WRITE on old
associated instance(such as request in case of blk-mq) may be
out-of-order with the setting clear bit, so race with re-allocation
may be triggered.

Adds one memory barrier for ordering READ/WRITE of the freed associated
instance with setting clear bit for avoiding race with re-allocation.

The following kernel oops triggerd by block/006 on aarch64 may be fixed:

[ 142.330954] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000330
[ 142.338794] Mem abort info:
[ 142.341554] ESR = 0x96000005
[ 142.344632] Exception class = DABT (current EL), IL = 32 bits
[ 142.350500] SET = 0, FnV = 0
[ 142.353544] EA = 0, S1PTW = 0
[ 142.356678] Data abort info:
[ 142.359528] ISV = 0, ISS = 0x00000005
[ 142.363343] CM = 0, WnR = 0
[ 142.366305] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000002a3c51c0
[ 142.372983] [0000000000000330] pgd=0000000000000000, pud=0000000000000000
[ 142.379777] Internal error: Oops: 96000005 [#1] SMP
[ 142.384613] Modules linked in: null_blk ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp vfat fat rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm libiscsi ib_umad scsi_transport_iscsi ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core sbsa_gwdt crct10dif_ce ghash_ce ipmi_ssif sha2_ce ipmi_devintf sha256_arm64 sg sha1_ce ipmi_msghandler ip_tables xfs libcrc32c mlx5_core sdhci_acpi mlxfw ahci_platform at803x sdhci libahci_platform qcom_emac mmc_core hdma hdma_mgmt i2c_dev [last unloaded: null_blk]
[ 142.429753] CPU: 7 PID: 1983 Comm: fio Not tainted 5.0.0.cki #2
[ 142.449458] pstate: 00400005 (nzcv daif +PAN -UAO)
[ 142.454239] pc : __blk_mq_free_request+0x4c/0xa8
[ 142.458830] lr : blk_mq_free_request+0xec/0x118
[ 142.463344] sp : ffff00003360f6a0
[ 142.466646] x29: ffff00003360f6a0 x28: ffff000010e70000
[ 142.471941] x27: ffff801729a50048 x26: 0000000000010000
[ 142.477232] x25: ffff00003360f954 x24: ffff7bdfff021440
[ 142.482529] x23: 0000000000000000 x22: 00000000ffffffff
[ 142.487830] x21: ffff801729810000 x20: 0000000000000000
[ 142.493123] x19: ffff801729a50000 x18: 0000000000000000
[ 142.498413] x17: 0000000000000000 x16: 0000000000000001
[ 142.503709] x15: 00000000000000ff x14: ffff7fe000000000
[ 142.509003] x13: ffff8017dcde09a0 x12: 0000000000000000
[ 142.514308] x11: 0000000000000001 x10: 0000000000000008
[ 142.519597] x9 : ffff8017dcde09a0 x8 : 0000000000002000
[ 142.524889] x7 : ffff8017dcde0a00 x6 : 000000015388f9be
[ 142.530187] x5 : 0000000000000001 x4 : 0000000000000000
[ 142.535478] x3 : 0000000000000000 x2 : 0000000000000000
[ 142.540777] x1 : 0000000000000001 x0 : ffff00001041b194
[ 142.546071] Process fio (pid: 1983, stack limit = 0x000000006460a0ea)
[ 142.552500] Call trace:
[ 142.554926] __blk_mq_free_request+0x4c/0xa8
[ 142.559181] blk_mq_free_request+0xec/0x118
[ 142.563352] blk_mq_end_request+0xfc/0x120
[ 142.567444] end_cmd+0x3c/0xa8 [null_blk]
[ 142.571434] null_complete_rq+0x20/0x30 [null_blk]
[ 142.576194] blk_mq_complete_request+0x108/0x148
[ 142.580797] null_handle_cmd+0x1d4/0x718 [null_blk]
[ 142.585662] null_queue_rq+0x60/0xa8 [null_blk]
[ 142.590171] blk_mq_try_issue_directly+0x148/0x280
[ 142.594949] blk_mq_try_issue_list_directly+0x9c/0x108
[ 142.600064] blk_mq_sched_insert_requests+0xb0/0xd0
[ 142.604926] blk_mq_flush_plug_list+0x16c/0x2a0
[ 142.609441] blk_flush_plug_list+0xec/0x118
[ 142.613608] blk_finish_plug+0x3c/0x4c
[ 142.617348] blkdev_direct_IO+0x3b4/0x428
[ 142.621336] generic_file_read_iter+0x84/0x180
[ 142.625761] blkdev_read_iter+0x50/0x78
[ 142.629579] aio_read.isra.6+0xf8/0x190
[ 142.633409] __io_submit_one.isra.8+0x148/0x738
[ 142.637912] io_submit_one.isra.9+0x88/0xb8
[ 142.642078] __arm64_sys_io_submit+0xe0/0x238
[ 142.646428] el0_svc_handler+0xa0/0x128
[ 142.650238] el0_svc+0x8/0xc
[ 142.653104] Code: b9402a63 f9000a7f 3100047f 540000a0 (f9419a81)
[ 142.659202] ---[ end trace 467586bc175eb09d ]---

Fixes: ea86ea2cdced20057da ("sbitmap: ammortize cost of clearing bits")
Reported-and-bisected_and_tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fe76fc6a 14-Jan-2019 Ming Lei <ming.lei@redhat.com>

sbitmap: Protect swap_lock from hardirq

Because we may call blk_mq_get_driver_tag() directly from
blk_mq_dispatch_rq_list() without holding any lock, then HARDIRQ may
come and the above DEADLOCK is triggered.

Commit ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") tries to
fix this issue by using 'spin_lock_bh', which isn't enough because we
complete request from hardirq context direclty in case of multiqueue.

Cc: Clark Williams <williams@redhat.com>
Fixes: ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 37198768 13-Jan-2019 Steven Rostedt (VMware) <rostedt@goodmis.org>

sbitmap: Protect swap_lock from softirqs

The swap_lock used by sbitmap has a chain with locks taken from softirq,
but the swap_lock is not protected from being preempted by softirqs.

A chain exists of:

sbq->ws[i].wait -> dispatch_wait_lock -> swap_lock

Where the sbq->ws[i].wait lock can be taken from softirq context, which
means all locks below it in the chain must also be protected from
softirqs.

Reported-by: Clark Williams <williams@redhat.com>
Fixes: 58ab5e32e6fd ("sbitmap: silence bogus lockdep IRQ warning")
Fixes: ea86ea2cdced ("sbitmap: amortize cost of clearing bits")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9f6b7ef6 20-Dec-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: add helpers for add/del wait queue handling

After commit 5d2ee7122c73, users of sbitmap that need wait queue
handling must use the provided helpers. But we only added
prepare_to_wait()/finish_wait() style helpers, add the equivalent
add_wait_queue/list_del wrappers as we..

This is needed to ensure kyber plays by the sbitmap waitqueue
rules.

Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b2dbff1b 11-Dec-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: flush deferred clears for resize and shallow gets

We're missing a deferred clear off the shallow get, which can cause
a hang. Additionally, when we resize the sbitmap, we should also
flush deferred clears for good measure.

Ensure we have full coverage on batch clears, even for paths where
we would not be doing deferred clear. This makes it less error
prone for future additions.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 58ab5e32 09-Dec-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: silence bogus lockdep IRQ warning

Ming reports that lockdep spews the following trace. What this
essentially says is that the sbitmap swap_lock was used inconsistently
in IRQ enabled and disabled context, and that is usually indicative of a
bug that will cause a deadlock.

For this case, it's a false positive. The swap_lock is used from process
context only, when we swap the bits in the word and cleared mask. We
also end up doing that when we are getting a driver tag, from the
blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
IRQs disabled. However, this isn't from an actual IRQ, it's still
process context.

In lieu of a better way to fix this, simply always disable interrupts
when grabbing the swap_lock if lockdep is enabled.

[ 100.967642] ================start test sanity/001================
[ 101.238280] null: module loaded
[ 106.093735]
[ 106.094012] =====================================================
[ 106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[ 106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
[ 106.096551] -----------------------------------------------------
[ 106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 106.098231] 000000004c43fa71
(&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
[ 106.099431]
[ 106.099431] and this task is already holding:
[ 106.100229] 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.101630] which would create a new lock dependency:
[ 106.102326] (&(&hctx->dispatch_wait_lock)->rlock){....} ->
(&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 106.103553]
[ 106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 106.104580] (&sbq->ws[i].wait){..-.}
[ 106.104582]
[ 106.104582] ... which became SOFTIRQ-irq-safe at:
[ 106.105751] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.106284] __wake_up_common_lock+0x119/0x1b9
[ 106.106825] sbitmap_queue_wake_up+0x33f/0x383
[ 106.107456] sbitmap_queue_clear+0x4c/0x9a
[ 106.108046] __blk_mq_free_request+0x188/0x1d3
[ 106.108581] blk_mq_free_request+0x23b/0x26b
[ 106.109102] scsi_end_request+0x345/0x5d7
[ 106.109587] scsi_io_completion+0x4b5/0x8f0
[ 106.110099] scsi_finish_command+0x412/0x456
[ 106.110615] scsi_softirq_done+0x23f/0x29b
[ 106.111115] blk_done_softirq+0x2a7/0x2e6
[ 106.111608] __do_softirq+0x360/0x6ad
[ 106.112062] run_ksoftirqd+0x2f/0x5b
[ 106.112499] smpboot_thread_fn+0x3a5/0x3db
[ 106.113000] kthread+0x1d4/0x1e4
[ 106.113457] ret_from_fork+0x3a/0x50
[ 106.113969]
[ 106.113969] to a SOFTIRQ-irq-unsafe lock:
[ 106.114672] (&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 106.114674]
[ 106.114674] ... which became SOFTIRQ-irq-unsafe at:
[ 106.116000] ...
[ 106.116003] _raw_spin_lock+0x33/0x64
[ 106.116676] sbitmap_get+0xd5/0x22c
[ 106.117134] __sbitmap_queue_get+0xe8/0x177
[ 106.117731] __blk_mq_get_tag+0x1e6/0x22d
[ 106.118286] blk_mq_get_tag+0x1db/0x6e4
[ 106.118756] blk_mq_get_driver_tag+0x161/0x258
[ 106.119383] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.120043] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.120607] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.121234] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.121781] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.122366] blk_mq_run_hw_queue+0x151/0x187
[ 106.122887] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.123492] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.124042] blk_flush_plug_list+0x392/0x3d7
[ 106.124557] blk_finish_plug+0x37/0x4f
[ 106.125019] read_pages+0x3ef/0x430
[ 106.125446] __do_page_cache_readahead+0x18e/0x2fc
[ 106.126027] force_page_cache_readahead+0x121/0x133
[ 106.126621] page_cache_sync_readahead+0x35f/0x3bb
[ 106.127229] generic_file_buffered_read+0x410/0x1860
[ 106.127932] __vfs_read+0x319/0x38f
[ 106.128415] vfs_read+0xd2/0x19a
[ 106.128817] ksys_read+0xb9/0x135
[ 106.129225] do_syscall_64+0x140/0x385
[ 106.129684] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.130292]
[ 106.130292] other info that might help us debug this:
[ 106.130292]
[ 106.131226] Chain exists of:
[ 106.131226] &sbq->ws[i].wait -->
&(&hctx->dispatch_wait_lock)->rlock -->
&(&sb->map[i].swap_lock)->rlock
[ 106.131226]
[ 106.132865] Possible interrupt unsafe locking scenario:
[ 106.132865]
[ 106.133659] CPU0 CPU1
[ 106.134194] ---- ----
[ 106.134733] lock(&(&sb->map[i].swap_lock)->rlock);
[ 106.135318] local_irq_disable();
[ 106.136014] lock(&sbq->ws[i].wait);
[ 106.136747]
lock(&(&hctx->dispatch_wait_lock)->rlock);
[ 106.137742] <Interrupt>
[ 106.138110] lock(&sbq->ws[i].wait);
[ 106.138625]
[ 106.138625] *** DEADLOCK ***
[ 106.138625]
[ 106.139430] 3 locks held by fio/1043:
[ 106.139947] #0: 0000000076ff0fd9 (rcu_read_lock){....}, at:
hctx_lock+0x29/0xe8
[ 106.140813] #1: 000000002feb1016 (&sbq->ws[i].wait){..-.}, at:
blk_mq_dispatch_rq_list+0x4ad/0xd7c
[ 106.141877] #2: 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.143267]
[ 106.143267] the dependencies between SOFTIRQ-irq-safe lock and the
holding lock:
[ 106.144351] -> (&sbq->ws[i].wait){..-.} ops: 82 {
[ 106.144926] IN-SOFTIRQ-W at:
[ 106.145314] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.146042] __wake_up_common_lock+0x119/0x1b9
[ 106.146785] sbitmap_queue_wake_up+0x33f/0x383
[ 106.147567] sbitmap_queue_clear+0x4c/0x9a
[ 106.148379] __blk_mq_free_request+0x188/0x1d3
[ 106.149148] blk_mq_free_request+0x23b/0x26b
[ 106.149864] scsi_end_request+0x345/0x5d7
[ 106.150546] scsi_io_completion+0x4b5/0x8f0
[ 106.151367] scsi_finish_command+0x412/0x456
[ 106.152157] scsi_softirq_done+0x23f/0x29b
[ 106.152855] blk_done_softirq+0x2a7/0x2e6
[ 106.153537] __do_softirq+0x360/0x6ad
[ 106.154280] run_ksoftirqd+0x2f/0x5b
[ 106.155020] smpboot_thread_fn+0x3a5/0x3db
[ 106.155828] kthread+0x1d4/0x1e4
[ 106.156526] ret_from_fork+0x3a/0x50
[ 106.157267] INITIAL USE at:
[ 106.157713] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.158542] prepare_to_wait_exclusive+0xa8/0x215
[ 106.159421] blk_mq_get_tag+0x34f/0x6e4
[ 106.160186] blk_mq_get_request+0x48e/0xaef
[ 106.160997] blk_mq_make_request+0x27e/0xbd2
[ 106.161828] generic_make_request+0x4d1/0x873
[ 106.162661] submit_bio+0x20c/0x253
[ 106.163379] mpage_bio_submit+0x44/0x4b
[ 106.164142] mpage_readpages+0x3c2/0x407
[ 106.164919] read_pages+0x13a/0x430
[ 106.165633] __do_page_cache_readahead+0x18e/0x2fc
[ 106.166530] force_page_cache_readahead+0x121/0x133
[ 106.167439] page_cache_sync_readahead+0x35f/0x3bb
[ 106.168337] generic_file_buffered_read+0x410/0x1860
[ 106.169255] __vfs_read+0x319/0x38f
[ 106.169977] vfs_read+0xd2/0x19a
[ 106.170662] ksys_read+0xb9/0x135
[ 106.171356] do_syscall_64+0x140/0x385
[ 106.172120] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.173051] }
[ 106.173308] ... key at: [<ffffffff85094600>] __key.26481+0x0/0x40
[ 106.174219] ... acquired at:
[ 106.174646] _raw_spin_lock+0x33/0x64
[ 106.175183] blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.175843] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.176518] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.177262] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.177900] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.178591] blk_mq_run_hw_queue+0x151/0x187
[ 106.179207] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.179926] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.180571] blk_flush_plug_list+0x392/0x3d7
[ 106.181187] blk_finish_plug+0x37/0x4f
[ 106.181737] __se_sys_io_submit+0x171/0x304
[ 106.182346] do_syscall_64+0x140/0x385
[ 106.182895] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.183607]
[ 106.183830] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[ 106.184691] INITIAL USE at:
[ 106.185119] _raw_spin_lock+0x33/0x64
[ 106.185838] blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.186697] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.187551] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.188481] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.189307] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.190189] blk_mq_run_hw_queue+0x151/0x187
[ 106.190989] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.191902] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.192739] blk_flush_plug_list+0x392/0x3d7
[ 106.193535] blk_finish_plug+0x37/0x4f
[ 106.194269] __se_sys_io_submit+0x171/0x304
[ 106.195059] do_syscall_64+0x140/0x385
[ 106.195794] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.196705] }
[ 106.196950] ... key at: [<ffffffff84880620>] __key.51231+0x0/0x40
[ 106.197853] ... acquired at:
[ 106.198270] lock_acquire+0x280/0x2f3
[ 106.198806] _raw_spin_lock+0x33/0x64
[ 106.199337] sbitmap_get+0xd5/0x22c
[ 106.199850] __sbitmap_queue_get+0xe8/0x177
[ 106.200450] __blk_mq_get_tag+0x1e6/0x22d
[ 106.201035] blk_mq_get_tag+0x1db/0x6e4
[ 106.201589] blk_mq_get_driver_tag+0x161/0x258
[ 106.202237] blk_mq_dispatch_rq_list+0x5b9/0xd7c
[ 106.202902] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.203572] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.204316] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.204956] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.205649] blk_mq_run_hw_queue+0x151/0x187
[ 106.206269] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.206997] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.207644] blk_flush_plug_list+0x392/0x3d7
[ 106.208264] blk_finish_plug+0x37/0x4f
[ 106.208814] __se_sys_io_submit+0x171/0x304
[ 106.209415] do_syscall_64+0x140/0x385
[ 106.209965] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.210684]
[ 106.210904]
[ 106.210904] the dependencies between the lock to be acquired
[ 106.210905] and SOFTIRQ-irq-unsafe lock:
[ 106.212541] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 1969 {
[ 106.213393] HARDIRQ-ON-W at:
[ 106.213840] _raw_spin_lock+0x33/0x64
[ 106.214570] sbitmap_get+0xd5/0x22c
[ 106.215282] __sbitmap_queue_get+0xe8/0x177
[ 106.216086] __blk_mq_get_tag+0x1e6/0x22d
[ 106.216876] blk_mq_get_tag+0x1db/0x6e4
[ 106.217627] blk_mq_get_driver_tag+0x161/0x258
[ 106.218465] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.219326] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.220198] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.221138] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.221975] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.222874] blk_mq_run_hw_queue+0x151/0x187
[ 106.223686] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.224597] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.225444] blk_flush_plug_list+0x392/0x3d7
[ 106.226255] blk_finish_plug+0x37/0x4f
[ 106.227006] read_pages+0x3ef/0x430
[ 106.227717] __do_page_cache_readahead+0x18e/0x2fc
[ 106.228595] force_page_cache_readahead+0x121/0x133
[ 106.229491] page_cache_sync_readahead+0x35f/0x3bb
[ 106.230373] generic_file_buffered_read+0x410/0x1860
[ 106.231277] __vfs_read+0x319/0x38f
[ 106.231986] vfs_read+0xd2/0x19a
[ 106.232666] ksys_read+0xb9/0x135
[ 106.233350] do_syscall_64+0x140/0x385
[ 106.234097] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.235012] SOFTIRQ-ON-W at:
[ 106.235460] _raw_spin_lock+0x33/0x64
[ 106.236195] sbitmap_get+0xd5/0x22c
[ 106.236913] __sbitmap_queue_get+0xe8/0x177
[ 106.237715] __blk_mq_get_tag+0x1e6/0x22d
[ 106.238488] blk_mq_get_tag+0x1db/0x6e4
[ 106.239244] blk_mq_get_driver_tag+0x161/0x258
[ 106.240079] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.240937] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.241806] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.242751] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.243579] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.244469] blk_mq_run_hw_queue+0x151/0x187
[ 106.245277] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.246191] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.247044] blk_flush_plug_list+0x392/0x3d7
[ 106.247859] blk_finish_plug+0x37/0x4f
[ 106.248749] read_pages+0x3ef/0x430
[ 106.249463] __do_page_cache_readahead+0x18e/0x2fc
[ 106.250357] force_page_cache_readahead+0x121/0x133
[ 106.251263] page_cache_sync_readahead+0x35f/0x3bb
[ 106.252157] generic_file_buffered_read+0x410/0x1860
[ 106.253084] __vfs_read+0x319/0x38f
[ 106.253808] vfs_read+0xd2/0x19a
[ 106.254488] ksys_read+0xb9/0x135
[ 106.255186] do_syscall_64+0x140/0x385
[ 106.255943] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.256867] INITIAL USE at:
[ 106.257300] _raw_spin_lock+0x33/0x64
[ 106.258033] sbitmap_get+0xd5/0x22c
[ 106.258747] __sbitmap_queue_get+0xe8/0x177
[ 106.259542] __blk_mq_get_tag+0x1e6/0x22d
[ 106.260320] blk_mq_get_tag+0x1db/0x6e4
[ 106.261072] blk_mq_get_driver_tag+0x161/0x258
[ 106.261902] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.262762] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.263626] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.264571] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.265409] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.266302] blk_mq_run_hw_queue+0x151/0x187
[ 106.267111] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.268028] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.268878] blk_flush_plug_list+0x392/0x3d7
[ 106.269694] blk_finish_plug+0x37/0x4f
[ 106.270432] read_pages+0x3ef/0x430
[ 106.271139] __do_page_cache_readahead+0x18e/0x2fc
[ 106.272040] force_page_cache_readahead+0x121/0x133
[ 106.272932] page_cache_sync_readahead+0x35f/0x3bb
[ 106.273811] generic_file_buffered_read+0x410/0x1860
[ 106.274709] __vfs_read+0x319/0x38f
[ 106.275407] vfs_read+0xd2/0x19a
[ 106.276074] ksys_read+0xb9/0x135
[ 106.276764] do_syscall_64+0x140/0x385
[ 106.277500] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.278417] }
[ 106.278676] ... key at: [<ffffffff85094640>] __key.26212+0x0/0x40
[ 106.279586] ... acquired at:
[ 106.280026] lock_acquire+0x280/0x2f3
[ 106.280559] _raw_spin_lock+0x33/0x64
[ 106.281101] sbitmap_get+0xd5/0x22c
[ 106.281610] __sbitmap_queue_get+0xe8/0x177
[ 106.282221] __blk_mq_get_tag+0x1e6/0x22d
[ 106.282809] blk_mq_get_tag+0x1db/0x6e4
[ 106.283368] blk_mq_get_driver_tag+0x161/0x258
[ 106.284018] blk_mq_dispatch_rq_list+0x5b9/0xd7c
[ 106.284685] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.285371] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.286135] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.286806] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.287515] blk_mq_run_hw_queue+0x151/0x187
[ 106.288149] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.289041] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.289912] blk_flush_plug_list+0x392/0x3d7
[ 106.290590] blk_finish_plug+0x37/0x4f
[ 106.291238] __se_sys_io_submit+0x171/0x304
[ 106.291864] do_syscall_64+0x140/0x385
[ 106.292534] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5d2ee712 29-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: optimize wakeup check

Even if we have no waiters on any of the sbitmap_queue wait states, we
still have to loop every entry to check. We do this for every IO, so
the cost adds up.

Shift a bit of the cost to the slow path, when we actually have waiters.
Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
an internal count of how many are currently active. Then we can simply
check this count in sbq_wake_ptr() and not have to loop if we don't
have any sleepers.

Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea86ea2c 30-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: ammortize cost of clearing bits

sbitmap maintains a set of words that we use to set and clear bits, with
each bit representing a tag for blk-mq. Even though we spread the bits
out and maintain a hint cache, one particular bit allocated will end up
being cleared in the exact same spot.

This introduces batched clearing of bits. Instead of clearing a given
bit, the same bit is set in a cleared/free mask instead. If we fail
allocating a bit from a given word, then we check the free mask, and
batch move those cleared bits at that time. This trades 64 atomic bitops
for 2 cmpxchg().

In a threaded poll test case, half the overhead of getting and clearing
tags is removed with this change. On another poll test case with a
single thread, performance is unchanged.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27fae429 29-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: don't loop for find_next_zero_bit() for !round_robin

If we aren't forced to do round robin tag allocation, just use the
allocation hint to find the index for the tag word, don't use it for the
offset inside the word. This avoids a potential extra round trip in the
bit looping, and since we're fetching this cacheline, we may as well
check the whole word from the start.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 590b5b7d 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kzalloc_node() -> kcalloc_node()

The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:

kzalloc_node(a * b, gfp, node)

with:
kcalloc_node(a * b, gfp, node)

as well as handling cases of:

kzalloc_node(a * b * c, gfp, node)

with:

kzalloc_node(array3_size(a, b, c), gfp, node)

as it's slightly less ugly than:

kcalloc_node(array_size(a, b), c, gfp, node)

This does, however, attempt to ignore constant size factors like:

kzalloc_node(4 * 1024, gfp, node)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# e6fc4649 24-May-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid starving tag allocation after allocating process migrates

When the allocation process is scheduled back and the mapped hw queue is
changed, fake one extra wake up on previous queue for compensating wake
up miss, so other allocations on the previous queue won't be starved.

This patch fixes one request allocation hang issue, which can be
triggered easily in case of very low nr_request.

The race is as follows:

1) 2 hw queues, nr_requests are 2, and wake_batch is one

2) there are 3 waiters on hw queue 0

3) two in-flight requests in hw queue 0 are completed, and only two
waiters of 3 are waken up because of wake_batch, but both the two
waiters can be scheduled to another CPU and cause to switch to hw
queue 1

4) then the 3rd waiter will wait for ever, since no in-flight request
is in hw queue 0 any more.

5) this patch fixes it by the fake wakeup when waiter is scheduled to
another hw queue

Cc: <stable@vger.kernel.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Modified commit message to make it clearer, and make it apply on
top of the 4.18 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c854ab57 14-May-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: fix race in wait batch accounting

If we have multiple callers of sbq_wake_up(), we can end up in a
situation where the wait_cnt will continually go more and more
negative. Consider the case where our wake batch is 1, hence
wait_cnt will start out as 1.

wait_cnt == 1

CPU0 CPU1
atomic_dec_return(), cnt == 0
atomic_dec_return(), cnt == -1
cmpxchg(-1, 0) (succeeds)
[wait_cnt now 0]
cmpxchg(0, 1) (fails)

This ends up with wait_cnt being 0, we'll wakeup immediately
next time. Going through the same loop as above again, and
we'll have wait_cnt -1.

For the case where we have a larger wake batch, the only
difference is that the starting point will be higher. We'll
still end up with continually smaller batch wakeups, which
defeats the purpose of the rolling wakeups.

Always reset the wait_cnt to the batch value. Then it doesn't
matter who wins the race. But ensure that whomever does win
the race is the one that increments the ws index and wakes up
our batch count, loser gets to call __sbq_wake_up() again to
account his wakeups towards the next active wait state index.

Fixes: 6c0ca7ae292a ("sbitmap: fix wakeup hang after sbq resize")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 61445b56 09-May-2018 Omar Sandoval <osandov@fb.com>

sbitmap: warn if using smaller shallow depth than was setup

Make sure the user passed the right value to
sbitmap_queue_min_shallow_depth().

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a3275539 09-May-2018 Omar Sandoval <osandov@fb.com>

sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()

The sbitmap queue wake batch is calculated such that once allocations
start blocking, all of the bits which are already allocated must be
enough to fulfill the batch counters of all of the waitqueues. However,
the shallow allocation depth can break this invariant, since we block
before our full depth is being utilized. Add
sbitmap_queue_min_shallow_depth(), which saves the minimum shallow depth
the sbq will use, and update sbq_calc_wake_batch() to take it into
account.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ace53f1 27-Feb-2018 Omar Sandoval <osandov@fb.com>

sbitmap: use test_and_set_bit_lock()/clear_bit_unlock()

sbitmap_queue_get()/sbitmap_queue_clear() are used for
allocating/freeing a resource, so they should provide acquire/release
barrier semantics, respectively. sbitmap_get() currently contains a full
barrier, which is unnecessary, so use test_and_set_bit_lock() instead of
test_and_set_bit() (these are equivalent on x86_64). sbitmap_clear_bit()
does not imply any barriers, which is incorrect, as accesses of the
resource (e.g., request) could potentially get reordered to after the
clear_bit(). Introduce sbitmap_clear_bit_unlock() and use it for
sbitmap_queue_clear() (this only adds a compiler barrier on x86_64). The
other existing user of sbitmap_clear_bit() (the blk-mq software queue
pending map) is serialized through a spinlock and does not need this.

Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e5dff41 14-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: improve heavily contended tag case

Even with a number of waitqueues, we can get into a situation where we
are heavily contended on the waitqueue lock. I got a report on spc1
where we're spending seconds doing this. Arguably the use case is nasty,
I reproduce it with one device and 1000 threads banging on the device.
But that doesn't mean we shouldn't be handling it better.

What ends up happening is that a thread will fail to get a tag, add
itself to the waitqueue, and subsequently get woken up when a tag is
freed - only to find itself going back to sleep on the waitqueue.

Instead of waking all threads, use an exclusive wait and wake up our
sbitmap batch count instead. This seems to work well for me (massive
improvement for this use case), and it survives basic testing. But I
haven't fully verified it yet.

An additional improvement is running the queue and checking for a new
tag BEFORE needing to add ourselves to the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c05e6673 14-Apr-2017 Omar Sandoval <osandov@fb.com>

sbitmap: add sbitmap_get_shallow() operation

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# af8601ad 03-Feb-2017 Ingo Molnar <mingo@kernel.org>

kasan, sched/headers: Uninline kasan_enable/disable_current()

<linux/kasan.h> is a low level header that is included early
in affected kernel headers. But it includes <linux/sched.h>
which complicates the cleanup of sched.h dependencies.

But kasan.h has almost no need for sched.h: its only use of
scheduler functionality is in two inline functions which are
not used very frequently - so uninline kasan_enable_current()
and kasan_disable_current().

Also add a <linux/sched.h> dependency to a .c file that depended
on kasan.h including it.

This paves the way to remove the <linux/sched.h> include from kasan.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 24af1ccf 25-Jan-2017 Omar Sandoval <osandov@fb.com>

sbitmap: add helpers for dumping to a seq_file

This is useful debugging information that will be used in the blk-mq
debugfs directory.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>

Changed 'weight' to 'busy'.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 6c0ca7ae 18-Jan-2017 Omar Sandoval <osandov@fb.com>

sbitmap: fix wakeup hang after sbq resize

When we resize a struct sbitmap_queue, we update the wakeup batch size,
but we don't update the wait count in the struct sbq_wait_states. If we
resized down from a size which could use a bigger batch size, these
counts could be too large and cause us to miss necessary wakeups. To fix
this, update the wait counts when we resize (ensuring some careful
memory ordering so that it's safe w.r.t. concurrent clears).

This also fixes a theoretical issue where two threads could end up
bumping the wait count up by the batch size, which could also
potentially lead to hangs.

Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Fixes: 2971c35f3588 ("blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f66227de 18-Jan-2017 Omar Sandoval <osandov@fb.com>

sbitmap: use smp_mb__after_atomic() in sbq_wake_up()

We always do an atomic clear_bit() right before we call sbq_wake_up(),
so we can use smp_mb__after_atomic(). While we're here, comment the
memory barriers in here a little more.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 60658e0d 19-Sep-2016 Colin Ian King <colin.king@canonical.com>

sbitmap: initialize weight to zero

Variable weight is not being initialized to zero before it is
used to compute the weight sum. Ensure it is initialized to zero.

Found with static analysis with cppcheck:
[lib/sbitmap.c:177]: (error) Uninitialized variable: weight

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5c64a8df 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: don't update the allocation hint on clear after resize

If we have a bunch of high-numbered bits allocated and then we resize
the struct sbitmap_queue, when those bits get cleared, we'll update the
hint and then have to re-randomize it repeatedly. Avoid that by checking
that the cleared bit is still a valid hint. No measurable performance
difference in the common case.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 05fd095d 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: re-initialize allocation hints after resize

After a struct sbitmap_queue is resized smaller, the allocation hints
may still be set to bits beyond the new depth of the bitmap. This means
that, for example, if the number of blk-mq tags is reduced through
sysfs, more requests than the nominal queue depth may be in flight.

It's tempting to fix this at resize time by doing a one-time
reinitialization of the hints, but this can race with
__sbitmap_queue_get() updating the hint. Instead, check the hint before
we use it. This caused no measurable performance difference in my
synthetic benchmarks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 98d95416 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: randomize initial alloc_hint values

In order to get good cache behavior from a sbitmap, we want each CPU to
stick to its own cacheline(s) as much as possible. This might happen
naturally as the bitmap gets filled up and the alloc_hint values spread
out, but we really want this behavior from the start. blk-mq apparently
intended to do this, but the code to do this was never wired up. Get rid
of the dead code and make it part of the sbitmap library.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f4a644db 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: push alloc policy into sbitmap_queue

Again, there's no point in passing this in every time. Make it part of
struct sbitmap_queue and clean up the API.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 40aabb67 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: push per-cpu last_tag into sbitmap_queue

Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 48e28166 17-Sep-2016 Omar Sandoval <osandov@fb.com>

sbitmap: allocate wait queues on a specific node

The original bt_alloc() we converted from was using kzalloc(), not
kzalloc_node(), to allocate the wait queues. This was probably an
oversight, so fix it for sbitmap_queue_init_node().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 88459642 17-Sep-2016 Omar Sandoval <osandov@fb.com>

blk-mq: abstract tag allocation out into sbitmap library

This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>