History log of /linux-master/block/
Revision Date Author Comments
c0f7e49f 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Christoph:
- fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
- update a MAINTAINERS email address (Chaitanya Kulkarni)
- set QUEUE_FLAG_NOWAIT for nvme-multipath (me)
- handle errors from add_disk() (Luis Chamberlain)
- update the keep alive interval when kato is modified (Tatsuya Sasaki)
- fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
- do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
- only call synchronize_srcu when clearing current path (Daniel Wagner)
- revalidate paths during rescan (Hannes Reinecke)

- Split out fs/block_dev.c into block/fops.c and block/bdev.c, which
has been long overdue. Do this now before -rc1, to avoid annoying
conflicts caused by the move (Christoph)

- blk-throtl use-after-free fix (Li)

- Improve plug depth for multi-device plugs, greatly increasing md
resync performance (Song)

- blkdev_show() locking fix (Tetsuo)

- n64cart error check fix (Yang)

* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
n64cart: fix return value check in n64cart_probe()
blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
block: move fs/block_dev.c to block/bdev.c
block: split out operations on block special files
blk-throttle: fix UAF by deleting timer in blk_throtl_exit()
block: genhd: don't call blkdev_show() with major_names_lock held
nvme: update MAINTAINERS email address
nvme: add error handling support for add_disk()
nvme: only call synchronize_srcu when clearing current path
nvme: update keep alive interval when kato is modified
nvme-tcp: Do not reset transport on data digest errors
nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
nvmet: looks at the passthrough controller when initializing CAP
nvme: move nvme_multi_css into nvme.h
nvme-multipath: revalidate paths during rescan
nvme-multipath: set QUEUE_FLAG_NOWAIT


7f2a6a69 07-Sep-2021 Song Liu <songliubraving@fb.com>

blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues

Limiting the number of requests per blk_plug to BLK_MAX_REQUEST_COUNT hurts
performance for large md arrays. [1] shows that the resync speed of an md
array drops when the array has more than 16 HDDs.

Fix this by allowing more requests in the plug queue. The multiple_queues
flag is used so that the higher limit only applies to multiple-queue cases.

[1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/
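
Roughly, the higher limit boils down to a helper along the lines of the
sketch below (illustrative only; the helper name blk_plug_max_rq_count()
and its exact call site in the submit path are assumptions, not the
verbatim diff):

static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
{
	/* only raise the limit when the plug spans multiple queues */
	if (plug->multiple_queues)
		return BLK_MAX_REQUEST_COUNT * 4;
	return BLK_MAX_REQUEST_COUNT;
}

The plugging code then compares the plugged request count against this
value instead of against BLK_MAX_REQUEST_COUNT directly.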
Tested-by: Marcin Wanat <marcin.wanat@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0dca4462 07-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move fs/block_dev.c to block/bdev.c

Move it together with the rest of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd82cca7 07-Sep-2021 Christoph Hellwig <hch@lst.de>

block: split out operations on block special files

Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

884f0e84 07-Sep-2021 Li Jinlin <lijinlin3@huawei.com>

blk-throttle: fix UAF by deleting timer in blk_throtl_exit()

The pending timer has been set up in blk_throtl_init(). However, the
timer is not deleted in blk_throtl_exit(). This means that the timer
handler may still be running after freeing the timer, which would
result in a use-after-free.

Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit().
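
In sketch form (per the description above; the exact placement inside
blk_throtl_exit() may differ):

void blk_throtl_exit(struct request_queue *q)
{
	BUG_ON(!q->td);
	/* make sure the pending timer can no longer fire before freeing */
	del_timer_sync(&q->td->service_queue.pending_timer);
	throtl_shutdown_wq(q);
	...
}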

Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dfbb3409 07-Sep-2021 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

block: genhd: don't call blkdev_show() with major_names_lock held

If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
combinations), lockdep complains about a circular locking dependency at
__loop_clr_fd(), because major_names_lock serves as a locking-dependency
aggregation hub across multiple block modules.

======================================================
WARNING: possible circular locking dependency detected
5.14.0+ #757 Tainted: G E
------------------------------------------------------
systemd-udevd/7568 is trying to acquire lock:
ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560

but task is already holding lock:
ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&lo->lo_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_killable_nested+0x17/0x20
lo_open+0x23/0x50 [loop]
blkdev_get_by_dev+0x199/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #5 (&disk->open_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
bd_register_pending_holders+0x20/0x100
device_add_disk+0x1ae/0x390
loop_add+0x29c/0x2d0 [loop]
blk_request_module+0x5a/0xb0
blkdev_get_no_open+0x27/0xa0
blkdev_get_by_dev+0x5f/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #4 (major_names_lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
blkdev_show+0x19/0x80
devinfo_show+0x52/0x60
seq_read_iter+0x2d5/0x3e0
proc_reg_read_iter+0x41/0x80
vfs_read+0x2ac/0x330
ksys_read+0x6b/0xd0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&p->lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
seq_read_iter+0x37/0x3e0
generic_file_splice_read+0xf3/0x170
splice_direct_to_actor+0x14e/0x350
do_splice_direct+0x84/0xd0
do_sendfile+0x263/0x430
__se_sys_sendfile64+0x96/0xc0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#3){.+.+}-{0:0}:
lock_acquire+0xbe/0x1f0
lo_write_bvec+0x96/0x280 [loop]
loop_process_work+0xa68/0xc10 [loop]
process_one_work+0x293/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
lock_acquire+0xbe/0x1f0
process_one_work+0x280/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
validate_chain+0x1f0d/0x33e0
__lock_acquire+0x92d/0x1030
lock_acquire+0xbe/0x1f0
flush_workqueue+0x8c/0x560
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
__loop_clr_fd+0xb4/0x400 [loop]
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

2 locks held by systemd-udevd/7568:
#0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
#1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]

stack backtrace:
CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
Call Trace:
dump_stack_lvl+0x79/0xbf
print_circular_bug+0x5d6/0x5e0
? stack_trace_save+0x42/0x60
? save_trace+0x3d/0x2d0
check_noncircular+0x10b/0x120
validate_chain+0x1f0d/0x33e0
? __lock_acquire+0x953/0x1030
? __lock_acquire+0x953/0x1030
__lock_acquire+0x92d/0x1030
? flush_workqueue+0x70/0x560
lock_acquire+0xbe/0x1f0
? flush_workqueue+0x70/0x560
flush_workqueue+0x8c/0x560
? flush_workqueue+0x70/0x560
? sched_clock_cpu+0xe/0x1a0
? drain_workqueue+0x41/0x140
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
? blk_mq_freeze_queue_wait+0xac/0xd0
__loop_clr_fd+0xb4/0x400 [loop]
? __mutex_unlock_slowpath+0x35/0x230
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f0fd4c661f7
Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050

Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") was meant
to break the "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module results in a circular locking dependency again, due
to the shared major_names_lock mutex.

The simplest fix would be to call the probe function without holding
major_names_lock [1], but Christoph Hellwig does not like that idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break the
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.

Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1dbe7e38 06-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Was going to send this one in later this week, but given that -Werror
is now enabled (or at least available), the mq-deadline fix really
should go in for the folks hitting that.

- Ensure dd_queued() is only there if needed (Geert)

- Fix a kerneldoc warning for bio_alloc_kiocb()

- BFQ fix for queue merging

- loop locking fix (Tetsuo)"

* tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
loop: reduce the loop_ctl_mutex scope
bio: fix kerneldoc documentation for bio_alloc_kiocb()
block, bfq: honor already-setup queue merges
block/mq-deadline: Move dd_queued() to fix defined but not used warning


14726903 03-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"173 patches.

Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...


f358afc5 02-Sep-2021 Christoph Hellwig <hch@lst.de>

mm: remove flush_kernel_dcache_page

flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.

The only callers left are in the exec code, as all other previous callers
were incorrect since they could have dealt with page cache pages. Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures either does exactly the same thing, or contains
one or more of the following:

1) an optimization to defer the cache flush for page cache pages not
mapped into userspace
2) additional flushing for mapped page cache pages if cache aliases
are possible

Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ef47db1 03-Sep-2021 Jens Axboe <axboe@kernel.dk>

bio: fix kerneldoc documentation for bio_alloc_kiocb()

Apparently the last fixup got butter-fingered a bit; the correct variable
name is 'nr_vecs', not 'nr_iovecs'.

Link: https://lore.kernel.org/lkml/20210903164939.02f6e8c5@canb.auug.org.au/
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a9c9a6f7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, qla2xxx,
target, smartpqi, lpfc, mpt3sas).

The core change causing the most churn was replacing the command
request field request with a macro, allowing us to offset map to it
and remove the redundant field; the same was also done for the tag
field.

The most impactful change is the final removal of scsi_ioctl, which
has been deprecated for over a decade"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
scsi: ufs: ufs-exynos: Fix static checker warning
scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Copyright updates for 14.0.0.1 patches
scsi: lpfc: Update lpfc version to 14.0.0.1
scsi: lpfc: Add bsg support for retrieving adapter cmf data
scsi: lpfc: Add cmf_info sysfs entry
scsi: lpfc: Add debugfs support for cm framework buffers
scsi: lpfc: Add support for maintaining the cm statistics buffer
scsi: lpfc: Add rx monitoring statistics
scsi: lpfc: Add support for the CM framework
scsi: lpfc: Add cmfsync WQE support
scsi: lpfc: Add support for cm enablement buffer
scsi: lpfc: Add cm statistics buffer support
scsi: lpfc: Add EDC ELS support
scsi: lpfc: Expand FPIN and RDF receive logging
scsi: lpfc: Add MIB feature enablement support
scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
scsi: fc: Add EDC ELS definition
...


2d52c58b 02-Aug-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: honor already-setup queue merges

The function bfq_setup_merge prepares the merging between two
bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
the process that generated that I/O is disassociated from bfqq and
associated with new_bfqq (merging is actually a redirection). In this
respect, bfq_setup_merge increases new_bfqq->ref in advance, by the
number of processes that are expected to be associated with
new_bfqq.

Unfortunately, the stable-merging mechanism interferes with this
setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
before all the expected processes have been associated with
bfqq->new_bfqq, bfqq may happen to be stably merged with a different
queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
gets changed. So, some of the processes that have been already
accounted for in the ref counter of the previous new_bfqq will not be
associated with that queue. This creates an unbalance, because those
references will never be decremented.

This commit fixes this issue by reestablishing the previous, natural
behaviour: once bfqq->new_bfqq has been set, it will not be changed
until all expected redirections have occurred.
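
In sketch form, the merge-setup path now simply keeps honoring an existing
target (the exact function and placement of the check are assumptions):

	/* a merge is already set up: keep honoring it instead of re-merging */
	if (bfqq->new_bfqq)
		return bfqq->new_bfqq;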

Signed-off-by: Davide Zini <davidezini2@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

55a51ea1 30-Aug-2021 Geert Uytterhoeven <geert@linux-m68k.org>

block/mq-deadline: Move dd_queued() to fix defined but not used warning

If CONFIG_BLK_DEBUG_FS=n:

block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function]
274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
| ^~~~~~~~~

Fix this by moving dd_queued() just before the sole function that calls
it.

Fixes: 7b05bf771084ff78 ("Revert "block/mq-deadline: Prioritize high-priority requests"")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87045e65 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.

There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.

Features:

- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity

- idmapped mount support

- make mount with rescue=ibadroots more tolerant to partially damaged
trees

- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles

- zoned mode block group auto reclaim can be disabled via sysfs knob

Performance improvements:

- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)

- batching of delayed items can speed up creating many files

- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)

Fixes:

- various zoned mode fixes

- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems

Core:

- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount

- improved flushing logic, fixing early ENOSPC on some workloads

- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity

- new tree items for fs-verity
- descriptor item
- Merkle tree item

- inode operations extended to be namespace-aware

- cleanups and refactoring

Generic code changes:

- fs: new export filemap_fdatawrite_wbc

- fs: removed sync_inode

- block: bio_trim argument type fixups

- vfs: add namespace-aware lookup"

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...


3b629f8d 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block

Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.

It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"

* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio


67936911 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
"Nothing major in here - lots of good cleanups and tech debt handling,
which is also evident in the diffstats. In particular:

- Add disk sequence numbers (Matteo)

- Discard merge fix (Ming)

- Relax disk zoned reporting restrictions (Niklas)

- Bio error handling zoned leak fix (Pavel)

- Start of proper add_disk() error handling (Luis, Christoph)

- blk crypto fix (Eric)

- Non-standard GPT location support (Dmitry)

- IO priority improvements and cleanups (Damien)

- blk-throtl improvements (Chunguang)

- diskstats_show() stack reduction (Abd-Alrhman)

- Loop scheduler selection (Bart)

- Switch block layer to use kmap_local_page() (Christoph)

- Remove obsolete disk_name helper (Christoph)

- block_device refcounting improvements (Christoph)

- Ensure gendisk always has a request queue reference (Christoph)

- Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

* tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
sg: pass the device name to blk_trace_setup
block, bfq: cleanup the repeated declaration
blk-crypto: fix check for too-large dun_bytes
blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
block: mark blkdev_fsync static
block: refine the disk_live check in del_gendisk
mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
mmc: block: Support alternative_gpt_sector() operation
partitions/efi: Support non-standard GPT location
block: Add alternative_gpt_sector() operation
bio: fix page leak bio_add_hw_page failure
block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
block: remove a pointless call to MINOR() in device_add_disk
null_blk: add error handling support for add_disk()
virtio_blk: add error handling support for add_disk()
block: add error handling for device_add_disk / add_disk
block: return errors from disk_alloc_events
block: return errors from blk_integrity_add
block: call blk_register_queue earlier in device_add_disk
...


7d6e3fa8 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
"Updates to the interrupt core and driver subsystems:

Core changes:

- The usual set of small fixes and improvements all over the place,
but nothing stands out

MSI changes:

- Further consolidation of the PCI/MSI interrupt chip code

- Make MSI sysfs code independent of PCI/MSI and expose the MSI
interrupts of platform devices in the same way as PCI exposes them.

Driver changes:

- Support for ARM GICv3 EPPI partitions

- Treewide conversion to generic_handle_domain_irq() for all chained
interrupt controllers

- Conversion to bitmap_zalloc() throughout the irq chip drivers

- The usual set of small fixes and improvements"

* tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
platform-msi: Add ABI to show msi_irqs of platform devices
genirq/msi: Move MSI sysfs handling from PCI to MSI core
genirq/cpuhotplug: Demote debug printk to KERN_DEBUG
irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy
irqdomain: Export irq_domain_disconnect_hierarchy()
irqchip/gic-v3: Fix priority comparison when non-secure priorities are used
irqchip/apple-aic: Fix irq_disable from within irq handlers
pinctrl/rockchip: drop the gpio related codes
gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type
gpio/rockchip: support next version gpio controller
gpio/rockchip: use struct rockchip_gpio_regs for gpio controller
gpio/rockchip: add driver for rockchip gpio
dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank
pinctrl/rockchip: add pinctrl device to gpio bank struct
pinctrl/rockchip: separate struct rockchip_pin_bank to a head file
pinctrl/rockchip: always enable clock for gpio controller
genirq: Fix kernel doc indentation
EDAC/altera: Convert to generic_handle_domain_irq()
powerpc: Bulk conversion to generic_handle_domain_irq()
nios2: Bulk conversion to generic_handle_domain_irq()
...


64b4fc45 27-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Revert the mq-deadline priority handling, it's causing serious
performance regressions. While experimental patches exist to fix
this up, it's too late to do so now. Revert it and re-do it properly
for 5.15 instead.

- Fix a NULL vs IS_ERR() regression in this release (Dan)

- Fix a mq-deadline accounting regression in this release (Bart)

- Mark cryptoloop as deprecated. It's broken and dm-crypt fully
supports it, and it's actively interfering with loop. Plan on removal
for 5.16 (Christoph)

* tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block:
cryptoloop: add a deprecation warning
pd: fix a NULL vs IS_ERR() check
Revert "block/mq-deadline: Prioritize high-priority requests"
mq-deadline: Fix request accounting


7b05bf77 26-Aug-2021 Jens Axboe <axboe@kernel.dk>

Revert "block/mq-deadline: Prioritize high-priority requests"

This reverts commit fb926032b3209300f9dc454a36b8299582ae545c.

Zhen reports that this commit slows down mq-deadline on a 128 thread
box, going from 258K IOPS to 170-180K. My testing shows that Optane
gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box.

Looking in detail at the code, the main culprit here is needing to sum
percpu counters in the dispatch hot path, leading to very high CPU
utilization there. To make matters worse, the code currently needs to
sum 2 percpu counters, and it does so in the most naive way of iterating
possible CPUs _twice_.

Since we're close to release, revert this commit and we can re-do it
with regular per-priority counters instead for the 5.15 kernel.

Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/
Reported-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1e294970 25-Aug-2021 Shaokun Zhang <zhangshaokun@hisilicon.com>

block, bfq: cleanup the repeated declaration

Function 'bfq_entity_to_bfqq' is declared twice, so remove the
repeated declaration and blank line.

Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Link: https://lore.kernel.org/r/1629872391-46399-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cc40b722 24-Aug-2021 Eric Biggers <ebiggers@google.com>

blk-crypto: fix check for too-large dun_bytes

dun_bytes needs to be less than or equal to the IV size of the
encryption mode, not just less than or equal to BLK_CRYPTO_MAX_IV_SIZE.

Currently this doesn't matter since blk_crypto_init_key() is never
actually passed invalid values, but we might as well fix it.
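
The stricter check amounts to something like the sketch below (the mode
lookup via blk_crypto_modes[] and the ivsize field follow the existing
blk-crypto code, but treat the snippet as illustrative):

	const struct blk_crypto_mode *mode = &blk_crypto_modes[crypto_mode];

	if (dun_bytes == 0 || dun_bytes > mode->ivsize)
		return -EINVAL;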

Fixes: a892c8d52c02 ("block: Inline encryption support for blk-mq")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210825055918.51975-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b6d2b054 24-Aug-2021 Bart Van Assche <bvanassche@acm.org>

mq-deadline: Fix request accounting

The block layer may call the I/O scheduler .finish_request() callback
without having called the .insert_requests() callback. Make sure that the
mq-deadline I/O statistics are correct if the block layer inserts an I/O
request that bypasses the I/O scheduler. This patch prevents lower-priority
I/O from being delayed longer than necessary for mixed I/O priority
workloads.
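
Conceptually, the completion path now skips requests the scheduler never
saw (sketch; the rq->elv.priv[0] marker used here is an assumption based
on how elevators commonly tag inserted requests):

static void dd_finish_request(struct request *rq)
{
	/* the request bypassed the I/O scheduler: nothing to account */
	if (!rq->elv.priv[0])
		return;
	...
}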

Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Tested-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d643b66 11-Aug-2021 Niklas Cassel <niklas.cassel@wdc.com>

blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN

A user space process should not need the CAP_SYS_ADMIN capability set
in order to perform a BLKREPORTZONE ioctl.

Getting the zone report is required in order to get the write pointer.
Neither read() nor write() requires CAP_SYS_ADMIN, so it is reasonable
that a user space process that can read/write from/to the device, also
can get the write pointer. (Since e.g. writes have to be at the write
pointer.)

Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: stable@vger.kernel.org # v4.10+
Link: https://lore.kernel.org/r/20210811110505.29649-3-Niklas.Cassel@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ead3b768 11-Aug-2021 Niklas Cassel <niklas.cassel@wdc.com>

blk-zoned: allow zone management send operations without CAP_SYS_ADMIN

Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE
and BLKFINISHZONE) should be allowed under the same permissions as write().
(write() does not require CAP_SYS_ADMIN).

Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if
the fd was successfully opened with FMODE_WRITE.
(They do not require CAP_SYS_ADMIN).

Currently, zone management send operations require both CAP_SYS_ADMIN
and that the fd was successfully opened with FMODE_WRITE.

Remove the CAP_SYS_ADMIN requirement, so that zone management send
operations match the access control requirement of write(), BLKSECDISCARD
and BLKZEROOUT.
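
In sketch form, the capability test goes away and only the write-access
check remains (illustrative):

int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
			   unsigned int cmd, unsigned long arg)
{
	...
	/* previously also: if (!capable(CAP_SYS_ADMIN)) return -EACCES; */
	if (!(mode & FMODE_WRITE))
		return -EBADF;
	...
}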

Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: stable@vger.kernel.org # v4.10+
Link: https://lore.kernel.org/r/20210811110505.29649-2-Niklas.Cassel@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9f286992 24-Aug-2021 Christoph Hellwig <hch@lst.de>

block: refine the disk_live check in del_gendisk

Hidden gendisks will never be marked live.

Fixes: 40b3a52ffc5b ("block: add a sanity check for a live disk in del_gendisk")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824144310.1487816-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

466d9c49 19-Aug-2021 Dmitry Osipenko <digetx@gmail.com>

partitions/efi: Support non-standard GPT location

Support looking up GPT at a non-standard location specified by a block
device driver.

Acked-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210820004536.15791-3-digetx@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d9cf3bd5 19-Jul-2021 Pavel Begunkov <asml.silence@gmail.com>

bio: fix page leak bio_add_hw_page failure

__bio_iov_append_get_pages() doesn't put the pages that were not appended
on bio_add_hw_page() failure, so it potentially leaks them; fix it. Also, do
the same for __bio_iov_iter_get_pages(), even though it looks like it
can't be triggered by userspace in this case.
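
The fix, in sketch form, releases the pages that were pinned but never
attached when bio_add_hw_page() fails (helper name and exact accounting
are illustrative):

static void bio_put_pages(struct page **pages, size_t size, size_t off)
{
	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);

	for (i = 0; i < nr; i++)
		put_page(pages[i]);
}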

Fixes: 0512a75b98f8 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1edfa6a2ffd66d55e6345a477df5387d2c1415d0.1626653825.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c4b2b7d1 24-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT

This might have been a neat debug aid when the extended dev_t was
added, but that time is long gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824075216.1179406-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

539711d7 24-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove a pointless call to MINOR() in device_add_disk

blk_alloc_ext_minor already returns just a minor number, so no need to
mask the high bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824075216.1179406-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3d5b3fbe 13-Aug-2021 Jens Axboe <axboe@kernel.dk>

bio: improve kerneldoc documentation for bio_alloc_kiocb()

We're missing a description for the 'nr_vecs' parameter. While in there,
clarify that freeing a bio allocated through this function must be done
from process context.

Fixes: 1cbbd31c4ada ("bio: add allocation cache abstraction")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

270a1c91 12-Aug-2021 Jens Axboe <axboe@kernel.dk>

block: provide bio_clear_hipri() helper

Any case that turns off REQ_HIPRI must also clear BIO_PERCPU_CACHE,
as non-polled IO may complete through hard/soft IRQ and hence isn't
safe for our polled bio alloc cache.

Provide a helper that does just that, and use it in the merging code as
well if we split a bio and turn off polling.

Fixes: be863b9e4348 ("block: clear BIO_PERCPU_CACHE flag if polling isn't supported")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

be863b9e 11-Aug-2021 Jens Axboe <axboe@kernel.dk>

block: clear BIO_PERCPU_CACHE flag if polling isn't supported

The bio alloc cache relies on the fact that a polled bio will complete
in process context; clear the cacheable flag if we disable polling
for a given bio.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

be4d234d 08-Mar-2021 Jens Axboe <axboe@kernel.dk>

bio: add allocation cache abstraction

Add a per-cpu bio_set cache for bio allocations, enabling us to quickly
recycle them instead of going through the slab allocator. This cache
isn't IRQ safe, and hence is only really suitable for polled IO.

Very simple: keep a count of bios in the cache, and maintain a max
of 512 with a slack of 64. If we get above max + slack, drop slack
bios from the cache.
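
In sketch form, with the limits named after the description (identifiers
and the free-side hook are assumptions):

#define ALLOC_CACHE_MAX		512
#define ALLOC_CACHE_SLACK	64

	/* on freeing a cacheable bio: */
	bio_list_add_head(&cache->free_list, bio);
	if (++cache->nr > ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK)
		bio_alloc_cache_prune(cache, ALLOC_CACHE_SLACK);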

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da521626 11-Aug-2021 Jens Axboe <axboe@kernel.dk>

bio: optimize initialization of a bio

The memset() used is measurably slower in targeted benchmarks, wasting
about 1% of the total runtime, or 50% of the (later) hot path cached
bio alloc. Get rid of it and fill in the bio manually.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

83cbce95 18-Aug-2021 Luis Chamberlain <mcgrof@kernel.org>

block: add error handling for device_add_disk / add_disk

Properly unwind on errors in device_add_disk. This is the initial work;
drivers are not converted yet, and those conversions will follow in
separate patches.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: major rebase. All bugs are probably mine]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

92e7755e 18-Aug-2021 Luis Chamberlain <mcgrof@kernel.org>

block: return errors from disk_alloc_events

Prepare for proper error handling in add_disk.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

614310c9 18-Aug-2021 Luis Chamberlain <mcgrof@kernel.org>

block: return errors from blk_integrity_add

Prepare for proper error handling in add_disk.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

75f4dca5 18-Aug-2021 Christoph Hellwig <hch@lst.de>

block: call blk_register_queue earlier in device_add_disk

Ensure that all the sysfs bits are set up before bdev_add is called,
as that will make the upcoming error handling much easier. However
this means the call to disk_update_readahead has to be split as that
requires a bdi. Also remove various sanity checks that don't make
sense now that blk_register_queue only has a single caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210818144542.19305-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bab53f6b 18-Aug-2021 Christoph Hellwig <hch@lst.de>

block: call blk_integrity_add earlier in device_add_disk

Doing all the sysfs file creation before adding the bdev, and thus
allowing it to be opened, will simplify the about-to-be-added error
handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9d5ee676 18-Aug-2021 Christoph Hellwig <hch@lst.de>

block: create the bdi link earlier in device_add_disk

This will simplify error handling going forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8235b5c1 18-Aug-2021 Christoph Hellwig <hch@lst.de>

block: call bdev_add later in device_add_disk

Once bdev_add is called userspace can open the block device. Ensure
that the struct device, which is used for refcounting of the disk
besides various other things, is fully setup at that point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

52b85909 18-Aug-2021 Christoph Hellwig <hch@lst.de>

block: fold register_disk into device_add_disk

There is no real reason these should be separate. Also simplify the
groups assignment a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210818144542.19305-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

40b3a52f 18-Aug-2021 Christoph Hellwig <hch@lst.de>

block: add a sanity check for a live disk in del_gendisk

Add a sanity check to del_gendisk to do nothing when the disk wasn't
successfully added. This papers over the complete lack of add_disk
error handling, which is about to get fixed gradually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d152c682 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: add an explicit ->disk backpointer to the request_queue

Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.

Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

61a35cfc 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: hold a request_queue reference for the lifetime of struct gendisk

Acquire the queue ref dropped in disk_release in __blk_alloc_disk so that
any allocated gendisk always has a queue reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4a1fa41d 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: pass a request_queue to __blk_alloc_disk

Pass in a request_queue and assign disk->queue in __blk_alloc_disk to
ensure struct gendisk always has a valid ->queue pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a58bd768 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove the minors argument to __alloc_disk_node

This was a leftover from the legacy alloc_disk interface. Switch
the scsi ULPs and dasd to set ->minors directly like all other
drivers and remove the argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd]
Link: https://lore.kernel.org/r/20210816131910.615153-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4dcc4874 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: cleanup the lockdep handling in *alloc_disk

Pass the lockdep name to the low-level __blk_alloc_disk helper and
hardcode the name for it given that the number of minors or node_id
are not very useful information. While this passes a pointless
argument for non-lockdep builds, that is not really an issue, as
disk allocation is a probe-time-only slow path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e83502ca 21-Jul-2021 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

block: fix argument type of bio_trim()

The function bio_trim has offset and size arguments that are declared
as int.

The callers of this function use the sector_t type when passing the offset
and size, e.g. drivers/md/raid1.c:narrow_write_error().

Change offset and size arguments to sector_t type for bio_trim(). Also,
add WARN_ON_ONCE() to catch their overflow.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

002c0aef 21-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Three fixes from Ming Lei that should go into 5.14:

- Fix for a kernel panic when iterating over tags for some cases
where a flush request is present, a regression in this cycle.

- Request timeout fix

- Fix flush request checking"

* tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
blk-mq: fix is_flush_rq
blk-mq: fix kernel panic during iterating over flush request
blk-mq: don't grab rq's refcount in blk_mq_check_expired()


759e0fd4 20-Aug-2021 Christoph Hellwig <hch@lst.de>

block: add back the bd_holder_dir reference in bd_link_disk_holder

This essentially reverts "block: remove the extra kobject reference in
bd_link_disk_holder". That commit dropped the extra reference because
the condition in the comment can't be true. But it turns out that
comment did not actually describe the problematic situation, so add
back the extra reference and document it properly.

Fixes: fbd9a39542ec ("block: remove the extra kobject reference in bd_link_disk_holder")
Reported-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e70344c0 10-Aug-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: fix default IO priority handling

The default IO priority is the best effort (BE) class with the
normal priority level IOPRIO_NORM (4). However, get_task_ioprio()
returns IOPRIO_CLASS_NONE/IOPRIO_NORM as the default priority and
get_current_ioprio() returns IOPRIO_CLASS_NONE/0. Let's be consistent
with the defined default and have both of these functions return the
default priority IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM) when
the user did not define another default IO priority for the task.

In include/uapi/linux/ioprio.h, introduce the IOPRIO_BE_NORM macro as
an alias to IOPRIO_NORM to clarify that this default level applies to
the BE priotity class. In include/linux/ioprio.h, define the macro
IOPRIO_DEFAULT as IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
and use this new macro when setting a priority to the default.
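
That is, in macro form (sketch of what the description above spells out):

/* include/uapi/linux/ioprio.h */
#define IOPRIO_BE_NORM	IOPRIO_NORM

/* include/linux/ioprio.h */
#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)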

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-7-damien.lemoal@wdc.com
[axboe: drop unnecessary lightnvm change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

202bc942 10-Aug-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: Introduce IOPRIO_NR_LEVELS

The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_BE). This is
controlled using the IOPRIO_BE_NR macro, which is badly named as the
number of levels also applies to the RT class.

Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a680dd72 10-Aug-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: bfq: fix bfq_set_next_ioprio_data()

For a request that has a priority level equal to or larger than
IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but
defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This
is not consistent with the warning and the allowed values for priority
levels. Fix this by setting the request new_ioprio field to
IOPRIO_BE_NR - 1, the lowest priority level allowed.
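
Sketch of the clamping described above (message text and surrounding code
are illustrative):

	if (bfqq->new_ioprio >= IOPRIO_BE_NR) {
		pr_crit("bfq_set_next_ioprio_data: invalid ioprio %d\n",
			bfqq->new_ioprio);
		bfqq->new_ioprio = IOPRIO_BE_NR - 1;	/* was IOPRIO_BE_NR */
	}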

Cc: <stable@vger.kernel.org>
Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a9ed27a7 17-Aug-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: fix is_flush_rq

is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the
following check:

hctx->fq->flush_rq == req

but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because:

1) memory re-order in blk_mq_rq_ctx_init():

rq->mq_hctx = data->hctx;
...
refcount_set(&rq->ref, 1);

OR

2) tag re-use and ->rqs[] isn't updated with new request.

Fix the issue by re-writing is_flush_rq() as:

return rq->end_io == flush_end_io;

which turns out simpler to follow and immune to data race since we have
ordered WRITE rq->end_io and refcount_set(&rq->ref, 1).

Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Cc: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2da19ed 11-Aug-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: fix kernel panic during iterating over flush request

For fixing use-after-free during iterating over requests, we grabbed
request's refcount before calling ->fn in commit 2e315dc07df0 ("blk-mq:
grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter").
Turns out this way may cause kernel panic when iterating over one flush
request:

1) old flush request's tag is just released, and this tag is reused by
one new request, but ->rqs[] isn't updated yet

2) the flush request can be re-used for submitting one new flush command,
so blk_rq_init() is called at the same time

3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request
is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called,
flush_rq->end_io may not be updated yet, so NULL pointer dereference
is triggered in blk_mq_put_rq_ref().

Fix the issue by calling refcount_set(&flush_rq->ref, 1) after
flush_rq->end_io is set. So far the only other caller of blk_rq_init() is
scsi_ioctl_reset() in which the request doesn't enter block IO stack and
the request reference count isn't used, so the change is safe.
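
The resulting ordering, in sketch form (barrier choice illustrative; the
point is that ->end_io is published before the reference count):

	flush_rq->end_io = flush_end_io;
	/* order the ->end_io store before making the request visible */
	smp_wmb();
	refcount_set(&flush_rq->ref, 1);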

Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Reported-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Tested-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c797b40c 11-Aug-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: don't grab rq's refcount in blk_mq_check_expired()

Inside blk_mq_queue_tag_busy_iter() we already grabbed the request's
refcount before calling ->fn(), so there is no need to grab it one more
time in blk_mq_check_expired().

Meantime remove extra request expire check in blk_mq_check_expired().

Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

69f87cc7 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: unexport blk_register_queue

Not actually used in any modular code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816123649.601591-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

252c651a 10-Aug-2021 Christoph Hellwig <hch@lst.de>

blk-cgroup: stop using seq_get_buf

seq_get_buf is a crutch that undoes all the memory safety of the
seq_file interface. Use the normal seq_printf interfaces instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210810152623.1796144-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

49cb5168 10-Aug-2021 Christoph Hellwig <hch@lst.de>

blk-cgroup: refactor blkcg_print_stat

Factor out a helper to deal with a single blkcg_gq to make the code a
little bit easier to follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210810152623.1796144-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b93ef453 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: use bvec_virt in bio_integrity_{process,free}

Use the bvec_virt helper to clean up the bio integrity processing a
little bit.
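
For reference, bvec_virt() amounts to roughly the following (sketch):

static inline void *bvec_virt(struct bio_vec *bvec)
{
	return page_address(bvec->bv_page) + bvec->bv_offset;
}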

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@kernel.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210804095634.460779-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

889c05cc 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: ensure the bdi is freed after inode_detach_wb

inode_detach_wb references the "main" bdi of the inode. With the
recent change to move the bdi from the request_queue to the gendisk
this causes a guaranteed use after free when using certain cgroup
configurations. The bug itself is older though, as any non-default
inode reference (e.g. an open file descriptor) could have injected
this use after free even before that.

Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Reported-by: syzbot <syzbot+1fb38bb7d3ce0fa3e1c4@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816122614.601358-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9451aa0a 16-Aug-2021 Christoph Hellwig <hch@lst.de>

block: free the extended dev_t minor later

The dev_t is used as the inode hash, so we should only release it
once the block device inode is gone from the inode cache. Move it
to bdev_free_inode to ensure that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816122614.601358-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f1e9630 01-Aug-2021 Chunguang Xu <brookxu@tencent.com>

blk-throtl: optimize IOPS throttle for large IO scenarios

After patch 54efd50 (block: make generic_make_request handle
arbitrarily sized bios), the IO passing through io-throttle may be larger,
and these IOs may later be split into more, smaller IOs. However, the
IOPS throttle does not seem to be aware of this change, which makes the
accounting of IOPS for large IOs incomplete, resulting in disk-side IOPS
that does not meet expectations. Fix this problem.

We can reproduce it by setting max_sectors_kb of the disk to 128, setting
blkio.write_iops_throttle to 100, running a dd instance inside blkio and
using iostat to watch IOPS:

dd if=/dev/zero of=/dev/sdb bs=1M count=1000 oflag=direct

As a result, without this change the average IOPS is 1995; with
this change it is 98.
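
For reference, a rough C equivalent of the dd invocation above (a sketch
only; it mirrors the 1 MiB direct writes to /dev/sdb from the reproduction
steps and is destructive to that device):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t bs = 1024 * 1024;			/* bs=1M */
	void *buf;
	int fd = open("/dev/sdb", O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, bs)) {
		perror("setup");
		return 1;
	}
	memset(buf, 0, bs);				/* if=/dev/zero */
	for (int i = 0; i < 1000; i++)			/* count=1000 */
		if (write(fd, buf, bs) != (ssize_t)bs) {
			perror("write");
			break;
		}
	close(fd);
	return 0;
}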

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/65869aaad05475797d63b4c3fed4f529febe3c26.1627876014.git.brookxu@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

020efdad 13-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few fixes for block that should go into 5.14:

- Revert the mq-deadline cgroup addition. More work is needed on this
front, let's revert it for now and get it right before having it in
a released kernel (Tejun)

- blk-iocost lockdep fix (Ming)

- nbd double completion fix (Xie)

- Fix for non-idling when clearing the shared tag flag (Yu)"

* tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
nbd: Aovid double completion of a request
blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
Revert "block/mq-deadline: Add cgroup support"
blk-iocost: fix lockdep warning on blkcg->lock


454bb677 31-Jul-2021 Yu Kuai <yukuai3@huawei.com>

blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED

We run a test that deletes and recovers devices frequently (two devices
on the same host), and we found that 'active_queues' grows very large
after a period of time.

If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
is only one queue that is using the tag set. However, if b is still
active, the active_queues of b might never be cleared even if b is
deleted.

Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3d2e7989 10-Aug-2021 Christoph Hellwig <hch@lst.de>

block: pass a gendisk to bdev_resize_partition

bdev_resize_partition can only operate on the whole device. Make that clear
by passing a gendisk instead of a block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210810154512.1809898-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

926fbb16 10-Aug-2021 Christoph Hellwig <hch@lst.de>

block: pass a gendisk to bdev_del_partition

bdev_del_partition can only operate on the whole device. Make that clear
by passing a gendisk instead of a block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210810154512.1809898-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7f6be376 10-Aug-2021 Christoph Hellwig <hch@lst.de>

block: pass a gendisk to bdev_add_partition

bdev_add_partition can only operate on the whole device. Make that clear
by passing a gendisk instead of a block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210810154512.1809898-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a08aa9bc 10-Aug-2021 Christoph Hellwig <hch@lst.de>

block: store a gendisk in struct parsed_partitions

Partition scanning only happens on the whole device, so pass a
struct gendisk instead of the whole device block_device to the scanners.
This allows simplifying the device name printing in various places, as
the disk name is available in disk->name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20210810154512.1809898-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50b4aecf 09-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove GENHD_FL_UP

Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0f783995 11-Aug-2021 Tejun Heo <tj@kernel.org>

Revert "block/mq-deadline: Add cgroup support"

This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
WARN_ON_ONCE() call"). The added cgroup support has the following issues:

* It breaks cgroup interface file format rule by adding custom elements to a
nested key-value file.

* It registers mq-deadline as a cgroup-aware policy even though all it's
doing is collecting per-cgroup stats. Even if we need these stats, this
isn't the right way to add them.

* It hasn't been reviewed from cgroup side.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91cc470e 02-Jun-2021 Tanner Love <tannerlove@google.com>

genirq: Change force_irqthreads to a static key

With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
could incur a cache line miss in invoke_softirq() and other places.

Replace the test with a static key to avoid the potential cache miss.
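
The pattern is roughly the one sketched below (illustrative fragment;
do_threaded_path() is a stand-in for the softirq handling, only the key
name follows the commit):

/* before: a plain bool, read on every invocation */
if (force_irqthreads)
	do_threaded_path();

/* after: a static key patches the branch in place, so the common case
 * is a straight-line nop instead of a potentially cache-missing load */
DECLARE_STATIC_KEY_FALSE(force_irqthreads_key);

if (static_branch_unlikely(&force_irqthreads_key))
	do_threaded_path();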

[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tanner Love <tannerlove@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com

11431e26 03-Aug-2021 Ming Lei <ming.lei@redhat.com>

blk-iocost: fix lockdep warning on blkcg->lock

blkcg->lock depends on q->queue_lock which may depend on another driver
lock required in irq context, one example is dm-thin:

Chain exists of:
&pool->lock#3 --> &q->queue_lock --> &blkcg->lock

Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&blkcg->lock);
                               local_irq_disable();
                               lock(&pool->lock#3);
                               lock(&q->queue_lock);
  <Interrupt>
    lock(&pool->lock#3);

Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write().
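
In other words, the weight-update path now takes the lock with interrupts
disabled, roughly as below (illustrative fragment, not the actual
ioc_weight_write() body):

spin_lock_irq(&blkcg->lock);	/* irqs off: breaks the unsafe chain above */
/* ... update the per-cgroup weight ... */
spin_unlock_irq(&blkcg->lock);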

Cc: Tejun Heo <tj@kernel.org>
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9a73fa37 09-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
"One commit to fix a possible A-A deadlock around u64_stats_sync on
32bit machines caused by updating it without disabling IRQ when it may
be read from IRQ context"

* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync


866663b7 28-Jul-2021 Ming Lei <ming.lei@redhat.com>

block: return ELEVATOR_DISCARD_MERGE if possible

When merging a bio into a request, if both are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because neither the block core nor the related drivers (nvme, virtio-blk)
handle mixed discard IO merging (traditional IO merge together with
discard merge) well.

Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb20947 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a11d7fc2 09-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove the bd_bdi in struct block_device

Just retrieve the bdi from the disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

edb0872f 09-Aug-2021 Christoph Hellwig <hch@lst.de>

block: move the bdi from the request_queue to the gendisk

The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

471aa704 09-Aug-2021 Christoph Hellwig <hch@lst.de>

block: pass a gendisk to blk_queue_update_readahead

.. and rename the function to disk_update_readahead. This is in
preparation for moving the BDI from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ed964f8 09-Aug-2021 Christoph Hellwig <hch@lst.de>

mm: hide laptop_mode_wb_timer entirely behind the BDI API

Don't leak the details of the timer into the block layer, instead
initialize the timer in bdi_alloc and delete it in bdi_unregister.
Note that this means the timer is initialized (but not armed) for
non-block queues as well now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d1254a87 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove support for delayed queue registrations

Now that device mapper has been changed to register the disk once
it is fully ready all this code is unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d6263387 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: support delayed holder registration

Device mapper needs to register holders before it is ready to do I/O.
Currently it does so by registering the disk early, which can leave
the disk and queue in a weird half state where the queue is registered
with the disk, except for sysfs and the elevator. And this state has
been a bit problematic before, and will get more so when sorting out
the responsibilities between the queue and the disk.

Support registering holders on an initialized but not registered disk
instead by delaying the sysfs registration until the disk is registered.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0dbcfe24 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: look up holders by bdev

Invert the way the holder relations are tracked. This very
slightly reduces the memory overhead for partitioned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210804094147.459763-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fbd9a395 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove the extra kobject reference in bd_link_disk_holder

Since commit 0d02129e76ed ("block: merge struct block_device and struct
hd_struct") there is no way for the bdev to go away as long as there is
a holder, so remove the extra references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c66fd019 04-Aug-2021 Christoph Hellwig <hch@lst.de>

block: make the block holder code optional

Move the block holder code into a separate file as it is not in any way
related to the other block_dev.c code, and add a new selectable config
option for it so that it does not need to be built when no remapped
drivers are selected.

The Kconfig symbol contains a _DEPRECATED suffix to match the comments
added in commit 49731baa41df
("block: restore multiple bd_link_disk_holder() support").

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6bbf5914 07-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few minor fixes:

- Fix ldm kernel-doc warning (Bart)

- Fix adding offset twice for DMA address in n64cart (Christoph)

- Fix use-after-free in dasd path handling (Stefan)

- Order kyber insert trace correctly (Vincent)

- raid1 errored write handling fix (Wei)

- Fix blk-iolatency queue get failure handling (Yu)"

* tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block:
kyber: make trace_block_rq call consistent with documentation
block/partitions/ldm.c: Fix a kernel-doc warning
blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit()
n64cart: fix the dma address in n64cart_do_bvec
s390/dasd: fix use after free in dasd path handling
md/raid10: properly indicate failure when ending a failed write request


fb7b9b02 04-Aug-2021 Vincent Fu <vincent.fu@samsung.com>

kyber: make trace_block_rq call consistent with documentation

The kyber ioscheduler calls trace_block_rq_insert() *after* the request
is added to the queue but the documentation for trace_block_rq_insert()
says that the call should be made *before* the request is added to the
queue. Move the tracepoint for the kyber ioscheduler so that it is
consistent with the documentation.

Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Link: https://lore.kernel.org/r/20210804194913.10497-1-vincent.fu@samsung.com
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90b71980 05-Aug-2021 Bart Van Assche <bvanassche@acm.org>

blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flag

elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues across
multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.

This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.
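
A sketch of the resulting selection rule (illustrative only; the flags are
from this commit and the existing behaviour described above, while the
helper name elevator_lookup_default() is hypothetical):

/* called from add_disk(): pick the default I/O scheduler for a queue */
if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
	return NULL;				/* driver opted out: "none" */
if (q->nr_hw_queues == 1 ||
    (q->tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED))
	return elevator_lookup_default(q, "mq-deadline");	/* hypothetical */
return NULL;					/* independent multi-queue: "none" */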

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e9fb2c1 05-Aug-2021 Bart Van Assche <bvanassche@acm.org>

block/partitions/ldm.c: Fix a kernel-doc warning

Fix the following kernel-doc warning that appears when building with W=1:

block/partitions/ldm.c:31: warning: expecting prototype for ldm().
Prototype was for ldm_debug() instead

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805173447.3249906-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d75d0ef 05-Aug-2021 Yu Kuai <yukuai3@huawei.com>

blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit()

If queue is dying while iolatency_set_limit() is in progress,
blk_get_queue() won't increment the refcount of the queue. However,
blk_put_queue() will still decrement the refcount later, which will
cause the refcount to be unbalanced.

Thus error out in such case to fix the problem.
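
The shape of the fix is roughly the following (a sketch; error propagation
and surrounding context omitted):

if (!blk_get_queue(blkg->q))	/* queue is dying: no reference was taken */
	return -ENODEV;
/* ... apply the new latency limit ... */
blk_put_queue(blkg->q);		/* balanced with the successful get above */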

Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2bc1f6e4 13-Jul-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: remove blk-mq-sysfs dead code

In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to
define any attribute since the "mq" sysfs directory contains only
sub-directories (no attribute files). As a result, blk_mq_sysfs_show(),
blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all
unused and unnecessary. Remove all this unused code.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e6138dc1 12-Jul-2021 Matteo Croce <mcroce@microsoft.com>

block: add a helper to raise a media changed event

Refactor disk_check_events() and move some code into disk_event_uevent().
Then add disk_force_media_change(), a helper which will be used by
devices to force issuing a DISK_EVENT_MEDIA_CHANGE event.

Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-6-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13927b31 12-Jul-2021 Matteo Croce <mcroce@microsoft.com>

block: export diskseq in sysfs

Add a new sysfs handle to export the new diskseq value.
Place it in <sysfs>/block/<disk>/diskseq and document it.

$ grep . /sys/class/block/*/diskseq
/sys/class/block/loop0/diskseq:13
/sys/class/block/loop1/diskseq:14
/sys/class/block/loop2/diskseq:5
/sys/class/block/loop3/diskseq:6
/sys/class/block/ram0/diskseq:1
/sys/class/block/ram1/diskseq:2
/sys/class/block/vda/diskseq:7

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-5-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7957d93b 12-Jul-2021 Matteo Croce <mcroce@microsoft.com>

block: add ioctl to read the disk sequence number

Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number
from the genhd structure.

# ./getdiskseq /dev/loop*
/dev/loop0: 13
/dev/loop0p1: 13
/dev/loop0p2: 13
/dev/loop0p3: 13
/dev/loop1: 14
/dev/loop1p1: 14
/dev/loop1p2: 14
/dev/loop2: 5
/dev/loop3: 6

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-4-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87eb7107 12-Jul-2021 Matteo Croce <mcroce@microsoft.com>

block: export the diskseq in uevents

Export the newly introduced diskseq in uevents:

$ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ
E: DEVNAME=/dev/loop0
E: DISKSEQ=1
E: DEVNAME=/dev/loop1
E: DISKSEQ=2
E: DEVNAME=/dev/loop2
E: DISKSEQ=3
E: DEVNAME=/dev/loop3
E: DISKSEQ=4
E: DEVNAME=/dev/loop4
E: DISKSEQ=5
E: DEVNAME=/dev/loop5
E: DISKSEQ=6
E: DEVNAME=/dev/loop6
E: DISKSEQ=7
E: DEVNAME=/dev/loop7
E: DISKSEQ=8
E: DEVNAME=/dev/nvme0n1
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p1
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p2
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p3
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p4
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p5
E: DISKSEQ=9
E: DEVNAME=/dev/sda
E: DISKSEQ=10
E: DEVNAME=/dev/sda1
E: DISKSEQ=10
E: DEVNAME=/dev/sda2
E: DISKSEQ=10

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-3-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cf179948 12-Jul-2021 Matteo Croce <mcroce@microsoft.com>

block: add disk sequence number

Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace; any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.

Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow deriving orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.

Associating a unique, monotonically increasing sequence number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, makes it possible to solve the race
conditions with uevents, and also lets userspace processes know whether
they should wait for the uevent they need or whether it was dropped and
they should move on.

Additionally, increment the disk sequence number when the media changes,
i.e. on a DISK_EVENT_MEDIA_CHANGE event.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2164877c 27-Jul-2021 Christoph Hellwig <hch@lst.de>

block: remove cmdline-parser.c

cmdline-parser.c is only used by the cmdline faux partition format,
so merge the code into that and avoid an indirect call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210728053756.409654-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

abd2864a 27-Jul-2021 Christoph Hellwig <hch@lst.de>

block: remove disk_name()

Remove the disk_name function now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1d703547 27-Jul-2021 Christoph Hellwig <hch@lst.de>

block: simplify disk name formatting in check_partition

disk_name for partition 0 just copies out the disk_name field. Replace
the call to disk_name with a %s format specifier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

453b8ab6 27-Jul-2021 Christoph Hellwig <hch@lst.de>

block: simplify printing the device names disk_stack_limits

Printk ->disk_name directly for the disk and use the %pg format specifier
for the block device, which is equivalent to a bdevname call.
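
As a sketch, a message that previously needed a bdevname() call into an
on-stack buffer can now be written as below (illustrative, not the exact
disk_stack_limits() message):

char b[BDEVNAME_SIZE];
pr_notice("%s: misaligned against %s\n", disk->disk_name, bdevname(bdev, b));

/* becomes */
pr_notice("%s: misaligned against %pg\n", disk->disk_name, bdev);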

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a291bb43 27-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use the %pg format specifier in show_partition

Simplify printing the partition name by using the %pg format specifier
that is equivalent to a bdevname call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a9e7bc3d 27-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use the %pg format specifier in printk_all_partitions

Simplify printing the partition name by using the %pg format specifier
that is equivalent to a bdevname call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

26e2d7a3 27-Jul-2021 Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>

block: reduce stack usage in diskstats_show

I have compiled the kernel with the cross compiler "hppa-linux-gnu-" v9.3.0
on an x86-64 host machine and got the following warning:

block/genhd.c: In function ‘diskstats_show’:
block/genhd.c:1227:1: warning: the frame size of 1688 bytes is larger
than 1280 bytes [-Wframe-larger-than=]
1227 | }

Reduce the stack footprint by using the %pg printk specifier instead
of disk_name, removing the need for the on-stack buffer.

Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2f4731dc 22-Jul-2021 Christoph Hellwig <hch@lst.de>

block: remove bdput

Now that we've stopped using inode references for anything meaningful
in the block layer, get rid of the helper that puts them and just open
code the call to iput on the block_device inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Link: https://lore.kernel.org/r/20210722075402.983367-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9d3b8813 22-Jul-2021 Christoph Hellwig <hch@lst.de>

block: change the refcounting for partitions

Instead of acquiring an inode reference on open, make sure partitions
always hold device model references to the disk while alive, and switch
open to grab only a device model reference to the opened block device.
If that is a partition the disk reference is transitively held by the
partition already.

Link: https://lore.kernel.org/r/20210722075402.983367-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0468c532 22-Jul-2021 Christoph Hellwig <hch@lst.de>

block: allocate bd_meta_info later in add_partitions

Move the allocation of bd_meta_info after initializing the struct device
to avoid the special bdput error handling path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d7a66574 22-Jul-2021 Christoph Hellwig <hch@lst.de>

block: unhash the whole device inode earlier

Unhash the whole device inode early in del_gendisk. This allows
removing the first GENHD_FL_UP check in the open path as we simply
won't find a just removed inode. The second non-racy check after
taking open_mutex is still kept.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210722075402.983367-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a45e43ca 22-Jul-2021 Christoph Hellwig <hch@lst.de>

block: assert the locking state in delete_partition

Add a lockdep assert instead of the outdated locking comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Link: https://lore.kernel.org/r/20210722075402.983367-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

503469b5 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use bvec_kmap_local in bio_integrity_process

Using local kmaps slightly reduces the chance of stray writes, and
the bvec interface cleans up the code a little bit.
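
The pattern, roughly (an illustrative fragment; the processing step stands
in for whatever the integrity code does with the segment):

void *p = bvec_kmap_local(&bv);	/* short-lived, CPU-local mapping */
/* ... generate or verify bv.bv_len bytes of protection information at p ... */
kunmap_local(p);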

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8aec120a 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use bvec_kmap_local in t10_pi_type1_{prepare,complete}

Using local kmaps slightly reduces the chance of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4aebe859 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use memcpy_from_bvec in __blk_queue_bounce

Rewrite the actual bounce buffering loop in __blk_queue_bounce so that
the memcpy_from_bvec helper can be used to perform the data copies.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d24920e2 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use memcpy_from_bvec in bio_copy_kern_endio_read

Use memcpy_from_bvec instead of open coding the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f434cdc7 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use memcpy_to_bvec in copy_to_high_bio_irq

Use memcpy_to_bvec instead of opencoding the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f8b679a0 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec

Use the proper helpers instead of open coding the copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727055646.118787-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ab6c340e 26-Jul-2021 Christoph Hellwig <hch@lst.de>

block: use memzero_page in zero_fill_bio

Use memzero_bvec to zero each segment in the bio instead of manually
mapping and zeroing the data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210727055646.118787-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

659a3784 31-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: bsg-lib: Fix commands without data transfer in bsg_transport_sg_io_fn()

Set ret to 0 after the initial permission checks to avoid leaking -EPERM
for commands without data transfer.

Link: https://lore.kernel.org/r/20210731074027.1185545-3-hch@lst.de
Fixes: 75ca56409e5b ("scsi: bsg: Move the whole request execution into the SCSI/transport handlers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

75ca5640 29-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: bsg: Move the whole request execution into the SCSI/transport handlers

Remove the amount of indirect calls by making the handler responsible for
the entire execution of the request.

Link: https://lore.kernel.org/r/20210729064845.1044147-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

1e61c1a8 29-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: block: Remove the remaining SG_IO-related fields from struct request_queue

Move the sg_timeout and sg_reserved_size fields into the bsg_device and
scsi_device structures as they have nothing to do with generic block I/O.
Note that these values are now separate for bsg vs. SCSI device node
access, but that just matches how /dev/sg vs the other nodes has always
behaved.

Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

ead09dd3 29-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: bsg: Simplify device registration

Use the per-device cdev_device_interface to store the bsg data in the char
device inode, and thus remove the need to embed the bsg_class_device
structure in the request_queue.

Link: https://lore.kernel.org/r/20210729064845.1044147-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

4669e13c 30-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- gendisk freeing fix (Christoph)

- blk-iocost wake ordering fix (Tejun)

- tag allocation error handling fix (John)

- loop locking fix. While this isn't the prettiest fix in the world,
nobody has any good alternatives for 5.14. Something to likely
revisit for 5.15. (Tetsuo)

* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
block: delay freeing the gendisk
blk-iocost: fix operation ordering in iocg_wake_fn()
blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
loop: reintroduce global lock for safe loop_validate_file() traversal


33ff4ce4 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: core: Rename CONFIG_BLK_SCSI_REQUEST to CONFIG_SCSI_COMMON

CONFIG_BLK_SCSI_REQUEST is rather misnamed as it enables building a small
amount of code shared by the SCSI initiator, target, and consumers of the
scsi_request passthrough API. Rename it and also allow building it as a
module.

[mkp: add module license]

Link: https://lore.kernel.org/r/20210724072033.1284840-20-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

f2542a3b 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: scsi_ioctl: Move the "block layer" SCSI ioctl handling to drivers/scsi

Merge the ioctl handling in block/scsi_ioctl.c into its only caller in
drivers/scsi/scsi_ioctl.c.

Link: https://lore.kernel.org/r/20210724072033.1284840-19-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

7353dc06 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: scsi_ioctl: Simplify SCSI passthrough permission checking

Remove the separate command filter structure and just use a switch
statement (which also caught two duplicate commands), return a bool and
give the function a sensible name.

Link: https://lore.kernel.org/r/20210724072033.1284840-18-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

b69367df 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: scsi_ioctl: Move scsi_command_size_tbl to scsi_common.c

Move the SCSI command size table to common SCSI code.

Link: https://lore.kernel.org/r/20210724072033.1284840-17-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

2cece377 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: scsi_ioctl: Remove scsi_req_init()

Merge scsi_req_init() into its only caller.

Link: https://lore.kernel.org/r/20210724072033.1284840-16-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

78011042 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: bsg: Move bsg_scsi_ops to drivers/scsi/

Move the SCSI-specific bsg code into the SCSI midlayer instead of
keeping it in the common bsg code. This keeps the common bsg code in
block/ and also allows building it as a module.

Link: https://lore.kernel.org/r/20210724072033.1284840-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

d52fe8f4 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: bsg: Decouple from scsi_cmd_ioctl()

Decouple bsg from scsi_cmd_ioctl(). This requires a small amount of code
duplication, but will allow moving all SCSI ioctl handling into the
SCSI midlayer.

Link: https://lore.kernel.org/r/20210724072033.1284840-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

547e2f70 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: block: Add a queue_max_bytes() helper

Return the max_sectors value in bytes. Lifted from scsi_ioctl.c.

Link: https://lore.kernel.org/r/20210724072033.1284840-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

4f07bfc5 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: scsi_ioctl: Remove scsi_verify_blk_ioctl()

Manually verify that the device is not a partition and the caller has admin
privileges at the beginning of the sr ioctl method and open code the
trivial check for sd as well.

Link: https://lore.kernel.org/r/20210724072033.1284840-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

fb1ba406 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: scsi_ioctl: Remove scsi_cmd_blk_ioctl()

Open code scsi_cmd_blk_ioctl() in its two callers.

Link: https://lore.kernel.org/r/20210724072033.1284840-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

beec64d0 24-Jul-2021 Christoph Hellwig <hch@lst.de>

scsi: bsg: Remove support for SCSI_IOCTL_SEND_COMMAND

SCSI_IOCTL_SEND_COMMAND has been deprecated for longer than bsg has
existed and has been warning for just as long. More importantly, it
hardcodes SCSI CDBs and thus will do the wrong thing on non-SCSI bsg nodes.

Link: https://lore.kernel.org/r/20210724072033.1284840-2-hch@lst.de
Fixes: aa387cc89567 ("block: add bsg helper library")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

340e8457 22-Jul-2021 Christoph Hellwig <hch@lst.de>

block: delay freeing the gendisk

blkdev_get_no_open acquires a reference to the block_device through
the block device inode and then tries to acquire a device model
reference to the gendisk. But at this point the disk might already
be freed (although the race is rare). Fix this by freeing the
gendisk only from the whole device bdev's ->free_inode callback.

Fixes: 22ae8ce8b892 ("block: simplify bdev/disk lookup in blkdev_get")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ab189cf 27-Jul-2021 Tejun Heo <tj@kernel.org>

blk-iocost: fix operation ordering in iocg_wake_fn()

iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it
wants the wq_entry to be always removed whether it ended up waking the
task or not. finish_wait() tests whether wq_entry needs removal without
grabbing the wait_queue lock and expects the waker to use
list_del_init_careful() after all waking operations are complete, which
iocg_wake_fn() didn't do. The operation order was wrong and the regular
list_del_init() was used.

The result is that if a waiter wakes up racing the waker, it can pop the
wq_entry off its stack and free it while the waker is still looking at
it, which can lead to a backtrace like the following.

[7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP
...
[7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0
...
[7312084.858314] Call Trace:
[7312084.863548] _raw_spin_lock_irqsave+0x22/0x30
[7312084.872605] try_to_wake_up+0x4c/0x4f0
[7312084.880444] iocg_wake_fn+0x71/0x80
[7312084.887763] __wake_up_common+0x71/0x140
[7312084.895951] iocg_kick_waitq+0xe8/0x2b0
[7312084.903964] ioc_rqos_throttle+0x275/0x650
[7312084.922423] __rq_qos_throttle+0x20/0x30
[7312084.930608] blk_mq_make_request+0x120/0x650
[7312084.939490] generic_make_request+0xca/0x310
[7312084.957600] submit_bio+0x173/0x200
[7312084.981806] swap_readpage+0x15c/0x240
[7312084.989646] read_swap_cache_async+0x58/0x60
[7312084.998527] swap_cluster_readahead+0x201/0x320
[7312085.023432] swapin_readahead+0x2df/0x450
[7312085.040672] do_swap_page+0x52f/0x820
[7312085.058259] handle_mm_fault+0xa16/0x1420
[7312085.066620] do_page_fault+0x2c6/0x5c0
[7312085.074459] page_fault+0x2f/0x40

Fix it by switching to list_del_init_careful() and putting it at the end.
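
Schematically, a custom wake function that hands an on-stack wq_entry back
to its owner has to end with the careful unlink, as in the sketch below
(my_wake_fn is a stand-in, not the actual iocg_wake_fn()):

static int my_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
		      int flags, void *key)
{
	int ret = default_wake_function(wq_entry, mode, flags, key);

	/* Must be the very last access to wq_entry: once the waiter observes
	 * the unlinked entry in finish_wait(), it may free its stack frame. */
	list_del_init_careful(&wq_entry->entry);
	return ret;
}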

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c3df5fb5 27-Jul-2021 Tejun Heo <tj@kernel.org>

cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync

0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
context. However, rstat paths use u64_stats_sync to synchronize access to
64bit stat counters on 32bit machines. u64_stats_sync is implemented using
seq_lock and trying to read from an irq context can lead to A-A deadlock if
the irq happens to interrupt the stat update.

Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
u64_stats_update_end_irqrestore() - in the update paths. Note that none of
this matters on 64bit machines. All these are just for 32bit SMP setups.

Note that the interface was introduced way back, its first and currently
only use was recently added by 2d146aa3aa84 ("mm: memcontrol: switch to
rstat"). Stable tagging targets this commit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+

b93af305 27-Jul-2021 John Garry <john.garry@huawei.com>

blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling

If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we
call blk_mq_sched_free_tags() -> blk_mq_free_rqs().

It is incorrect to do so, as any rqs would have already been freed in the
blk_mq_alloc_rqs() call.

Fix by only calling blk_mq_free_rq_map() directly.

Fixes: 6917ff0b5bd41 ("blk-mq-sched: refactor scheduler initialization")
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a022f7d5 09-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
"A combination of changes that ended up depending on both the driver
and core branch (and/or the IDE removal), and a few late arriving
fixes. In detail:

- Fix io ticks wrap-around issue (Chunguang)

- nvme-tcp sock locking fix (Maurizio)

- s390-dasd fixes (Kees, Christoph)

- blk_execute_rq polling support (Keith)

- blk-cgroup RCU iteration fix (Yu)

- nbd backend ID addition (Prasanna)

- Partition deletion fix (Yufen)

- Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)

- Removal of now dead block request types due to IDE removal
(Christoph)

- Loop probing and control device cleanups (Christoph)

- Device uevent fix (Christoph)

- Misc cleanups/fixes (Tetsuo, Christoph)"

* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
block: fix the problem of io_ticks becoming smaller
nvme-tcp: can't set sk_user_data without write_lock
loop: remove unused variable in loop_set_status()
block: remove the bdgrab in blk_drop_partitions
block: grab a device refcount in disk_uevent
s390/dasd: Avoid field over-reading memcpy()
dasd: unexport dasd_set_target_state
block: check disk exist before trying to add partition
ubd: remove dead code in ubd_setup_common
nvme: use return value from blk_execute_rq()
block: return errors from blk_execute_rq()
nvme: use blk_execute_rq() for passthrough commands
block: support polling through blk_execute_rq
block: remove REQ_OP_SCSI_{IN,OUT}
block: mark blk_mq_init_queue_data static
loop: rewrite loop_exit using idr_for_each_entry
loop: split loop_lookup
loop: don't allow deleting an unspecified loop device
loop: move loop_ctl_mutex locking into loop_add
...


a731763f 06-Jul-2021 Yu Kuai <yukuai3@huawei.com>

blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs

We run a test that creates millions of cgroups and blkgs, and then
triggers blkg_destroy_all(). blkg_destroy_all() will hold the spin lock
for a long time in such a situation. Thus release the lock after each
batch of blkgs is destroyed.

blkcg_activate_policy() and blkcg_deactivate_policy() might have the
same problem, however, as they are basically only called from module
init/exit paths, let's leave them alone for now.
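
The batching described above has roughly the following shape (a generic
sketch with hypothetical names obj, destroy_one() and BATCH, not the
actual blkg_destroy_all() code):

restart:
	spin_lock_irq(&q->queue_lock);
	list_for_each_entry_safe(obj, next, &obj_list, node) {
		destroy_one(obj);
		if (--budget == 0) {
			budget = BATCH;
			spin_unlock_irq(&q->queue_lock);
			cond_resched();		/* let other CPUs make progress */
			goto restart;		/* re-walk: the list changed */
		}
	}
	spin_unlock_irq(&q->queue_lock);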

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210707015649.1929797-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d80c228d 05-Jul-2021 Chunguang Xu <brookxu@tencent.com>

block: fix the problem of io_ticks becoming smaller

On the IO submission path, blk_account_io_start() may be interrupted.
By the time the interrupt returns, the value of part->stamp may have
been updated by other cores, so the time value collected before the
interrupt may be less than part->stamp. When this happens, we should
do nothing so that io_ticks stays accurate. For kernels older than 5.0,
this may cause io_ticks to become smaller, which in turn may cause
abnormal ioutil values.
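
In other words, the update should be skipped when the sampled time has
fallen behind the stored stamp; a sketch of the guard (field and helper
names, such as account_io_ticks_delta(), are assumptions):

unsigned long stamp = READ_ONCE(part->bd_stamp);

if (time_after(now, stamp))	/* 'now' was sampled before an interrupt
				 * and may be older than the stamp */
	account_io_ticks_delta(part, now - stamp);
/* else: another CPU already advanced the stamp past 'now'; do nothing */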

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d2500a0c 03-Jul-2021 Martin K. Petersen <martin.petersen@oracle.com>

scsi: blkcg: Fix application ID config options

Commit d2bcbeab4200 ("scsi: blkcg: Add app identifier support for
blkcg") introduced an FC_APPID config option under SCSI. However, the
added config option is not used anywhere. Simply remove it.

The block layer BLK_CGROUP_FC_APPID config option is what actually
controls whether the application ID code should be built or not. Make
this option dependent on NVMe over FC since that is currently the only
transport which supports the capability.

Fixes: d2bcbeab4200 ("scsi: blkcg: Add app identifier support for blkcg")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bd31b9ef 02-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, ibmvfc,
megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
elx and mpi3mr being new drivers.

The major core change is a rework to drop the status byte handling
macros and the old bit shifted definitions and the rest of the updates
are minor fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits)
scsi: aha1740: Avoid over-read of sense buffer
scsi: arcmsr: Avoid over-read of sense buffer
scsi: ips: Avoid over-read of sense buffer
scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe()
scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame()
scsi: elx: libefc: Fix less than zero comparison of a unsigned int
scsi: elx: efct: Fix pointer error checking in debugfs init
scsi: elx: efct: Fix is_originator return code type
scsi: elx: efct: Fix link error for _bad_cmpxchg
scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel()
scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session()
scsi: elx: efct: Fix error handling in efct_hw_init()
scsi: elx: efct: Remove redundant initialization of variable lun
scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected"
scsi: lpfc: Fix build error in lpfc_scsi.c
scsi: target: iscsi: Remove redundant continue statement
scsi: qla4xxx: Remove redundant continue statement
scsi: ppa: Switch to use module_parport_driver()
scsi: imm: Switch to use module_parport_driver()
scsi: mpt3sas: Fix error return value in _scsih_expander_add()
...


4cad6719 02-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm/unaligned.h unification from Arnd Bergmann:
"Unify asm/unaligned.h around struct helper

The get_unaligned()/put_unaligned() helpers are traditionally
architecture specific, with the two main variants being the
"access-ok.h" version that assumes unaligned pointer accesses always
work on a particular architecture, and the "le-struct.h" version that
casts the data to a byte aligned type before dereferencing, for
architectures that cannot always do unaligned accesses in hardware.

Based on the discussion linked below, it appears that the access-ok
version is not realiable on any architecture, but the struct version
probably has no downsides. This series changes the code to use the
same implementation on all architectures, addressing the few
exceptions separately"

Link: https://lore.kernel.org/lkml/75d07691-1e4f-741f-9852-38c0b4f520bc@synopsys.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
Link: https://lore.kernel.org/lkml/20210507220813.365382-14-arnd@kernel.org/
Link: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-rework-v2
Link: https://lore.kernel.org/lkml/CAHk-=whGObOKruA_bU3aPGZfoDqZM1_9wBkwREp0H0FgR-90uQ@mail.gmail.com/

* tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: simplify asm/unaligned.h
asm-generic: uaccess: 1-byte access is always aligned
netpoll: avoid put_unaligned() on single character
mwifiex: re-fix for unaligned accesses
apparmor: use get_unaligned() only for multi-byte words
partitions: msdos: fix one-byte get_unaligned()
asm-generic: unaligned always use struct helpers
asm-generic: unaligned: remove byteshift helpers
powerpc: use linux/unaligned/le_struct.h on LE power7
m68k: select CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
sh: remove unaligned access for sh4a
openrisc: always use unaligned-struct header
asm-generic: use asm-generic/unaligned.h for most architectures


63c38d85 01-Jul-2021 Christoph Hellwig <hch@lst.de>

block: remove the bdgrab in blk_drop_partitions

There is no need to hold a bdev reference when removing the partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210701081638.246552-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

498dcc13 01-Jul-2021 Christoph Hellwig <hch@lst.de>

block: grab a device refcount in disk_uevent

Sending uevents requires the struct device to be alive. To
ensure that, grab the device refcount instead of just an inode
reference.

Fixes: bc359d03c7ec ("block: add a disk_uevent helper")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210701081638.246552-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5cfbd35 09-Jun-2021 Yufen Yu <yuyufen@huawei.com>

block: check disk exist before trying to add partition

If the disk has been deleted, we should return failure for the
BLKPG_DEL_PARTITION ioctl. Otherwise, the directory /sys/class/block
may be left with invalid symlink files. The race is as follows:

blkdev_open
del_gendisk
disk->flags &= ~GENHD_FL_UP;
blk_drop_partitions
blkpg_ioctl
bdev_add_partition
add_partition
device_add
device_add_class_symlinks

The ioctl may call add_partition after del_gendisk() has tried to delete
the partitions. Then, invalid symlink files will be created.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Link: https://lore.kernel.org/r/20210610023241.3646241-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2cfa582b 30-Jun-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- Various DM persistent-data library improvements and fixes that
benefit both the DM thinp and cache targets.

- A few small DM kcopyd efficiency improvements.

- Significant zoned related block core, DM core and DM zoned target
changes that culminate with adding zoned append emulation (which is
required to properly fix DM crypt's zoned support).

- Various DM writecache target changes that improve efficiency. Adds an
optional "metadata_only" feature that only promotes bios flagged with
REQ_META. But the most significant improvement is writecache's
ability to pause writeback, for a configurable time, if/when the
working set is larger than the cache (and the cache is full) -- this
ensures performance is no worse than the slower origin device.

* tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
dm writecache: make writeback pause configurable
dm writecache: pause writeback if cache full and origin being written directly
dm io tracker: factor out IO tracker
dm btree remove: assign new_root only when removal succeeds
dm zone: fix dm_revalidate_zones() memory allocation
dm ps io affinity: remove redundant continue statement
dm writecache: add optional "metadata_only" parameter
dm writecache: add "cleaner" and "max_age" to Documentation
dm writecache: write at least 4k when committing
dm writecache: flush origin device when writing and cache is full
dm writecache: have ssd writeback wait if the kcopyd workqueue is busy
dm writecache: use list_move instead of list_del/list_add in writecache_writeback()
dm writecache: commit just one block, not a full page
dm writecache: remove unused gfp_t argument from wc_add_block()
dm crypt: Fix zoned block device support
dm: introduce zone append emulation
dm: rearrange core declarations for extended use from dm-zone.c
block: introduce BIO_ZONE_WRITE_LOCKED bio flag
block: introduce bio zone helpers
block: improve handling of all zones reset operation
...


fb9b16e1 10-Jun-2021 Keith Busch <kbusch@kernel.org>

block: return errors from blk_execute_rq()

The synchronous blk_execute_rq() had not provided a way for its callers
to know if its request was successful or not. Return the blk_status_t
result of the request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-4-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c01b5a81 10-Jun-2021 Keith Busch <kbusch@kernel.org>

block: support polling through blk_execute_rq

Poll for completions if the request's hctx is a polling type.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-2-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da6269da 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: remove REQ_OP_SCSI_{IN,OUT}

With the legacy IDE driver gone, drivers now use either REQ_OP_DRV_*
or REQ_OP_SCSI_*, so unify the two concepts of passthrough requests
into a single one.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ec780a6 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: mark blk_mq_init_queue_data static

All driver uses are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210624081012.256464-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

44046219 30-Jun-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"Pretty calm round, mostly just NVMe and a bit of MD:

- NVMe updates (via Christoph)
- improve the APST configuration algorithm (Alexey Bogoslavsky)
- look for StorageD3Enable on companion ACPI device
(Mario Limonciello)
- allow selecting the network interface for TCP connections
(Martin Belanger)
- misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
Christoph)
- move the ACPI StorageD3 code to drivers/acpi/ and add quirks
for certain AMD CPUs (Mario Limonciello)
- zoned device support for nvmet (Chaitanya Kulkarni)
- fix the rules for changing the serial number in nvmet
(Noam Gottlieb)
- various small fixes and cleanups (Dan Carpenter, JK Kim,
Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
Uytterhoeven, Daniel Wagner)

- MD updates (Via Song)
- iostats rewrite (Guoqing Jiang)
- raid5 lock contention optimization (Gal Ofri)

- Fall through warning fix (Gustavo)

- Misc fixes (Gustavo, Jiapeng)"

* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
nvmet: use NVMET_MAX_NAMESPACES to set nn value
loop: Fix missing discard support when using LOOP_CONFIGURE
nvme.h: add missing nvme_lba_range_type endianness annotations
nvme: remove zeroout memset call for struct
nvme-pci: remove zeroout memset call for struct
nvmet: remove zeroout memset call for struct
nvmet: add ZBD over ZNS backend support
nvmet: add Command Set Identifier support
nvmet: add nvmet_req_bio put helper for backends
nvmet: add req cns error complete helper
block: export blk_next_bio()
nvmet: remove local variable
nvmet: use nvme status value directly
nvmet: use u32 type for the local variable nsid
nvmet: use u32 for nvmet_subsys max_nsid
nvmet: use req->cmd directly in file-ns fast path
nvmet: use req->cmd directly in bdev-ns fast path
nvmet: make ver stable once connection established
nvmet: allow mn change if subsys not discovered
nvmet: make sn stable once connection was established
...


df668a5f 30-Jun-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

- disk events cleanup (Christoph)

- gendisk and request queue allocation simplifications (Christoph)

- bdev_disk_changed cleanups (Christoph)

- IO priority improvements (Bart)

- Chained bio completion trace fix (Edward)

- blk-wbt fixes (Jan)

- blk-wbt enable/disable fix (Zhang)

- Scheduler dispatch improvements (Jan, Ming)

- Shared tagset scheduler improvements (John)

- BFQ updates (Paolo, Luca, Pietro)

- BFQ lock inversion fix (Jan)

- Documentation improvements (Kir)

- CLONE_IO block cgroup fix (Tejun)

- Removal of ancient and deprecated block dump feature (zhangyi)

- Discard merge fix (Ming)

- Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
block: fix discard request merge
block/mq-deadline: Remove a WARN_ON_ONCE() call
blk-mq: update hctx->dispatch_busy in case of real scheduler
blk: Fix lock inversion between ioc lock and bfqd lock
bfq: Remove merged request already in bfq_requests_merged()
block: pass a gendisk to bdev_disk_changed
block: move bdev_disk_changed
block: add the events* attributes to disk_attrs
block: move the disk events code to a separate file
block: fix trace completion for chained bio
block/partitions/msdos: Fix typo inidicator -> indicator
block, bfq: reset waker pointer with shared queues
block, bfq: check waker only for queues with no in-flight I/O
block, bfq: avoid delayed merge of async queues
block, bfq: boost throughput by extending queue-merging times
block, bfq: consider also creation time in delayed stable merge
block, bfq: fix delayed stable merge check
block, bfq: let also stably merged queues enjoy weight raising
blk-wbt: make sure throttle is enabled properly
blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
...


2705dfb2 27-Jun-2021 Ming Lei <ming.lei@redhat.com>

block: fix discard request merge

ll_new_hw_segment() is reached only in the case of a single-range discard
merge, and we do not actually have a max discard segment size limit, so
it is wrong to run the following check:

if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))

It may always be false, since req->nr_phys_segments is initialized to
one, the bio's segment count is still 1, and blk_rq_get_max_segments(req)
is 1 too.

Fix the issue by not doing the check and bypassing the calculation of the
discard request's nr_phys_segments.

Based on analysis from Wang Shanker.

Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Wang Shanker <shankerwangmiao@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210628023312.1903255-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c06bc5a3 27-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Remove a WARN_ON_ONCE() call

The purpose of the WARN_ON_ONCE() statement in dd_insert_request() is to
verify that dd_prepare_request() cleared rq->elv.priv[0]. Since
dd_prepare_request() is called during request initialization but not if a
request is requeued, a warning is triggered if a request is requeued. Fix
this by removing the WARN_ON_ONCE() statement. This patch suppresses the
following kernel warning:

WARNING: CPU: 28 PID: 432 at block/mq-deadline-main.c:740 dd_insert_request+0x4d4/0x5b0
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
dd_insert_requests+0xfa/0x130
blk_mq_sched_insert_request+0x22c/0x240
blk_mq_requeue_work+0x21c/0x2d0
process_one_work+0x4c2/0xa70
worker_thread+0x2e5/0x6d0
kthread+0x21c/0x250
ret_from_fork+0x1f/0x30

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210627211112.12720-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cb9516be 24-Jun-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: update hctx->dispatch_busy in case of real scheduler

Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
starts to support io batching submission by using hctx->dispatch_busy.

However, that commit did not change blk_mq_update_dispatch_busy() to update
hctx->dispatch_busy, so fix the issue by updating hctx->dispatch_busy when a
real scheduler is in use.

Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fd2ef39c 23-Jun-2021 Jan Kara <jack@suse.cz>

blk: Fix lock inversion between ioc lock and bfqd lock

Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

bfqd -> ioc:
put_io_context+0x33/0x90 -> ioc->lock grabbed
blk_mq_free_request+0x51/0x140
blk_put_request+0xe/0x10
blk_attempt_req_merge+0x1d/0x30
elv_attempt_insert_merge+0x56/0xa0
blk_mq_sched_try_insert_merge+0x4b/0x60
bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
blk_mq_sched_insert_requests+0xd6/0x2b0
blk_mq_flush_plug_list+0x154/0x280
blk_finish_plug+0x40/0x60
ext4_writepages+0x696/0x1320
do_writepages+0x1c/0x80
__filemap_fdatawrite_range+0xd7/0x120
sync_file_range+0xac/0xf0

ioc->bfqd:
bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
exit_io_context+0x48/0x50
do_exit+0x7e9/0xdd0
do_group_exit+0x54/0xc0

To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
free the merged request but rather leave that up to the caller, similarly
to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
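
The resulting pattern in the scheduler insert path can be sketched roughly as
follows (editor's illustration based on the description above; the list
handling and the blk_mq_free_requests() helper are assumptions, not a quote
of the patch):

    LIST_HEAD(free);

    spin_lock_irq(&bfqd->lock);
    if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
        spin_unlock_irq(&bfqd->lock);
        /* Free the merged requests only after bfqd->lock is dropped. */
        blk_mq_free_requests(&free);
        return;
    }
    /* ... normal insertion while still holding bfqd->lock ... */
    spin_unlock_irq(&bfqd->lock);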

a921c655 23-Jun-2021 Jan Kara <jack@suse.cz>

bfq: Remove merged request already in bfq_requests_merged()

Currently, bfq does very little in bfq_requests_merged() and handles all
the request cleanup in bfq_finish_requeue_request() called from
blk_mq_free_request(). That is currently safe only because
blk_mq_free_request() is called shortly after bfq_requests_merged()
while bfqd->lock is still held. However to fix a lock inversion between
bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after
dropping bfqd->lock. That would mean that an already merged request could
be seen by other processes inside bfq queues and possibly dispatched to
the device which is wrong. So move cleanup of the request from
bfq_finish_requeue_request() to bfq_requests_merged().

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0384264e 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: pass a gendisk to bdev_disk_changed

bdev_disk_changed can only operate on whole devices. Make that clear
by passing a gendisk instead of the struct block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210624123240.441814-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

630161cf 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: move bdev_disk_changed

Move bdev_disk_changed to block/partitions/core.c, together with the
rest of the partition scanning code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210624123240.441814-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2bc8cda5 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: add the events* attributes to disk_attrs

Add the events attributes to the disk_attrs array, which ensures they are
added by the driver core when the device is created rather than adding
them after the device has been added, which is racy versus uevents and
requires more boilerplate code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210624073843.251178-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d5870edf 24-Jun-2021 Christoph Hellwig <hch@lst.de>

block: move the disk events code to a separate file

Move the code for handling disk events from genhd.c into a new file
as it isn't very related to the rest of the file while at the same
time requiring lots of forward declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210624073843.251178-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

60b6a7e6 24-Jun-2021 Edward Hsieh <edwardh@synology.com>

block: fix trace completion for chained bio

For chained bio, trace_block_bio_complete in bio_endio is currently called
only by the parent bio once upon all chained bio completed.
However, the sector and size for the parent bio are modified in bio_split.
Therefore, the size and sector of the complete events might not match the
queue events in blktrace.

The original fix of bio completion trace <fbbaf700e7b1> ("block: trace
completion of all bios.") wants multiple complete events to correspond
to one queue event but missed this.

The issue can be reproduced by md/raid5 read with bio cross chunks.

To fix this, move the completion trace into the loop so it is called for every chained bio.

Fixes: fbbaf700e7b1 ("block: trace completion of all bios.")
Reviewed-by: Wade Liang <wadel@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Edward Hsieh <edwardh@synology.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210624123030.27014-1-edwardh@synology.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ddcc5c54 19-Jun-2021 Thomas Bracht Laumann Jespersen <t@laumann.xyz>

block/partitions/msdos: Fix typo inidicator -> indicator

Just a fix for a small typo in msdos_partition().

Signed-off-by: Thomas Bracht Laumann Jespersen <t@laumann.xyz>
Link: https://lore.kernel.org/r/20210619195130.19348-1-t@laumann.xyz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9a2ac41b 19-Jun-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: reset waker pointer with shared queues

Commit 85686d0dc194 ("block, bfq: keep shared queues out of the waker
mechanism") leaves shared bfq_queues out of the waker-detection
mechanism. It attains this goal by not updating the pointer
last_completed_rq_bfqq, if the last request completed belongs to a
shared bfq_queue (so that the pointer will not point to the shared
bfq_queue).

Yet this has a side effect: the pointer last_completed_rq_bfqq keeps
pointing, deceptively, to a bfq_queue that actually is not the last
one to have had a request completed. As a consequence, such a
bfq_queue may deceptively be considered as a waker of some bfq_queue,
even of some shared bfq_queue.

To address this issue, reset last_completed_rq_bfqq if the last
request completed belongs to a shared queue.

Fixes: 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

efc72524 19-Jun-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: check waker only for queues with no in-flight I/O

Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of
Q1 gets completed shortly before a new request arrives for Q2, then
BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new
request may have a different cause, in the following case. If also Q2
has requests in flight while waiting for the arrival of a new request,
then the completion of its own requests may be the actual cause of the
awakening of the process that sends I/O to Q2. So Q1 may be flagged
wrongly as a candidate waker.

This commit avoids this deceptive flagging, by disabling
candidate-waker flagging for Q2, if Q2 has in-flight I/O.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd3664b3 19-Jun-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: avoid delayed merge of async queues

Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
queues"), BFQ may schedule a merge between a newly created sync
bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this
goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq
of the bic associated with Q2. So, when the time for the possible merge
arrives, BFQ knows which bfq_queue to merge Q2 with. In particular,
BFQ checks for possible merges on request arrivals.

Yet the same bic may also be associated with an async bfq_queue, say
Q3. So, if a request for Q3 arrives, then the above check may happen
to be executed while the bfq_queue at hand is Q3, instead of Q2. In
this case, Q1 happens to be merged with an async bfq_queue. This is
not only a conceptual mistake, because async queues are to be kept out
of queue merging, but also a bug that leads to inconsistent states.

This commit simply filters async queues out of delayed merges.

Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7812472f 19-Jun-2021 Pietro Pedroni <pedroni.pietro.96@gmail.com>

block, bfq: boost throughput by extending queue-merging times

One of the methods with which bfq boosts throughput is by merging queues.
One of the merging variants in bfq is the stable merge.
This mechanism is activated between two queues only if they are created
within a certain maximum time T1 from each other.
Merging can happen soon or be delayed. In the second case, before
merging, bfq needs to evaluate a throughput-boost parameter that
indicates whether the queue generates a high throughput when served alone.
Merging occurs when this throughput-boost is not high enough.
In particular, this parameter is evaluated and late merging may occur
only after at least a time T2 from the creation of the queue.

Currently T1 and T2 are set to 180ms and 200ms, respectively.
In this way the merging mechanism rarely triggers because the time window is
not long enough. This results in a noticeable lowering of the overall throughput
with some workloads (see the example below).

This commit introduces two constants bfq_activation_stable_merging and
bfq_late_stable_merging in order to increase the duration of T1 and T2.
Both the stable merging activation time and the late merging
time are set to 600ms. This value has been experimentally evaluated
using sqlite benchmark in the Phoronix Test Suite on a HDD.
The duration of the benchmark before this fix was 111.02s, while now
it has reached 97.02s, a better result than that of all the other
schedulers.

Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d4f49983 19-Jun-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: consider also creation time in delayed stable merge

Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
queues"), BFQ may schedule a merge between a newly created sync
bfq_queue and the last sync bfq_queue created. Such a merging is not
performed immediately, because BFQ needs first to find out whether the
newly created queue actually reaches a higher throughput if not merged
at all (and in that case BFQ will not perform any stable merging). To
check that, a little time must be waited after the creation of the new
queue, so that some I/O can flow in the queue, and statistics on such
I/O can be computed.

Yet, to evaluate the above waiting time, the last split time is
considered as start time, instead of the creation time of the
queue. This is a mistake, because considering the split time is
correct only in the following scenario.

The queue undergoes a non-stable merge on the arrival of its very
first I/O request, due to close I/O with some other queue. While the
queue is merged for close I/O, stable merging is not considered. Yet
the queue may then happen to be split, if the close I/O finishes (or
happens to be a false positive). From this time on, the queue can
again be considered for stable merging. But, again, a little time must
elapse, to let some new I/O flow in the queue and to get updated
statistics. To wait for this time, the split time is to be taken into
account.

Yet, if the queue does not undergo a non-stable merge on the arrival
of its very first request, then BFQ immediately checks whether the
stable merge is to be performed. It happens because the split time for
a queue is initialized to minus infinity when the queue is created.

This commit fixes this mistake by adding the missing condition. Now
the check for delayed stable-merge is performed after a little time has
elapsed not only from the last queue split time, but also from the
creation time of the queue.

Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e03f2ab7 19-Jun-2021 Luca Mariotti <mariottiluca1@hotmail.it>

block, bfq: fix delayed stable merge check

When attempting to schedule a merge of a given bfq_queue with the currently
in-service bfq_queue or with a cooperating bfq_queue among the scheduled
bfq_queues, delayed stable merge is checked for rotational or non-queueing
devs. For this stable merge to be performed, some conditions must be met.
If the current bfq_queue underwent some split from some merged bfq_queue,
one of these conditions is that two hundred milliseconds must elapse from
split, otherwise this condition is always met.

Unfortunately, by mistake, time_is_after_jiffies() was written instead of
time_is_before_jiffies() for this check, verifying that less than two
hundred milliseconds have elapsed instead of verifying that at least two
hundred milliseconds have elapsed.

Fix this issue by replacing time_is_after_jiffies() with
time_is_before_jiffies().

Signed-off-by: Luca Mariotti <mariottiluca1@hotmail.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

511a26992 19-Jun-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: let also stably merged queues enjoy weight raising

Merged bfq_queues are kept out of weight-raising (low-latency)
mechanisms. The reason is that these queues are usually created for
non-interactive and non-soft-real-time tasks. Yet this is not the case
for stably-merged queues. These queues are merged just because they
are created shortly after each other. So they may easily serve the I/O
of an interactive or soft-real time application, if the application
happens to spawn multiple processes.

To address this issue, this commit lets stably-merged queues also
enjoy weight raising.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

76a80408 19-Jun-2021 Zhang Yi <yi.zhang@huawei.com>

blk-wbt: make sure throttle is enabled properly

After commit a79050434b45 ("blk-rq-qos: refactor out common elements of
blk-wbt"), if throttle was disabled by wbt_disable_default(), we could
not enable again, fix this by set enable_state back to
WBT_STATE_ON_DEFAULT.

Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1d0903d6 19-Jun-2021 Zhang Yi <yi.zhang@huawei.com>

blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()

We currently disable wbt by simply zeroing out rwb->wb_normal in
wbt_disable_default() when switching the elevator to bfq, but that is not
safe because rwb_enabled() can report a false positive if we change the
queue depth. If that false positive occurs between wbt_wait() and
wbt_track() while submitting a write request, rqw->inflight is dropped to
-1 in wbt_done(), which ends up triggering an IO hang. Fix this issue by
introducing a new state that means wbt has been disabled.

Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
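
The idea can be sketched as an explicit enable/disable state instead of a
zeroed wb_normal (editor's illustration; the enumerator names below are
assumptions):

    enum {
        WBT_STATE_ON_DEFAULT  = 1,  /* enabled by default */
        WBT_STATE_ON_MANUAL   = 2,  /* enabled via sysfs */
        WBT_STATE_OFF_DEFAULT = 3,  /* disabled, e.g. when BFQ is selected */
    };

    /* rwb_enabled() can then test the state rather than wb_normal != 0,
     * so a queue-depth change can no longer produce a false positive.
     */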

fb926032 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Prioritize high-priority requests

While one or more requests with a certain I/O priority are pending, do not
dispatch lower priority requests. Dispatch lower priority requests anyway
after the "aging" time has expired.

This patch has been tested as follows:

modprobe scsi_debug ndelay=1000000 max_queue=16 &&
sd='' &&
while [ -z "$sd" ]; do
sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done &&
echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
cd /sys/fs/cgroup/blkio/ &&
echo $$ >cgroup.procs &&
echo restrict-to-be >blkio.prio.class &&
mkdir -p hipri &&
cd hipri &&
echo none-to-rt >blkio.prio.class &&
{ max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
echo $$ >cgroup.procs &&
max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt

Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job

If the aging expiry time is changed from 100s to 0, the IOPS results change
to 6712 and 6796 IOPS.

The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

08a9ad8b 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Add cgroup support

Maintain statistics per cgroup and export these to user space. These
statistics are essential for verifying whether the proper I/O priorities
have been assigned to requests. An example of the statistics data with
this patch applied:

$ cat /sys/fs/cgroup/io.stat
11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

38ba64d1 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Track I/O statistics

Track I/O statistics per I/O priority and export these statistics to
debugfs. These statistics help developers of the deadline scheduler.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-15-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c807ab52 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Add I/O priority support

Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE
and IDLE. Maintain statistics for each priority level. Split the debugfs
attributes per priority level as follows:

$ ls /sys/kernel/debug/block/.../sched/
async_depth dispatch2 read_next_rq write2_fifo_list
batching read0_fifo_list starved write_next_rq
dispatch0 read1_fifo_list write0_fifo_list
dispatch1 read2_fifo_list write1_fifo_list

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-14-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d672d325 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Micro-optimize the batching algorithm

When dispatching the first request of a batch, the deadline_move_request()
call clears .next_rq[] for the opposite data direction. .next_rq[] is not
restored when changing data direction. Fix this by not clearing .next_rq[]
and by keeping track of the data direction of a batch in a variable instead.

This patch is a micro-optimization because:
- The number of deadline_next_request() calls for the read direction is
halved.
- The number of times that deadline_next_request() returns NULL is reduced.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-13-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07757588 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests

For interactive workloads it is important that synchronous requests are
not delayed. Hence reserve 25% of scheduler tags for synchronous requests.
This patch still allows asynchronous requests to fill the hardware queues
since blk_mq_init_sched() makes sure that the number of scheduler requests
is double the hardware queue depth. From blk_mq_init_sched():

q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
BLKDEV_MAX_RQ);

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-12-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
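
A rough sketch of how the reservation can be expressed (editor's
illustration; dd->async_depth and the hook wiring are assumptions based on
the description above):

    /* Cap asynchronous allocations at ~75% of the scheduler tag space so
     * that ~25% of the tags stay available for synchronous requests.
     * dd->async_depth is assumed to be set to 3 * q->nr_requests / 4 when
     * the scheduler is initialized.
     */
    static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
    {
        struct deadline_data *dd = data->q->elevator->elevator_data;

        if (!op_is_sync(op))
            data->shallow_depth = dd->async_depth;
    }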

d6d7f013 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Improve the sysfs show and store macros

Define separate macros for integers and jiffies to improve readability.
Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol().
The former functions are the recommended functions.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-11-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

004a26b3 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Improve compile-time argument checking

Modern compilers complain if an out-of-range value is passed to a function
argument that has an enumeration type. Let the compiler detect out-of-range
data direction arguments instead of verifying the data_dir argument at
runtime.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-10-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3e9a99eb 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()

Change "queue" into "sched" to make the function names reflect better the
purpose of these functions.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-9-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2f295bea 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Remove two local variables

Make __dd_dispatch_request() easier to read by removing two local
variables.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-8-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3bd473f4 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Add two lockdep_assert_held() statements

Document the locking strategy by adding two lockdep_assert_held()
statements.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-7-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
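
For reference, the documentation effect of such an assertion looks like this
(editor's illustration with a hypothetical helper):

    /* Hypothetical helper: lockdep documents the requirement and, with
     * CONFIG_PROVE_LOCKING, verifies at runtime that dd->lock is held.
     */
    static void dd_requeue_locked(struct deadline_data *dd, struct request *rq)
    {
        lockdep_assert_held(&dd->lock);
        list_add(&rq->queuelist, &dd->dispatch);
    }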

46eae2e3 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/mq-deadline: Add several comments

Make the code easier to read by adding more comments.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-6-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

556910e3 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block: Introduce the ioprio rq-qos policy

Introduce an rq-qos policy that assigns an I/O priority to requests based
on blk-cgroup configuration settings. This policy has the following
advantages over the ioprio_set() system call:
- This policy is cgroup based so it has all the advantages of cgroups.
- While ioprio_set() does not affect page cache writeback I/O, this rq-qos
controller affects page cache writeback I/O for filesystems that support
associating a cgroup with writeback I/O. See also
Documentation/admin-guide/cgroup-v2.rst.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fb44023e 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/blk-rq-qos: Move a function from a header file into a C file

rq_qos_id_to_name() is only used in blk-mq-debugfs.c so move that function
into blk-mq-debugfs.c.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20210618004456.7280-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19688d7f 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls

Before adding more calls in this function, simplify the error path.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210618004456.7280-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5f6776ba 17-Jun-2021 Bart Van Assche <bvanassche@acm.org>

block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive

These entries were consecutive at the time of their introduction but are no
longer consecutive. Make these again consecutive. Additionally, modify the
help text since it refers to blk-mq and since the legacy block layer has
been removed.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20210618004456.7280-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a79da21b 17-Jun-2021 lijiazi <jqqlijiazi@gmail.com>

blk-wbt: remove outdated comment

Now wbt_wait() returns void, so remove now outdated comment.

Signed-off-by: lijiazi <lijiazi@xiaomi.com>
Link: https://lore.kernel.org/r/1623986240-13878-1-git-send-email-lijiazi@xiaomi.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e42cfb1d 17-Jun-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: Remove unnecessary elevator operation checks

The insert_requests and dispatch_request elevator operations are
mandatory for the correct execution of an elevator, and all implemented
elevators (bfq, kyber and mq-deadline) implement them. As a result,
there is no need to check for these operations before calling them when
a queue has an elevator set. This simplifies the code in
__blk_mq_sched_dispatch_requests() and blk_mq_sched_insert_request().

To prevent out-of-tree elevators from crashing the kernel in case of a bad
implementation, add a check in elv_register() to verify that these
operations are implemented.

A small, probably not significant, IOPS improvement of 0.1% is observed
with this patch applied (4.117 MIOPS to 4.123 MIOPS, average of 20 fio
runs doing 4K random direct reads with psync and 32 jobs).

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210618015922.713999-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f0c1c4d2 09-Jun-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: fix use-after-free in blk_mq_exit_sched

The tagset can't be used after blk_cleanup_queue() returns, because
freeing the tagset usually follows blk_cleanup_queue(). Commit d97e594c5166
("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") adds a
check on q->tag_set->flags in blk_mq_exit_sched(), and causes a
use-after-free.

Fix it by using hctx->flags instead.

Reported-by: syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com
Fixes: d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap")
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210609063046.122843-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2f064a59 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Change task_struct::state

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

d6c23bb3 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Add get_current_state()

Remove yet another few p->state accesses.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org

b03fbd4f 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Introduce task_is_running()

Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org

c28a6147 09-Jun-2021 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

block: export blk_next_bio()

The block layer provides emulation of zone management operations
targeting all zones of a zoned block device only for the zone reset
operation (REQ_OP_ZONE_RESET). In order to correctly implement
exporting of zoned block devices with NVMeOF, emulating zone management
operations targeting all zones of a device is also necessary for the
open, close and finish zone operations (REQ_OP_ZONE_OPEN,
REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH).

Instead of duplicating the code, export the existing helper from block
layer so we can use a bio chaining pattern that is present in the block
layer for REQ_OP_ZONE_RESET_ALL emulation in the NVMeOF zoned block
device backend.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

a72c374f 08-Jun-2021 Ming Lei <ming.lei@redhat.com>

block: mark queue init done at the end of blk_register_queue

Mark queue init done when everything has gone well in blk_register_queue(),
so that wbt_enable_default() can run quickly without any RCU period
involved, since adding rq qos requires freezing the queue.

There is also no side effect from delaying the point at which queue init
is marked done.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/r/20210609015822.103433-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2cafe29a 08-Jun-2021 Ming Lei <ming.lei@redhat.com>

block: fix race between adding/removing rq qos and normal IO

Yi reported several kernel panics on:

[16687.001777] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
...
[16687.163549] pc : __rq_qos_track+0x38/0x60

or

[ 997.690455] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
...
[ 997.850347] pc : __rq_qos_done+0x2c/0x50

It turns out this is caused by a race between adding rq qos (wbt) and normal
IO, because rq_qos_add() can run while IO is being submitted. Fix this issue
by freezing the queue before adding/deleting rq qos to/from the queue.

rq_qos_exit() need not freeze the queue because it is called after the queue
has been frozen.

iolatency calls rq_qos_add() while allocating the queue, so freezing won't
add delay because the queue usage refcount works in atomic mode at that
time.

iocost calls rq_qos_add() when writing a cgroup attribute file; it is fine
to freeze the queue at that time since we usually freeze the queue when
storing to a queue sysfs attribute, and iocost only exists on the root
cgroup.

wbt_init calls it in blk_register_queue() and in the queue sysfs attribute
store path (queue_wb_lat_store(), when it is written for the first time in
case of !BLK_WBT_MQ); the following patch will speed up the queue freezing
in wbt_init.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/r/20210609015822.103433-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
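
The pattern the fix establishes can be sketched as follows (editor's
illustration; whether the freeze lives in the callers or inside
rq_qos_add()/rq_qos_del() themselves is an implementation detail):

    /* Quiesce in-flight IO before the rq_qos list is modified. */
    blk_mq_freeze_queue(q);
    rq_qos_add(q, rqos);        /* or rq_qos_del(q, rqos) */
    blk_mq_unfreeze_queue(q);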

08c1d480 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: remove blk_mq_init_sq_queue

All users are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b461dfc4 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: add the blk_mq_alloc_disk APIs

Add a new API to allocate a gendisk including the request_queue for use
with blk-mq based drivers. This is to avoid boilerplate code in drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
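
A minimal sketch of the intended driver-side usage (editor's illustration;
the ops structure and error handling are assumptions):

    struct gendisk *disk;

    disk = blk_mq_alloc_disk(&dev->tag_set, dev);
    if (IS_ERR(disk))
        return PTR_ERR(disk);
    disk->fops = &my_block_ops;
    /* set capacity and queue limits, then add_disk(disk) */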

26a9750a 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: improve the blk_mq_init_allocated_queue interface

Don't return the passed in request_queue but a normal error code, and
drop the elevator_init argument in favor of just calling elevator_init_mq
directly from dm-rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cdb14e0f 02-Jun-2021 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a blk_mq_alloc_sq_tag_set helper

Factor out a helper to initialize a simple single hw queue tag_set from
blk_mq_init_sq_queue. This will allow phasing out blk_mq_init_sq_queue
in favor of a more symmetric and general API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d2bcbeab 07-Jun-2021 Muneendra Kumar <muneendra.kumar@broadcom.com>

scsi: blkcg: Add app identifier support for blkcg

Add a unique application identifier (i.e. the fc_app_id member) in blkcg.
This allows identification of traffic belonging to a specific application
both on the host and in the fabric infrastructure. As an example, this
allows the storage stack to uniquely identify traffic belonging to a
particular virtual machine.

Link: https://lore.kernel.org/r/20210608043556.274139-3-muneendra.kumar@broadcom.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

11c7aa0d 07-Jun-2021 Jan Kara <jack@suse.cz>

rq-qos: fix missed wake-ups in rq_qos_throttle try two

Commit 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:

CPU1 (waiter1):  rq_qos_wait()
                 acquire_inflight_cb() -> fails
CPU2 (waiter2):  rq_qos_wait()
                 acquire_inflight_cb() -> fails
CPU3 (waker):    completes IOs, inflight decreased
CPU1 (waiter1):  prepare_to_wait_exclusive()
CPU2 (waiter2):  prepare_to_wait_exclusive()
CPU1 (waiter1):  has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
CPU2 (waiter2):  has_sleeper = !wq_has_single_sleeper() -> true
CPU1 (waiter1):  io_schedule()
CPU2 (waiter2):  io_schedule()

Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.

Fixes: 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1ee533ec 25-May-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: improve handling of all zones reset operation

SCSI, ZNS and null_blk zoned devices support resetting all zones using
a single command (REQ_OP_ZONE_RESET_ALL), as indicated using the device
request queue flag QUEUE_FLAG_ZONE_RESETALL. This flag is not set for
device mapper targets creating zoned devices. In this case, a user
request for resetting all zones of a device is processed in
blkdev_zone_mgmt() by issuing a REQ_OP_ZONE_RESET operation for each
zone of the device. This leads to different behaviors of the
BLKRESETZONE ioctl() depending on the target device support for the
reset all operation. E.g.

blkzone reset /dev/sdX

will reset all zones of a SCSI device using a single command that will
ignore conventional, read-only or offline zones.

But a dm-linear device including conventional, read-only or offline
zones cannot be reset in the same manner as some of the single zone
reset operations issued by blkdev_zone_mgmt() will fail. E.g.:

blkzone reset /dev/dm-Y
blkzone: /dev/dm-0: BLKRESETZONE ioctl failed: Remote I/O error

To simplify applications and tools development, unify the behavior of
the all-zone reset operation by modifying blkdev_zone_mgmt() to not
issue a zone reset operation for conventional, read-only and offline
zones, thus mimicking what an actual reset-all device command does on a
device supporting REQ_OP_ZONE_RESET_ALL. This emulation is done using
the new function blkdev_zone_reset_all_emulated(). The zones needing a
reset are identified using a bitmap that is initialized using a zone
report. Since empty zones do not need a reset, also ignore these zones.
The function blkdev_zone_reset_all() is introduced for block devices
natively supporting reset all operations. blkdev_zone_mgmt() is modified
to call either function to execute an all zone reset request.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
[hch: split into multiple functions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
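
The selection of zones needing a reset can be sketched with a report_zones
callback along these lines (editor's illustration; the callback name and the
surrounding plumbing are assumptions):

    static int need_reset_cb(struct blk_zone *zone, unsigned int idx, void *data)
    {
        unsigned long *need_reset = data;

        switch (zone->cond) {
        case BLK_ZONE_COND_EMPTY:
        case BLK_ZONE_COND_READONLY:
        case BLK_ZONE_COND_OFFLINE:
            /* Nothing to reset for these zones. */
            break;
        default:
            /* Conventional zones have no write pointer to reset either. */
            if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL)
                __set_bit(idx, need_reset);
            break;
        }
        return 0;
    }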

7cc2623d 19-May-2021 Bart Van Assche <bvanassche@acm.org>

block: Update blk_update_request() documentation

Although the original intent was to use blk_update_request() in stacking
block drivers only, it is used much more widely today. Reflect this in the
documentation block above this function. See also:
* commit 32fab448e5e8 ("block: add request update interface").
* commit 2e60e02297cf ("block: clean up request completion API").
* commit ed6565e73424 ("block: handle partial completions for special
payload requests").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210519175226.8853-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

61347154 02-Jun-2021 Jan Kara <jack@suse.cz>

block: Do not pull requests from the scheduler when we cannot dispatch them

Provided the device driver does not implement dispatch budget accounting
(which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
requests from the IO scheduler as long as it is willing to give out any.
That defeats scheduling heuristics inside the scheduler by creating
false impression that the device can take more IO when it in fact
cannot.

For example with BFQ IO scheduler on top of virtio-blk device setting
blkio cgroup weight has barely any impact on observed throughput of
async IO because __blk_mq_do_dispatch_sched() always sucks out all the
IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
when that is all dispatched, it will give out IO of lower weight cgroups
as well. And then we have to wait for all this IO to be dispatched to
the disk (which means lot of it actually has to complete) before the
IO scheduler is queried again for dispatching more requests. This
completely destroys any service differentiation.

So grab a request tag for a request pulled out of the IO scheduler already
in __blk_mq_do_dispatch_sched(), and do not pull any more requests if we
cannot get it, because we are unlikely to be able to dispatch them. That
way only a single request is going to wait in the dispatch list for a
tag to free up.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
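
Schematically, the dispatch loop change can be pictured like this (editor's
sketch, not a quote of the patch):

    rq = e->type->ops.dispatch_request(hctx);
    if (!rq)
        break;
    /* Grab a driver tag now; if none is available, keep the request on
     * the dispatch list and stop pulling more from the scheduler.
     */
    if (!blk_mq_get_driver_tag(rq)) {
        list_add(&rq->queuelist, &rq_list);
        break;
    }
    list_add_tail(&rq->queuelist, &rq_list);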

0e0ccdec 25-May-2021 Christoph Hellwig <hch@lst.de>

block: remove bdget_disk

Just opencode the xa_load in the callers, as none of them actually
needs a reference to the bdev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c97d93c3 25-May-2021 Christoph Hellwig <hch@lst.de>

block: factor out a part_devt helper

Add a helper to find the dev_t for a disk + partno tuple.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ab4b5705 25-May-2021 Christoph Hellwig <hch@lst.de>

block: move bd_part_count to struct gendisk

The bd_part_count value only makes sense for whole devices, so move it
to struct gendisk and give it a more descriptive name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8698707 25-May-2021 Christoph Hellwig <hch@lst.de>

block: move bd_mutex to struct gendisk

Replace the per-block device bd_mutex with a per-gendisk open_mutex,
thus simplifying locking wherever we deal with partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da7ba729 20-May-2021 Christoph Hellwig <hch@lst.de>

block: unexport blk_alloc_queue

blk_alloc_queue is just an internal helper now, unexport it and remove
it from the public header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f525464a 20-May-2021 Christoph Hellwig <hch@lst.de>

block: add blk_alloc_disk and blk_cleanup_disk APIs

Add two new APIs to allocate and free a gendisk including the
request_queue for use with BIO based drivers. This is to avoid
boilerplate code in drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
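
A minimal sketch of a BIO-based driver using the pair (editor's illustration;
the ops structure and teardown placement are assumptions):

    struct gendisk *disk;

    disk = blk_alloc_disk(NUMA_NO_NODE);
    if (!disk)
        return -ENOMEM;
    disk->fops = &my_bio_ops;
    /* configure disk->queue limits, set capacity, then add_disk(disk) */

    /* ... and on teardown: */
    blk_cleanup_disk(disk);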

958229a7 20-May-2021 Christoph Hellwig <hch@lst.de>

block: add a flag to make put_disk on partially initialized disks safer

Add a flag to indicate that __device_add_disk did grab a queue reference
so that disk_release only drops it if we actually had it. This sorts
out one of the major pitfalls with partially initialized gendisks that
a lot of drivers got wrong or still do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0d1feb72 20-May-2021 Christoph Hellwig <hch@lst.de>

block: automatically enable GENHD_FL_EXT_DEVT

Automatically set the GENHD_FL_EXT_DEVT flag for all disks allocated
without an explicit number of minors. This is what all new block
drivers should do, so make sure it is the default without boilerplate
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e3c73fa 20-May-2021 Christoph Hellwig <hch@lst.de>

block: move the DISK_MAX_PARTS sanity check into __device_add_disk

Keep this together with the first place that actually looks at
->minors and prepare for not passing a minors argument to
alloc_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7c3f828b 20-May-2021 Christoph Hellwig <hch@lst.de>

block: refactor device number setup in __device_add_disk

Untangle the mess around blk_alloc_devt by moving the check for
the used allocation scheme into the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

54cf31d0 27-Apr-2021 Hannes Reinecke <hare@suse.de>

scsi: core: Drop message byte helper

The message byte is now unused, so we can drop the helper to set the
message byte and the check for message bytes during error recovery.

Link: https://lore.kernel.org/r/20210427083046.31620-38-hare@suse.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

54c29086 27-Apr-2021 Hannes Reinecke <hare@suse.com>

scsi: core: Drop the now obsolete driver_byte definitions

The driver_byte field in the result is now unused, so we can drop the
definitions.

Link: https://lore.kernel.org/r/20210427083046.31620-15-hare@suse.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

464a00c9 27-Apr-2021 Hannes Reinecke <hare@suse.de>

scsi: core: Kill DRIVER_SENSE

Replace the check for DRIVER_SENSE with a check for
scsi_status_is_check_condition().

Audit all callsites to ensure the SAM status is set correctly. For
backwards compatibility move the DRIVER_SENSE definition to sg.h, and update
sg, bsg, and scsi_ioctl to set the DRIVER_SENSE driver_status whenever
SAM_STAT_CHECK_CONDITION is present.

[mkp: fix zeroday srp warning]

Link: https://lore.kernel.org/r/20210427083046.31620-10-hare@suse.de
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

21eccf30 27-Apr-2021 Hannes Reinecke <hare@suse.de>

scsi: scsi_ioctl: Return error code when blk_rq_map_kern() fails

The callers of sg_scsi_ioctl() already check for negative return values, so
we can drop the usage of DRIVER_ERROR and return the error from
blk_rq_map_kern() instead.

Link: https://lore.kernel.org/r/20210427083046.31620-3-hare@suse.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

d97e594c 13-May-2021 John Garry <john.garry@huawei.com>

blk-mq: Use request queue-wide tags for tagset-wide sbitmap

The tags used for an IO scheduler are currently per hctx.

As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.

This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.

Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

In that scenario, since the scheduler tag is acquired first, a lot of
contention is introduced because a driver tag may not be available once
we have got the sched tag.

Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).

For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init it at the max possible size, so we don't
need to deal with swapping in a new sbitmap for the old one if we need
to grow.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

56b68085 13-May-2021 John Garry <john.garry@huawei.com>

blk-mq: Some tag allocation code refactoring

The tag allocation code to alloc the sbitmap pairs is common for regular
bitmap tags and shared sbitmaps, so refactor it into a common function.

Also remove superfluous "flags" argument from blk_mq_init_shared_sbitmap().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

364b6181 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: clearing flush request reference in tags->rqs[]

Before we free the request queue, clear the flush request reference in
tags->rqs[], so that a potential UAF can be avoided.

Based on one patch written by David Jeffery.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd63141d 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: clear stale request in tags->rq[] before freeing one request pool

refcount_inc_not_zero() in bt_tags_iter() may still read a freed
request.

Fix the issue by the following approach:

1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero() in bt_tags_iter()

2) clear the stale request referred to via ->rqs[tag] before freeing the
request pool; the per-tags spinlock is held while clearing the stale
->rqs[tag]

So after we have cleared the stale requests, bt_tags_iter() won't observe
a freed request any more, and the clearing also waits for any pending
request references.

The idea of clearing ->rqs[] is borrowed from John Garry's previous
patch and a recent patch from David.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e315dc0 11-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter

Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter();
this prevents the request from being re-used while ->fn is running.
The approach is the same as what we do during timeout handling.

This fixes request use-after-free (UAF) related to the completion race or
queue releasing:

- If a rq is referenced before rq->q is frozen, then the queue won't be
frozen before the request is released during iteration.

- If a rq is referenced after rq->q is frozen, refcount_inc_not_zero()
will return false, and we won't iterate over this request.

However, one request UAF is still not covered: refcount_inc_not_zero()
may read a freed request; that is handled in the next patch.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
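
A hedged sketch of the reference-grabbing pattern described above; the helper
name is illustrative, not the literal patch:

static bool sketch_iter_rq(struct request *rq, busy_tag_iter_fn *fn,
                           void *data, bool reserved)
{
        bool ret = true;

        /* a request whose queue is already being torn down has ref == 0 */
        if (!refcount_inc_not_zero(&rq->ref))
                return true;

        ret = fn(rq, data, reserved);
        blk_mq_put_rq_ref(rq);          /* drop the reference taken above */
        return ret;
}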

84da7acc 11-May-2021 Ming Lei <ming.lei@redhat.com>

block: avoid double io accounting for flush request

For a flush request, rq->end_io() may be called twice: once from
timeout handling (blk_mq_check_expired()) and once from normal
completion (__blk_mq_end_request()).

Move blk_account_io_flush() to after flush_rq->ref drops to zero, so
that IO accounting is done only once per flush request.

Fixes: b68663186577 ("block: add iostat counters for flush requests")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
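
A minimal sketch of the ordering this describes; the function name is
illustrative and the real change sits in the flush-request end_io path:

static void sketch_flush_end(struct request *flush_rq)
{
        /* rq->end_io() can run twice (timeout + normal completion); only
         * the final reference drop should account the flush */
        if (!refcount_dec_and_test(&flush_rq->ref))
                return;

        blk_account_io_flush(flush_rq);
}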

8c390ff9 11-May-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

block: remove unneeded parenthesis from blk-sysfs

Align to common code conventions.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20210511155319.1885277-1-mgurtovoy@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5f3352e 11-May-2021 Tejun Heo <tj@kernel.org>

blkcg: drop CLONE_IO check in blkcg_can_attach()

blkcg has always rejected attaching if any of the member tasks has a shared
io_context. The rationale was that io_contexts can be shared across
different cgroups making it impossible to define what the appropriate
control behavior should be. However, this check causes more problems than it
solves:

* The check prevents controller enable and migrations but not CLONE_IO
itself, which can lead to surprises as the outcome changes depending on
the order of operations.

* Sharing within a cgroup is fine but the check can't distinguish that. This
leads to unnecessary conflicts with the recent CLONE_IO usage in io_uring.

io_context sharing doesn't make any difference for rq_qos based controllers
and the way it's used is safe as long as tasks aren't migrated dynamically
which is the vast majority of use cases. While we can try to make the check
more precise to avoid false positives, the added complexity doesn't seem
worthwhile. Let's just drop blkcg_can_attach().

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YJrTvHbrRDbJjw+S@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3af3d772 12-Mar-2021 zhangyi (F) <yi.zhang@huawei.com>

block_dump: remove block_dump feature

We have already deleted the block_dump feature in mark_inode_dirty() because
it can be replaced by tracepoints; now we also remove the part in
submit_bio() for the same reason. The block_dump code in submit_bio()
dumps the writing process, write region and sectors on the target disk
into the kernel log. It can be replaced by the block_bio_queue
tracepoint in submit_bio_checks(), so we do not need block_dump anymore;
remove the whole block_dump feature.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4ff2473b 22-May-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Fix BLKRRPART and deletion race (Gulam, Christoph)

- NVMe pull request (Christoph):
- nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith
Busch)
- nvme-fc teardown fix (James Smart)
- nvmet/nvme-loop memory leak fixes (Wu Bo)"

* tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
block: fix a race between del_gendisk and BLKRRPART
block: prevent block device lookups at the beginning of del_gendisk
nvme-fc: clear q_live at beginning of association teardown
nvme-tcp: rerun io_work if req_list is not empty
nvme-tcp: fix possible use-after-completion
nvme-loop: fix memory leak in nvme_loop_create_ctrl()
nvmet: fix memory leak in nvmet_alloc_ctrl()


6c60ff04 14-May-2021 Christoph Hellwig <hch@lst.de>

block: prevent block device lookups at the beginning of del_gendisk

As an artifact of how gendisk lookup used to work in earlier kernels,
GENHD_FL_UP is only cleared very late in del_gendisk, and a global lock
is used to prevent opens from succeeding while del_gendisk is tearing
down the gendisk. Switch to clearing the flag early and under bd_mutex
so that callers can use bd_mutex to stabilize the flag, which removes
the need for the global mutex.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1b177499 07-May-2021 Arnd Bergmann <arnd@arndb.de>

partitions: msdos: fix one-byte get_unaligned()

A simplification of get_unaligned() clashes with callers that pass
in a character pointer, causing a harmless warning like:

block/partitions/msdos.c: In function 'msdos_partition':
include/asm-generic/unaligned.h:13:22: warning: 'packed' attribute ignored for field of type 'u8' {aka 'unsigned char'} [-Wattributes]

Remove the SYS_IND() macro with the get_unaligned() call
and just use the ->ind field directly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

8f4ae0f6 15-May-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Fix for shared tag set exit (Bart)

- Correct ioctl range for zoned ioctls (Damien)

- Removed dead/unused function (Lin)

- Fix perf regression for shared tags (Ming)

- Fix out-of-bounds issue with kyber and preemption (Omar)

- BFQ merge fix (Paolo)

- Two error handling fixes for nbd (Sun)

- Fix weight update in blk-iocost (Tejun)

- NVMe pull request (Christoph):
- correct the check for using the inline bio in nvmet (Chaitanya
Kulkarni)
- demote unsupported command warnings (Chaitanya Kulkarni)
- fix corruption due to double initializing ANA state (me, Hou Pu)
- reset ns->file when open fails (Daniel Wagner)
- fix a NULL deref when SEND is completed with error in nvmet-rdma
(Michal Kalderon)

- Fix kernel-doc warning (Bart)

* tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block:
block/partitions/efi.c: Fix the efi_partition() kernel-doc header
blk-mq: Swap two calls in blk_mq_exit_queue()
blk-mq: plug request for shared sbitmap
nvmet: use new ana_log_size instead the old one
nvmet: reset ns->file when open fails
nbd: share nbd_put and return by goto put_nbd
nbd: Fix NULL pointer in flush_workqueue
blkdev.h: remove unused codes blk_account_rq
block, bfq: avoid circular stable merges
blk-iocost: fix weight updates of inner active iocgs
nvmet: demote fabrics cmd parse err msg to debug
nvmet: use helper to remove the duplicate code
nvmet: demote discovery cmd parse err msg to debug
nvmet-rdma: Fix NULL deref when SEND is completed with error
nvmet: fix inline bio check for passthru
nvmet: fix inline bio check for bdev-ns
nvme-multipath: fix double initialization of ANA state
kyber: fix out of bounds access when preempted
block: uapi: fix comment about block device ioctl


4bc20823 13-May-2021 Bart Van Assche <bvanassche@acm.org>

block/partitions/efi.c: Fix the efi_partition() kernel-doc header

Fix the following kernel-doc warning:

block/partitions/efi.c:685: warning: wrong kernel-doc identifier on line:
* efi_partition(struct parsed_partitions *state)

Cc: Alexander Viro <viro@math.psu.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210513171708.8391-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

630ef623 13-May-2021 Bart Van Assche <bvanassche@acm.org>

blk-mq: Swap two calls in blk_mq_exit_queue()

If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30e4 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

03f26d8f 13-May-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: plug request for shared sbitmap

In case of a shared sbitmap, requests are no longer held in the plug list
since commit 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per
tagset"). This makes request merging from the flush plug list & batching
submission impossible, and so causes a performance regression.

Yanhui reports a performance regression when running a sequential IO
test (libaio, 16 jobs, 8 depth for each job) in a VM, where the VM disk
is emulated with an image stored on xfs/megaraid_sas.

Fix the issue by restoring the original behavior of holding requests in
the plug list.

Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7ea96eef 12-May-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: avoid circular stable merges

BFQ may merge a new bfq_queue, stably, with the last bfq_queue
created. In particular, BFQ first waits a little bit for some I/O to
flow inside the new queue, say Q2, if this is needed to understand
whether it is better or worse to merge Q2 with the last queue created,
say Q1. This delayed stable merge is performed by assigning
bic->stable_merge_bfqq = Q1, for the bic associated with Q2.

Yet, while waiting for some I/O to flow in Q2, a non-stable queue
merge of Q2 with Q1 may happen, causing the bic previously associated
with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
be recycled as a non-shared bfq_queue. In that case, Q1 may then
happen to undergo a stable merge with the bfq_queue pointed by
bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
Q1. So Q1 would be merged with itself.

This commit fixes this error by intercepting this situation, and
canceling the schedule of the stable merge.

Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e9f4eee9 11-May-2021 Tejun Heo <tj@kernel.org>

blk-iocost: fix weight updates of inner active iocgs

When the weight of an active iocg is updated, weight_updated() is called
which in turn calls __propagate_weights() to update the active and inuse
weights so that the effective hierarchical weights are updated accordingly.

The current implementation is incorrect for inner active nodes. For an
active leaf iocg, inuse can be any value between 1 and active and the
difference represents how much the iocg is donating. When weight is updated,
as long as inuse is clamped between 1 and the new weight, we're alright and
this is what __propagate_weights() currently implements.

However, that's not how an active inner node's inuse is set. An inner node's
inuse is solely determined by the ratio between the sums of inuse's and
active's of its children - ie. they're results of propagating the leaves'
active and inuse weights upwards. __propagate_weights() incorrectly applies
the same clamping as for a leaf when an active inner node's weight is
updated. Consider a hierarchy which looks like the following with saturating
workloads in AA and BB.

          R
        /   \
       A     B
       |     |
      AA    BB

1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.

2. echo 200 > A/io.weight

3. __propagate_weights() update A's active to 200 and leave inuse at 100 as
it's already between 1 and the new active, making A:active=200,
A:inuse=100. As R's active_sum is updated along with A's active,
A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
hwi's remain unchanged at 0.5.

4. The weight of A is now twice that of B but AA and BB still have the same
hwi of 0.5 and thus are doing the same amount of IOs.

Fix it by making __propagate_weights() always calculate the inuse of an
active inner iocg based on the ratio of child_inuse_sum to child_active_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
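
A hedged sketch of the inner-node rule the fix applies; the helper name is
illustrative and the real logic lives inside __propagate_weights():

/* an active inner iocg's inuse follows its children instead of the leaf
 * clamp: inuse = active * child_inuse_sum / child_active_sum */
static u32 sketch_inner_inuse(u32 active, u64 child_inuse_sum,
                              u64 child_active_sum)
{
        if (!child_active_sum)
                return active;

        return max_t(u32, 1,
                     div64_u64((u64)active * child_inuse_sum,
                               child_active_sum));
}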

efed9a33 10-May-2021 Omar Sandoval <osandov@fb.com>

kyber: fix out of bounds access when preempted

__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
for the current CPU again and uses that to get the corresponding Kyber
context in the passed hctx. However, the thread may be preempted between
the two calls to blk_mq_get_ctx(), and the ctx returned the second time
may no longer correspond to the passed hctx. This "works" accidentally
most of the time, but it can cause us to read garbage if the second ctx
came from an hctx with more ctx's than the first one (i.e., if
ctx->index_hw[hctx->type] > hctx->nr_ctx).

This manifested as this UBSAN array index out of bounds error reported
by Jakub:

UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
index 13106 is out of range for type 'long unsigned int [128]'
Call Trace:
dump_stack+0xa4/0xe5
ubsan_epilogue+0x5/0x40
__ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
queued_spin_lock_slowpath+0x476/0x480
do_raw_spin_lock+0x1c2/0x1d0
kyber_bio_merge+0x112/0x180
blk_mq_submit_bio+0x1f5/0x1100
submit_bio_noacct+0x7b0/0x870
submit_bio+0xc2/0x3a0
btrfs_map_bio+0x4f0/0x9d0
btrfs_submit_data_bio+0x24e/0x310
submit_one_bio+0x7f/0xb0
submit_extent_page+0xc4/0x440
__extent_writepage_io+0x2b8/0x5e0
__extent_writepage+0x28d/0x6e0
extent_write_cache_pages+0x4d7/0x7a0
extent_writepages+0xa2/0x110
do_writepages+0x8f/0x180
__writeback_single_inode+0x99/0x7f0
writeback_sb_inodes+0x34e/0x790
__writeback_inodes_wb+0x9e/0x120
wb_writeback+0x4d2/0x660
wb_workfn+0x64d/0xa10
process_one_work+0x53a/0xa80
worker_thread+0x69/0x5b0
kthread+0x20b/0x240
ret_from_fork+0x1f/0x30

Only Kyber uses the hctx, so fix it by passing the request_queue to
->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
map the queues itself to avoid the mismatch.

Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

506c3079 09-May-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
"Turns out the bio max size change still has issues, so let's get it
reverted for 5.13-rc1. We'll shake out the issues there and defer it
to 5.14 instead"

* tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block:
Revert "bio: limit bio max size"


35c820e7 08-May-2021 Jens Axboe <axboe@kernel.dk>

Revert "bio: limit bio max size"

This reverts commit cd2c7545ae1beac3b6aae033c7f31193b3255946.

Alex reports that the commit causes corruption with LUKS on ext4. Revert
it for now so that this can be investigated properly.

Link: https://lore.kernel.org/linux-block/1620493841.bxdq8r5haw.none@localhost/
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd313968 07-May-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- dasd spelling fixes (Bhaskar)

- Limit bio max size on multi-page bvecs to the hardware limit, to
avoid overly large bio's (and hence latencies). Originally queued for
the merge window, but needed a fix and was dropped from the initial
pull (Changheun)

- NVMe pull request (Christoph):
- reset the bdev to ns head when failover (Daniel Wagner)
- remove unsupported command noise (Keith Busch)
- misc passthrough improvements (Kanchan Joshi)
- fix controller ioctl through ns_head (Minwoo Im)
- fix controller timeouts during reset (Tao Chiu)

- rnbd fixes/cleanups (Gioh, Md, Dima)

- Fix iov_iter re-expansion (yangerkun)

* tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block:
block: reexpand iov_iter after read/write
nvmet: remove unsupported command noise
nvme-multipath: reset bdev to ns head when failover
nvme-pci: fix controller reset hang when racing with nvme_timeout
nvme: move the fabrics queue ready check routines to core
nvme: avoid memset for passthrough requests
nvme: add nvme_get_ns helper
nvme: fix controller ioctl through ns_head
bio: limit bio max size
RDMA/rtrs: fix uninitialized symbol 'cnt'
s390: dasd: Mundane spelling fixes
block/rnbd: Remove all likely and unlikely
block/rnbd-clt: Check the return value of the function rtrs_clt_query
block/rnbd: Fix style issues
block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t


4ee60ec1 06-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

include: remove pagemap.h from blkdev.h

My UEK-derived config has 1030 files depending on pagemap.h before this
change. Afterwards, just 326 files need to be rebuilt when I touch
pagemap.h. I think blkdev.h is probably included too widely, but
untangling that dependency is harder and this solves my problem. x86
allmodconfig builds, but there may be implicit include problems on other
architectures.

Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm]
Acked-by: Jens Axboe <axboe@kernel.dk> [block]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi]
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cd2c7545 03-May-2021 Changheun Lee <nanich.lee@samsung.com>

bio: limit bio max size

bio size can grow up to 4GB when multi-page bvec is enabled, but
sometimes this leads to inefficient behavior. In the case of large-chunk
direct I/O - a 32MB chunk read in user space - all pages for the 32MB
would be merged into one bio structure if the pages' physical addresses
are contiguous. This delays submission until the merge is complete.
The bio max size should therefore be limited to a proper size.

When a 32MB chunk read with the direct I/O option comes from userspace,
the current kernel behavior in the do_direct_IO() loop is shown in the
timeline below.

| bio merge for 32MB. total 8,192 pages are merged.
| total elapsed time is over 2ms.
|------------------ ... ----------------------->|
| 8,192 pages merged a bio.
| at this time, first bio submit is done.
| 1 bio is split to 32 read request and issue.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 19ms elapsed to complete 32MB read done from device. |

If bio max size is limited with 1MB, behavior is changed below.

| bio merge for 1MB. 256 pages are merged for each bio.
| total 32 bio will be made.
| total elapsed time is over 2ms. it's same.
| but, first bio submit timing is fast. about 100us.
|--->|--->|--->|---> ... -->|--->|--->|--->|--->|
| 256 pages merged a bio.
| at this time, first bio submit is done.
| and 1 read request is issued for 1 bio.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 17ms elapsed to complete 32MB read done from device. |

As a result, read requests are issued earlier if the bio max size is
limited. With the current kernel behavior and multi-page bvecs, a
super-large bio can be created, which delays issuing the first I/O request.

Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dc26532a 29-Apr-2021 Johannes Weiner <hannes@cmpxchg.org>

cgroup: rstat: punt root-level optimization to individual controllers

Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level. This optimization is currently
implemented inside the generic rstat code, which doesn't track the root
cgroup and doesn't invoke the subsystem flush callbacks on it.

However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including at
the root level. In preparation for the memory controller switching to
rstat, move the optimization from rstat core to the controllers.

Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.

This is the case for the io controller and the cgroup base stats. In
their respective flush callbacks, check whether the parent is the root
cgroup, and if so, skip the unnecessary upward propagation.

The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root. The
queueing for a flush touches only per-cpu data, and only the first stat
change since a flush requires a (per-cpu) lock.

Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d72cd4ad 28-Apr-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This consists of the usual driver updates (ufs, target, tcmu,
smartpqi, lpfc, zfcp, qla2xxx, mpt3sas, pm80xx).

The major core change is using a sbitmap instead of an atomic for
queue tracking"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (412 commits)
scsi: target: tcm_fc: Fix a kernel-doc header
scsi: target: Shorten ALUA error messages
scsi: target: Fix two format specifiers
scsi: target: Compare explicitly with SAM_STAT_GOOD
scsi: sd: Introduce a new local variable in sd_check_events()
scsi: dc395x: Open-code status_byte(u8) calls
scsi: 53c700: Open-code status_byte(u8) calls
scsi: smartpqi: Remove unused functions
scsi: qla4xxx: Remove an unused function
scsi: myrs: Remove unused functions
scsi: myrb: Remove unused functions
scsi: mpt3sas: Fix two kernel-doc headers
scsi: fcoe: Suppress a compiler warning
scsi: libfc: Fix a format specifier
scsi: aacraid: Remove an unused function
scsi: core: Introduce enum scsi_disposition
scsi: core: Modify the scsi_send_eh_cmnd() return value for the SDEV_BLOCK case
scsi: core: Rename scsi_softirq_done() into scsi_complete()
scsi: core: Remove an incorrect comment
scsi: core: Make the scsi_alloc_sgtables() documentation more accurate
...


6c002921 28-Apr-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
"Pretty quiet round this time, which is nice. In detail:

- Series revamping bounce buffer support (Christoph)

- Dead code removal (Christoph, Bart)

- Partition iteration revamp, now using xarray (Christoph)

- Passthrough request scheduler improvements (Lin)

- Series of BFQ improvements (Paolo)

- Fix ioprio task iteration (Peter)

- Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max,
Nikolay)"

* tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits)
blk-iocost: don't ignore vrate_min on QD contention
blk-mq: Fix spurious debugfs directory creation during initialization
bfq/mq-deadline: remove redundant check for passthrough request
blk-mq: bypass IO scheduler's limit_depth for passthrough request
block: Remove an obsolete comment from sg_io()
block: move bio_list_copy_data to pktcdvd
block: remove zero_fill_bio_iter
block: add queue_to_disk() to get gendisk from request_queue
block: remove an incorrect check from blk_rq_append_bio
block: initialize ret in bdev_disk_changed
block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration
block: remove disk_part_iter
block: simplify diskstats_show
block: simplify show_partition
block: simplify printk_all_partitions
block: simplify partition_overlaps
block: simplify partition removal
block: take bd_mutex around delete_partitions in del_gendisk
block: refactor blk_drop_partitions
block: move more syncing and invalidation to delete_partition
...


57fa2369 27-Apr-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull CFI on arm64 support from Kees Cook:
"This builds on last cycle's LTO work, and allows the arm64 kernels to
be built with Clang's Control Flow Integrity feature. This feature has
happily lived in Android kernels for almost 3 years[1], so I'm excited
to have it ready for upstream.

The wide diffstat is mainly due to the treewide fixing of mismatched
list_sort prototypes. Other things in core kernel are to address
various CFI corner cases. The largest code portion is the CFI runtime
implementation itself (which will be shared by all architectures
implementing support for CFI). The arm64 pieces are Acked by arm64
maintainers rather than coming through the arm64 tree since carrying
this tree over there was going to be awkward.

CFI support for x86 is still under development, but is pretty close.
There are a handful of corner cases on x86 that need some improvements
to Clang and objtool, but otherwise works well.

Summary:

- Clean up list_sort prototypes (Sami Tolvanen)

- Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"

* tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arm64: allow CONFIG_CFI_CLANG to be selected
KVM: arm64: Disable CFI for nVHE
arm64: ftrace: use function_nocfi for ftrace_call
arm64: add __nocfi to __apply_alternatives
arm64: add __nocfi to functions that jump to a physical address
arm64: use function_nocfi with __pa_symbol
arm64: implement function_nocfi
psci: use function_nocfi for cpu_resume
lkdtm: use function_nocfi
treewide: Change list_sort to use const pointers
bpf: disable CFI in dispatcher functions
kallsyms: strip ThinLTO hashes from static functions
kthread: use WARN_ON_FUNCTION_MISMATCH
workqueue: use WARN_ON_FUNCTION_MISMATCH
module: ensure __cfi_check alignment
mm: add generic function_nocfi macro
cfi: add __cficanonical
add support for Clang CFI


f46ec84b 22-Apr-2021 Tejun Heo <tj@kernel.org>

blk-iocost: don't ignore vrate_min on QD contention

ioc_adjust_base_vrate() ignored vrate_min when rq_wait_pct indicates that
there is QD contention. The reasoning was that QD depletion always reliably
indicates device saturation and thus it's safe to override user specified
vrate_min. However, this sometimes leads to unnecessary throttling,
especially on really fast devices, because vrate adjustments have delays and
inertia. It also confuses users because the behavior violates the explicitly
specified configuration.

This patch drops the special case handling so that vrate_min is always
applied.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YIIo1HuyNmhDeiNx@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

68e6582e 21-Apr-2021 Christoph Hellwig <hch@lst.de>

block: return -EBUSY when there are open partitions in blkdev_reread_part

The switch to go through blkdev_get_by_dev means we now ignore the
return value from bdev_disk_changed in __blkdev_get. Add a manual
check to restore the old semantics.

Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part")
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210421160502.447418-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1e91e28e 07-Apr-2021 Saravanan D <saravanand@fb.com>

blk-mq: Fix spurious debugfs directory creation during initialization

blk_mq_debugfs_register_sched_hctx() called from
device_add_disk()->elevator_init_mq()->blk_mq_init_sched()
initialization sequence does not have relevant parent directory
setup and thus spuriously attempts "sched" directory creation
from root mount of debugfs for every hw queue detected on the
block device

dmesg
...
debugfs: Directory 'sched' with parent '/' already present!
debugfs: Directory 'sched' with parent '/' already present!
.
.
debugfs: Directory 'sched' with parent '/' already present!
...

The parent debugfs directory for hw queues only gets properly set up by
device_add_disk()->blk_register_queue()->blk_mq_debugfs_register()
->blk_mq_debugfs_register_hctx() later in the block device
initialization sequence.

A simple check for debugfs_dir has been added to thwart premature
debugfs directory/file creation attempts.

Signed-off-by: Saravanan D <saravanand@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7687b38a 14-Apr-2021 Lin Feng <linf@wangsu.com>

bfq/mq-deadline: remove redundant check for passthrough request

Since commit 01e99aeca39796003 'blk-mq: insert passthrough request into
hctx->dispatch directly', passthrough requests should not appear in the
IO scheduler any more, so the blk_rq_is_passthrough check in add-on IO
schedulers is redundant.

(Note: this patch passes a generic IO load test with HDDs under a SAS
controller and HDDs under an AHCI controller, but obviously does not cover
everything. It is not clear whether a passthrough request can still escape
into the IO scheduler from blk_mq_sched_insert_requests, which is used by
blk_mq_flush_plug_list and has lots of indirect callers.)

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d663f34 14-Apr-2021 Lin Feng <linf@wangsu.com>

blk-mq: bypass IO scheduler's limit_depth for passthrough request

Commit 01e99aeca39796003 ("blk-mq: insert passthrough request into
hctx->dispatch directly") gives high priority to passthrough requests and
bypasses the underlying IO scheduler. But as we allocate a tag for such a
request it still runs the io-scheduler's limit_depth callback, while what
we really want is to give the full sbitmap depth to such requests when
acquiring an available tag.
blktrace shows PC requests (dmraid -s -c -i) hitting bfq's limit_depth:
8,0 2 0 0.000000000 39952 1,0 m N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
8,0 2 1 0.000008134 39952 D R 4 [dmraid]
8,0 2 2 0.000021538 24 C R [0]
8,0 2 0 0.000035442 39952 1,0 m N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
8,0 2 3 0.000038813 39952 D R 24 [dmraid]
8,0 2 4 0.000044356 24 C R [0]

This patch introduces a new wrapper to make the code less ugly.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
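
A hedged sketch of the wrapper and the bypass it enables; the exact in-tree
form and call site may differ slightly:

static inline bool blk_op_is_passthrough(unsigned int op)
{
        op &= REQ_OP_MASK;
        return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
}

/* tag allocation path: only let the scheduler throttle normal requests */
static void sketch_limit_depth(struct elevator_queue *e,
                               struct blk_mq_alloc_data *data)
{
        if (e->type->ops.limit_depth &&
            !op_is_flush(data->cmd_flags) &&
            !blk_op_is_passthrough(data->cmd_flags))
                e->type->ops.limit_depth(data->cmd_flags, data);
}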

347b546d 12-Apr-2021 Bart Van Assche <bvanassche@acm.org>

block: Remove an obsolete comment from sg_io()

Commit b7819b925918 ("block: remove the blk_execute_rq return value")
changed the return type of blk_execute_rq() from int into void. That
change made a comment in sg_io() obsolete. Hence remove that comment.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20210413034142.23460-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5f03414d 12-Apr-2021 Christoph Hellwig <hch@lst.de>

block: move bio_list_copy_data to pktcdvd

bio_list_copy_data is only used by pktcdvd, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210412134658.2623190-2-hch@lst.de
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6f822e1b 12-Apr-2021 Christoph Hellwig <hch@lst.de>

block: remove zero_fill_bio_iter

zero_fill_bio_iter is only used to implement zero_fill_bio, so
remove the indirection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210412134658.2623190-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cbb749cf 09-Apr-2021 Christoph Hellwig <hch@lst.de>

block: remove an incorrect check from blk_rq_append_bio

blk_rq_append_bio is also used for the copy case, not just the map case,
so this debug check is not correct.

Fixes: 393bb12e0058 ("block: stop calling blk_queue_bounce for passthrough requests")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210409150447.1977410-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f0f586b 08-Apr-2021 Sami Tolvanen <samitolvanen@google.com>

treewide: Change list_sort to use const pointers

list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com

40c7fd3f 08-Apr-2021 Peter Zijlstra <peterz@infradead.org>

block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration

do_each_pid_thread() { } while_each_pid_thread() is a double loop and
thus break doesn't work as expected. Also, it should be used under
tasklist_lock because otherwise we can race against change_pid() for
PGID/SID.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/YG7Q5C4Rb5dx5GFx@hirez.programming.kicks-ass.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
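
A hedged sketch of the corrected iteration shape; illustrative only, the real
code is the IOPRIO_WHO_PGRP branch of sys_ioprio_set():

static int sketch_ioprio_set_pgrp(struct pid *pgrp, int ioprio)
{
        struct task_struct *p;
        int ret = 0;

        read_lock(&tasklist_lock);      /* guard against concurrent change_pid() */
        do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
                ret = set_task_ioprio(p, ioprio);
                if (ret) {
                        /* "break" would only leave the inner of the two loops */
                        read_unlock(&tasklist_lock);
                        return ret;
                }
        } while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
        read_unlock(&tasklist_lock);

        return ret;
}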

3212135a 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: remove disk_part_iter

Just open code the xa_for_each in the remaining user.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7fae67cc 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: simplify diskstats_show

Just use xa_for_each to iterate over the partitions as there is no need
to grab a reference to each partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
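
A minimal sketch of the xa_for_each style used by this and the sibling
conversions below; illustrative only:

static void sketch_print_partitions(struct gendisk *disk)
{
        struct block_device *part;
        unsigned long idx;

        rcu_read_lock();
        xa_for_each(&disk->part_tbl, idx, part) {
                if (!bdev_nr_sectors(part))
                        continue;       /* skip zero-sized partitions */
                pr_info("%pg: %llu sectors\n", part,
                        (unsigned long long)bdev_nr_sectors(part));
        }
        rcu_read_unlock();
}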

ecc75a98 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: simplify show_partition

Just use xa_for_each to iterate over the partitions as there is no need
to grab a reference to each partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e559f58d 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: simplify printk_all_partitions

Just use xa_for_each to iterate over the partitions as there is no need
to grab a reference to each partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e3069123 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: simplify partition_overlaps

Just use xa_for_each to iterate over the partitions as there is no need
to grab a reference to each partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6c4541a8 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: simplify partition removal

Always look up the first available entry instead of the complicated
stateful traversal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c76f48eb 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: take bd_mutex around delete_partitions in del_gendisk

There is nothing preventing an ioctl from trying to delete a partition
concurrently with del_gendisk, so take open_mutex to serialize against
that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3c4a43d 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: refactor blk_drop_partitions

Move the busy check and disk-wide sync into the only caller, so that
the remainder can be shared with del_gendisk. Also pass the gendisk
instead of the bdev as that is all that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

473338be 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: move more syncing and invalidation to delete_partition

Move the calls to fsync_bdev and __invalidate_device from del_gendisk to
delete_partition. For the other two callers, which check that there are
no openers for the deleted partition(s), the callouts are a no-op as no
file system can be mounted, but this keeps all the cleanup in one
place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

45611837 06-Apr-2021 Christoph Hellwig <hch@lst.de>

block: remove invalidate_partition

invalidate_partition has two callers, one of which already performs
the remove_inode_hash just after the call. Just open code the
function in the two callsites.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b896fa85 06-Apr-2021 Christoph Hellwig <hch@lst.de>

dasd: use bdev_disk_changed instead of blk_drop_partitions

Use the more general interface - the behavior is the same except
that now a change uevent is sent, which is the right thing to do
when the device becomes unusable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20210406062303.811835-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

540ad3f3 06-Apr-2021 Bart Van Assche <bvanassche@acm.org>

blk-zoned: Remove the definition of blk_zone_start()

Commit e76239a3748c ("block: add a report_zones method") removed the last
blk_zone_start() call. Hence also remove the definition of this function.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406200820.15180-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

580dca81 05-Apr-2021 Ming Lei <ming.lei@redhat.com>

blk-mq: set default elevator as deadline in case of hctx shared tagset

Yanhui found that write performance is degraded a lot after applying an
hctx shared tagset on one test machine with megaraid_sas. It turns out
this is caused by the 'none' scheduler, which becomes the default
elevator because of the hctx shared tagset patchset.

Given more SCSI HBAs will apply hctx shared tagsets, a similar
performance issue exists for them too.

So keep the previous behavior by still using mq-deadline as the default
for queues which apply an hctx shared tagset, just like before.

Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Reported-by: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210406031933.767228-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
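
A hedged sketch of the default-elevator selection the change describes;
illustrative only, the in-tree function is elevator_get_default():

static struct elevator_type *sketch_elevator_get_default(struct request_queue *q)
{
        /* true multi-queue devices keep "none", but queues sharing one
         * tagset across hctxs behave like a single-queue device and get
         * mq-deadline by default, as before */
        if (q->nr_hw_queues != 1 &&
            !blk_mq_is_sbitmap_shared(q->tag_set->flags))
                return NULL;

        return elevator_get(q, "mq-deadline", false);
}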

393bb12e 31-Mar-2021 Christoph Hellwig <hch@lst.de>

block: stop calling blk_queue_bounce for passthrough requests

Instead of overloading the passthrough fast path with the deprecated
block layer bounce buffering let the users that combine an old
undermaintained driver with a highmem system pay the price by always
falling back to copies in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9bb33f24 31-Mar-2021 Christoph Hellwig <hch@lst.de>

block: refactor the bounce buffering code

Get rid of all the PFN arithmetics and just use an enum for the two
remaining options, and use PageHighMem for the actual bounce decision.

Add a fast path to entirely avoid the call for the common case of a queue
not using the legacy bouncing code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ce288e05 31-Mar-2021 Christoph Hellwig <hch@lst.de>

block: remove BLK_BOUNCE_ISA support

Remove the BLK_BOUNCE_ISA support now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

39aa56db 11-Mar-2021 Nikolay Borisov <nborisov@suse.com>

blk-mq: Always use blk_mq_is_sbitmap_shared

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20210311081713.2763171-1-nborisov@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28af7428 05-Apr-2021 Max Gurtovoy <mgurtovoy@nvidia.com>

block: add sysfs entry for virt boundary mask

This entry will expose the bio vector alignment mask for a specific
block device.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210405132012.12504-1-mgurtovoy@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
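
A minimal sketch of the kind of read-only queue attribute this adds;
illustrative only, blk-sysfs.c generates most of this through macros:

static ssize_t queue_virt_boundary_mask_show(struct request_queue *q, char *page)
{
        return queue_var_show(q->limits.virt_boundary_mask, page);
}

static struct queue_sysfs_entry queue_virt_boundary_mask_entry = {
        .attr = { .name = "virt_boundary_mask", .mode = 0444 },
        .show = queue_virt_boundary_mask_show,
};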

f06c6096 02-Apr-2021 Christoph Hellwig <hch@lst.de>

block: remove the unused RQF_ALLOCED flag

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3edf5346 31-Mar-2021 Yufen Yu <yuyufen@huawei.com>

block: only update parent bi_status when bio fail

For multiple split bios, if one of the bios fails, the whole IO should
return an error to the application. But we found there is a race
between bio_integrity_verify_fn and bio completion, which returns IO
success to the application after one of the bios has failed. The race is
as follows:

split bio(READ)                          kworker

nvme_complete_rq
blk_update_request //split error=0
bio_endio
bio_integrity_endio
queue_work(kintegrityd_wq, &bip->bip_work);

                                         bio_integrity_verify_fn
                                         bio_endio //split bio
                                         __bio_chain_endio
                                         if (!parent->bi_status)

                                             <interrupt entry>
                                             nvme_irq
                                             blk_update_request //parent error=7
                                             req_bio_endio
                                             bio->bi_status = 7 //parent bio
                                             <interrupt exit>

                                         parent->bi_status = 0
                                         parent->bi_end_io() // return bi_status=0

The bio has been split in two: split and parent. When the split bio
completes, it depends on the kworker to do the endio, while
bio_integrity_verify_fn has been interrupted by the parent bio's
completion irq handler. Then, the parent bio->bi_status, which has
been set in the irq handler, is overwritten by the kworker.

In fact, even without the above race, we also need to consider the
concurrency between multiple split bio completions updating the same
parent bi_status. Normally, multiple split bios will be issued to the
same hctx and complete from the same irq vector. But if we have updated
the queue map between multiple split bios, these bios may complete on
different hw queues and different irq vectors. Then the concurrent
updates of the parent bi_status may cause the final status to be wrong.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
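
A hedged sketch of the direction of the fix in __bio_chain_endio(): only
propagate the child's status when the child actually failed, instead of
unconditionally copying it up (function name is illustrative):

static struct bio *sketch_bio_chain_endio(struct bio *bio)
{
        struct bio *parent = bio->bi_private;

        /* don't let a successful split bio overwrite an error the parent
         * already picked up from another completion path */
        if (bio->bi_status && !parent->bi_status)
                parent->bi_status = bio->bi_status;
        bio_put(bio);
        return parent;
}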

e82fc785 27-Mar-2021 Ming Lei <ming.lei@redhat.com>

block: don't create too many partitions

Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") drops the
check on the max supported number of partitions, and allows partitions
with bigger partition numbers to be added. However, ->bd_partno is defined
as u8, so the partition index in the xarray table may not match
->bd_partno. Then delete_partition() may delete an unmatched partition,
causing a use-after-free.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: syzbot+8fede7e30c7cee0de139@syzkaller.appspotmail.com
Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
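
A minimal sketch of the kind of bound check the fix reinstates; the helper
name and its placement inside add_partition() are illustrative:

static int sketch_check_partno(int partno)
{
        /* ->bd_partno is a u8, so a partition number at or beyond
         * DISK_MAX_PARTS could alias another xarray slot and later free
         * the wrong partition */
        if (partno < 0 || partno >= DISK_MAX_PARTS)
                return -EINVAL;
        return 0;
}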

430a67f9 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: merge bursts of newly-created queues

Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.

To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, basing on the following two facts.

First, all the I/O flows of the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency. In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.

Second, the above fact holds regardless of the number of flows.

Putting these two facts together, this commit stably merges the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these I/O flows, regardless of how many processes are
involved.

To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1

Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40%, with
a quadcore. Since BFQ's execution time amounts to ~50% of the total
per-request processing time, the above throughput boost implies that
BFQ's overhead is reduced by more than 50%.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
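
A rough pseudocode sketch of the stable-merge decision described above; the
function name, the creation_jiffies field and the 20 ms threshold are
illustrative, not the exact ones used by bfq:

    static bool bfq_should_stably_merge(struct bfq_queue *last_created,
                                        struct bfq_queue *new_bfqq)
    {
        /* Q1 = last_created, Q2 = new_bfqq */
        return last_created &&
               time_before(jiffies, last_created->creation_jiffies +
                                    msecs_to_jiffies(20)) &&
               new_bfqq->ioprio == last_created->ioprio &&
               new_bfqq->entity.parent == last_created->entity.parent;
    }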

85686d0d 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: keep shared queues out of the waker mechanism

Shared queues are likely to receive I/O at a high rate. This may
deceptively let them be considered as wakers of other queues. But a
false waker will unjustly steal bandwidth from its supposedly woken
queue. So also considering shared queues in the waking mechanism may
cause more control trouble than throughput benefit. This commit
keeps shared queues out of the waker-detection mechanism.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8c544770 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: fix weight-raising resume with !low_latency

When the low_latency heuristic is off, bfq_queues must not start to be
weight-raised. Unfortunately, by mistake, this may happen when the
state of a previously weight-raised bfq_queue is resumed after a queue
split. This commit fixes this error.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8ef3fc3a 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: make shared queues inherit wakers

Consider a bfq_queue bfqq that is about to be merged with another
bfq_queue new_bfqq. The processes associated with bfqq are cooperators
of the processes associated with new_bfqq. So, if bfqq has a waker,
then it is reasonable (and beneficial for throughput) to assume that
all these processes will be happy to let bfqq's waker freely inject
I/O when they have no I/O. So this commit makes new_bfqq inherit
bfqq's waker.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7cc4ffc5 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: put reqs of waker and woken in dispatch list

Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
this happens, the only active bfq_queues are bfqq and either its waker
bfq_queue or one of its woken bfq_queues, then there is no point in
queueing this new I/O request in bfqq for service. In fact, the
in-service queue and bfqq agree on serving this new I/O request as
soon as possible. So this commit puts this new I/O request directly
into the dispatch list.

Tested-by: Jan Kara <jack@suse.cz>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2ec5a5c4 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: always inject I/O of queues blocked by wakers

Suppose that I/O dispatch is plugged, to wait for new I/O for the
in-service bfq-queue, say bfqq. Suppose then that there is a further
bfq_queue woken by bfqq, and that this woken queue has pending I/O. A
woken queue does not steal bandwidth from bfqq, because it soon runs
out of I/O if bfqq is not served. So there is virtually no risk
of loss of bandwidth for bfqq if this woken queue has I/O dispatched
while bfqq is waiting for new I/O. In contrast, this extra I/O
injection boosts throughput. This commit performs this extra
injection.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9cf1adc6 19-Mar-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

blk-mq: Sentence reconstruct for better readability

Sentence reconstruction for better readability.

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7de55b7d 24-Mar-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

block: support zone append bvecs

Christoph reported some time ago that, with zoned btrfs, we'll likely
trigger the WARN_ON_ONCE() in bio_iov_iter_get_pages() checking that
we're not submitting a bvec with REQ_OP_ZONE_APPEND, but I couldn't
reproduce it back then.

Now Naohiro was able to trigger the bug as well with xfstests generic/095
on a zoned btrfs.

There is nothing that prevents bvec submissions via REQ_OP_ZONE_APPEND if
the hardware's zone append limit is met.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/10bd414d9326c90cd69029077db63b363854eee5.1616600835.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a958937f 11-Feb-2021 David Jeffery <djeffery@redhat.com>

block: recalculate segment count for multi-segment discards correctly

When a stacked block device inserts a request into another block device
using blk_insert_cloned_request, the request's nr_phys_segments field gets
recalculated by a call to blk_recalc_rq_segments in
blk_cloned_rq_check_limits. But blk_recalc_rq_segments does not know how to
handle multi-segment discards. For disk types which can handle
multi-segment discards like nvme, this results in discard requests which
claim a single segment when it should report several, triggering a warning
in nvme and causing nvme to fail the discard from the invalid state.

WARNING: CPU: 5 PID: 191 at drivers/nvme/host/core.c:700 nvme_setup_discard+0x170/0x1e0 [nvme_core]
...
nvme_setup_cmd+0x217/0x270 [nvme_core]
nvme_loop_queue_rq+0x51/0x1b0 [nvme_loop]
__blk_mq_try_issue_directly+0xe7/0x1b0
blk_mq_request_issue_directly+0x41/0x70
? blk_account_io_start+0x40/0x50
dm_mq_queue_rq+0x200/0x3e0
blk_mq_dispatch_rq_list+0x10a/0x7d0
? __sbitmap_queue_get+0x25/0x90
? elv_rb_del+0x1f/0x30
? deadline_remove_request+0x55/0xb0
? dd_dispatch_request+0x181/0x210
__blk_mq_do_dispatch_sched+0x144/0x290
? bio_attempt_discard_merge+0x134/0x1f0
__blk_mq_sched_dispatch_requests+0x129/0x180
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x47/0xe0
__blk_mq_delay_run_hw_queue+0x15b/0x170
blk_mq_sched_insert_requests+0x68/0xe0
blk_mq_flush_plug_list+0xf0/0x170
blk_finish_plug+0x36/0x50
xlog_cil_committed+0x19f/0x290 [xfs]
xlog_cil_process_committed+0x57/0x80 [xfs]
xlog_state_do_callback+0x1e0/0x2a0 [xfs]
xlog_ioend_work+0x2f/0x80 [xfs]
process_one_work+0x1b6/0x350
worker_thread+0x53/0x3e0
? process_one_work+0x350/0x350
kthread+0x11b/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30

This patch fixes blk_recalc_rq_segments to be aware of devices which can
have multi-segment discards. It calculates the correct discard segment
count by counting the number of bios, as each discard bio is considered
its own segment.

Fixes: 1e739730c5b9 ("block: optionally merge discontiguous discard bios into a single request")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Link: https://lore.kernel.org/r/20210211143807.GA115624@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
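
A minimal sketch of the segment counting the fix describes, written here as
a stand-alone helper for clarity (the real change lives inside
blk_recalc_rq_segments()):

    static unsigned int rq_discard_segments(struct request *rq)
    {
        unsigned int nr_phys_segs = 0;
        struct bio *bio = rq->bio;

        /* devices limited to one discard segment keep reporting one */
        if (queue_max_discard_segments(rq->q) <= 1)
            return 1;

        /* each discard bio in the request counts as its own segment */
        for_each_bio(bio)
            nr_phys_segs++;
        return nr_phys_segs;
    }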

e5113505 11-Mar-2021 Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

block: Discard page cache of zone reset target range

When zone reset ioctl and data read race for a same zone on zoned block
devices, the data read leaves stale page cache even though the zone
reset ioctl zero clears all the zone data on the device. To avoid
non-zero data read from the stale page cache after zone reset, discard
page cache of reset target zones in blkdev_zone_mgmt_ioctl(). Introduce
the helper function blkdev_truncate_zone_range() to discard the page
cache. Ensure the page cache discarded by calling the helper function
before and after zone reset in same manner as fallocate does.

This patch can be applied back to the stable kernel version v5.10.y.
Rework is needed for older stable kernels.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Cc: <stable@vger.kernel.org> # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210311072546.678999-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
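
A sketch of the helper described above, assuming the usual sector-to-byte
conversion of the ioctl's zone range; the exact implementation may differ
in detail:

    static int blkdev_truncate_zone_range(struct block_device *bdev,
                                          fmode_t mode,
                                          const struct blk_zone_range *zrange)
    {
        loff_t start = zrange->sector << SECTOR_SHIFT;
        loff_t end = ((zrange->sector + zrange->nr_sectors) << SECTOR_SHIFT) - 1;

        return truncate_bdev_range(bdev, mode, start, end);
    }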

9ec49144 11-Mar-2021 Daniel Wagner <dwagner@suse.de>

block: Suppress uevent for hidden device when removed

register_disk() suppress uevents for devices with the GENHD_FL_HIDDEN
but enables uevents at the end again in order to announce disk after
possible partitions are created.

When the device is removed the uevents are still on and user land sees
'remove' messages for devices which were never 'add'ed to the system.

KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)

Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the
uevents at all.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8affc03 10-Mar-2021 Christoph Hellwig <hch@lst.de>

block: rename BIO_MAX_PAGES to BIO_MAX_VECS

Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

faa44c69 10-Mar-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: Fix REQ_OP_ZONE_RESET_ALL handling

Similarly to a single zone reset operation (REQ_OP_ZONE_RESET), execute
REQ_OP_ZONE_RESET_ALL operations with REQ_SYNC set.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f44657d 05-Mar-2021 Xunlei Pang <xlpang@linux.alibaba.com>

blk-cgroup: Fix the recursive blkg rwstat

The current blkio.throttle.io_service_bytes_recursive doesn't
work correctly.

As an example, for the following blkcg hierarchy
(made 1GB READ in test1, 512MB READ in test2):

            test
           /    \
       test1    test2

$ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive
8:0 Read 1073684480
$ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive
8:0 Read 537448448
$ head -n 1 test/blkio.throttle.io_service_bytes_recursive
8:0 Read 537448448

Clearly, above data of "test" reflects "test2" not "test1"+"test2".

Do the correct summary in blkg_rwstat_recursive_sum().

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2a5a24aa 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: blk-mq: Return budget token from .get_budget callback

SCSI uses a global atomic variable to track queue depth for each
LUN/request queue.

This doesn't scale well when there are lots of CPU cores and the disk is
very fast. It has been observed that IOPS is affected a lot by tracking
queue depth via sdev->device_busy in the I/O path.

Return budget token from .get_budget callback. The budget token can be
passed to driver so that we can replace the atomic variable with
sbitmap_queue and alleviate the scaling problems that way.

Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

c548e62b 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Move allocation hint into sbitmap

The allocation hint should have belonged to sbitmap. Also, when sbitmap's
depth is high and there is no need to use multiple wakeup queues, users can
benefit from the percpu allocation hint too.

Move allocation hint into sbitmap, then SCSI device queue can benefit from
allocation hint when converting to plain sbitmap.

Convert vhost/scsi.c to use sbitmap allocation with percpu alloc hint. This
is more efficient than the previous approach.

Link: https://lore.kernel.org/r/20210122023317.687987-5-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

efe1f3a1 21-Jan-2021 Ming Lei <ming.lei@redhat.com>

scsi: sbitmap: Maintain allocation round_robin in sbitmap

Currently the allocation round_robin info is maintained by sbitmap_queue.

However, bit allocation really belongs to sbitmap. Move it there.

Link: https://lore.kernel.org/r/20210122023317.687987-3-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

4168a8d2 22-Feb-2021 Joseph Qi <joseph.qi@linux.alibaba.com>

block/bfq: update comments and default value in docs for fifo_expire

Correct the comments, since bfq_fifo_expire[0] is for async requests,
while bfq_fifo_expire[1] is for sync requests.
Also update the docs: according to the source code, the default
fifo_expire_async is 250ms, and fifo_expire_sync is 125ms.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5218e12e 01-Mar-2021 Jean Delvare <jdelvare@suse.de>

block: Drop leftover references to RQF_SORTED

Commit a1ce35fa49852db60fc6e268038530be533c5b15 ("block: remove dead
elevator code") removed all users of RQF_SORTED. However it is still
defined, and there is one reference left to it (which in effect is
dead code). Clear it all up.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0f472277 28-Feb-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: revert "block: fix bd_size_lock use"

With the removal of the skd driver, using IRQ safe locking of a bdev
bd_size_lock spinlock to protect the bdev inode size is not necessary
anymore as there is no other known driver using this lock under an IRQ
disabled context (e.g. calling set_capacity() with IRQ disabled).
Revert commit 0fe37724f8e7 ("block: fix bd_size_lock use") which
introduced the IRQ safe change.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3ab6608e 28-Feb-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
"A few stragglers (and one due to me missing it originally), and fixes
for changes in this merge window mostly. In particular:

- blktrace cleanups (Chaitanya, Greg)

- Kill dead blk_pm_* functions (Bart)

- Fixes for the bio alloc changes (Christoph)

- Fix for the partition changes (Christoph, Ming)

- Fix for turning off iopoll with polled IO inflight (Jeffle)

- nbd disconnect fix (Josef)

- loop fsync error fix (Mauricio)

- kyber update depth fix (Yang)

- max_sectors alignment fix (Mikulas)

- Add bio_max_segs helper (Matthew)"

* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
block: Add bio_max_segs
blktrace: fix documentation for blk_fill_rw()
block: memory allocations in bounce_clone_bio must not fail
block: remove the gfp_mask argument to bounce_clone_bio
block: fix bounce_clone_bio for passthrough bios
block-crypto-fallback: use a bio_set for splitting bios
block: fix logging on capacity change
blk-settings: align max_sectors on "logical_block_size" boundary
block: reopen the device in blkdev_reread_part
block: don't skip empty device in in disk_uevent
blktrace: remove debugfs file dentries from struct blk_trace
nbd: handle device refs for DESTROY_ON_DISCONNECT properly
kyber: introduce kyber_depth_updated()
loop: fix I/O error on fsync() in detached loop devices
block: fix potential IO hang when turning off io_poll
block: get rid of the trace rq insert wrapper
blktrace: fix blk_rq_merge documentation
blktrace: fix blk_rq_issue documentation
blktrace: add blk_fill_rwbs documentation comment
block: remove superfluous param in blk_fill_rwbs()
...


5f7136db 28-Jan-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

block: Add bio_max_segs

It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
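
A sketch of what the new helper amounts to, assuming it simply clamps the
caller's segment count to the (now unsigned) BIO_MAX_PAGES limit:

    static inline unsigned short bio_max_segs(unsigned int nr_segs)
    {
        return min(nr_segs, BIO_MAX_PAGES);
    }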

47dc096a 24-Feb-2021 Christoph Hellwig <hch@lst.de>

block: memory allocations in bounce_clone_bio must not fail

The caller can't cope with a failure from bounce_clone_bio, so
use __GFP_NOFAIL for the passthrough case. bio_alloc_bioset already
won't fail due to the use of mempools.

And yes, we need to get rid of this block layer bouncing code entirely
sooner or later..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ebfe4183 24-Feb-2021 Christoph Hellwig <hch@lst.de>

block: remove the gfp_mask argument to bounce_clone_bio

The only caller always passes GFP_NOIO.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b90994c6 24-Feb-2021 Christoph Hellwig <hch@lst.de>

block: fix bounce_clone_bio for passthrough bios

Now that bio_alloc_bioset does not fall back to kmalloc for a NULL
bio_set, handle that case explicitly and simplify the calling
conventions.

Based on an earlier patch from Chaitanya Kulkarni.

Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset")
Reported-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5407334c 24-Feb-2021 Christoph Hellwig <hch@lst.de>

block-crypto-fallback: use a bio_set for splitting bios

bio_split with a NULL bs argument used to fall back to kmalloc'ing the
bio, which does not guarantee forward progress and could lead to deadlocks.
Now that the overloading of the NULL bs argument to bio_alloc_bioset
has been removed, it crashes instead. Fix all that by using a specially
crafted bioset.

Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

452c0bf8 23-Feb-2021 Ming Lei <ming.lei@redhat.com>

block: fix logging on capacity change

The local variable 'capacity' stores the previous disk capacity, and
the 'size' variable records the latest disk capacity, so swap them to
fix the logging on capacity change.

Cc: Christoph Hellwig <hch@lst.de>
Fixes: a782483cc1f8 ("block: remove the nr_sects field in struct hd_struct")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

97f433c3 23-Feb-2021 Mikulas Patocka <mpatocka@redhat.com>

blk-settings: align max_sectors on "logical_block_size" boundary

We get I/O errors when we run md-raid1 on the top of dm-integrity on the
top of ramdisk.
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb

The ramdisk device has logical_block_size 512 and max_sectors 255. The
dm-integrity device uses logical_block_size 4096 and it doesn't affect the
"max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have
a device with max_sectors not aligned on logical_block_size.

The md-raid device sees that the underlying leg has max_sectors 255 and it
will split the bios on 255-sector boundary, making the bios unaligned on
logical_block_size.

In order to fix the bug, we round down max_sectors to logical_block_size.

Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
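
A sketch of the rounding this implies when the max_sectors limits are
stacked; the helper name here is illustrative (max_sectors is in 512-byte
units, the logical block size in bytes, hence the shift):

    static unsigned int align_max_sectors(unsigned int max_sectors,
                                          unsigned int logical_block_size)
    {
        return round_down(max_sectors, logical_block_size >> SECTOR_SHIFT);
    }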

4601b4b1 23-Feb-2021 Christoph Hellwig <hch@lst.de>

block: reopen the device in blkdev_reread_part

Historically the BLKRRPART ioctls called into the now defunct ->revalidate
method, which caused the sd driver to check if any media is present.
When the ->revalidate method was removed this revalidation was lost,
leading to lots of I/O errors when using the eject command. Fix this by
reopening the device to rescan the partitions, and thus calling the
revalidation logic in the sd driver.

Fixes: 471bd0af544b ("sd: use bdev_check_media_change")
Reported-by: Tom Seewald <tseewald@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tom Seewald <tseewald@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

75ab6afa 23-Feb-2021 Christoph Hellwig <hch@lst.de>

block: don't skip empty device in in disk_uevent

Restore the previous behavior by using the correct flag for the whole device
("part0").

Fixes: 99dfc43ecbf6 ("block: use ->bi_bdev for bio based I/O accounting")
Reported-by: John Stultz <john.stultz@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ffa772cf 05-Feb-2021 Yang Yang <yang.yang@vivo.com>

kyber: introduce kyber_depth_updated()

A hang occurs when the user changes the scheduler queue depth by writing
to the 'nr_requests' sysfs file of that device.

The details of the environment in which we found the problem are as follows:
an eMMC block device
total driver tags: 16
default queue_depth: 32
kqd->async_depth initialized in kyber_init_sched() with queue_depth=32

Then we change queue_depth to 256 by writing to the 'nr_requests' sysfs
file. But kqd->async_depth is not updated after queue_depth changes.
Now the value of the async depth is too small for queue_depth=256, which
may cause a hang.

This patch introduces kyber_depth_updated(), so that kyber can update
async depth when queue depth changes.

Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
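
An illustrative sketch of the new callback; the exact field accesses are
assumptions, the point being that async_depth is recomputed from the
current scheduler tag depth whenever nr_requests changes:

    static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx)
    {
        struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
        struct sbitmap_queue *tags = hctx->sched_tags->bitmap_tags;

        kqd->async_depth = (1U << tags->sb.shift) * KYBER_ASYNC_PERCENT / 100U;
        sbitmap_queue_min_shallow_depth(tags, kqd->async_depth);
    }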

ae42c317 22-Feb-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.12/block-ipi-2021-02-21' of git://git.kernel.dk/linux-block

Pull block IPI updates from Jens Axboe:
"Avoid IRQ locking for the block IPI handling (Sebastian Andrzej
Siewior)"

* tag 'for-5.12/block-ipi-2021-02-21' of git://git.kernel.dk/linux-block:
blk-mq: Use llist_head for blk_cpu_done
blk-mq: Always complete remote completions requests in softirq
smp: Process pending softirqs in flush_smp_call_function_from_idle()


325b7640 22-Feb-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- Fix DM integrity's HMAC support to provide enhanced security of
internal_hash and journal_mac capabilities.

- Various DM writecache fixes to address performance, fix table output
to match what was provided at table creation, fix writing beyond end
of device when shrinking underlying data device, and a couple other
small cleanups.

- Add DM crypt support for using trusted keys.

- Fix deadlock when swapping to DM crypt device by throttling number of
in-flight REQ_SWAP bios. Implemented in DM core so that other
bio-based targets can opt-in by setting ti->limit_swap_bios.

- Fix various inverted logic bugs in the .iterate_devices callout
functions that are used to assess if specific feature or capability
is supported across all devices being combined/stacked by DM.

- Fix DM era target bugs that exposed users to lost writes or memory
leaks.

- Add DM core support for passing through inline crypto support of
underlying devices. Includes block/keyslot-manager changes that
enable extending this support to DM.

- Various small fixes and cleanups (spelling fixes, front padding
calculation cleanup, cleanup conditional zoned support in targets,
etc).

* tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (31 commits)
dm: fix deadlock when swapping to encrypted device
dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED
dm: set DM_TARGET_PASSES_CRYPTO feature for some targets
dm: support key eviction from keyslot managers of underlying devices
dm: add support for passing through inline crypto support
block/keyslot-manager: Introduce functions for device mapper support
block/keyslot-manager: Introduce passthrough keyslot manager
dm era: only resize metadata in preresume
dm era: Use correct value size in equality function of writeset tree
dm era: Fix bitset memory leaks
dm era: Verify the data block size hasn't changed
dm era: Reinitialize bitset cache before digesting a new writeset
dm era: Update in-core bitset after committing the metadata
dm era: Recover committed writeset after crash
dm writecache: use bdev_nr_sectors() instead of open-coded equivalent
dm writecache: fix writing beyond end of underlying device when shrinking
dm table: remove needless request_queue NULL pointer checks
dm table: fix zoned iterate_devices based device capability checks
dm table: fix DAX iterate_devices based device capability checks
dm table: fix iterate_devices based device capability checks
...


19472481 22-Feb-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mmc-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC updates from Ulf Hansson:
"MMC core:
- Add support for eMMC inline encryption
- Add a helper function to parse DT properties for clock phases
- Some improvements and cleanups for the mmc_test module

MMC host:
- android-goldfish: Remove driver
- cqhci: Add support for eMMC inline encryption
- dw_mmc-zx: Remove driver
- meson-gx: Extend support for scatter-gather to allow SD_IO_RW_EXTENDED
- mmci: Add support for probing bus voltage level translator
- mtk-sd: Address race condition for request timeouts
- sdhci_am654: Add Support for the variant on TI's AM64 SoC
- sdhci-esdhc-imx: Prevent kernel panic at ->remove()
- sdhci-iproc: Add ACPI bindings for the RPi to enable SD and WiFi on RPi4
- sdhci-msm: Add Inline Crypto Engine support
- sdhci-msm: Use actual_clock to improve timeout calculations
- sdhci-of-aspeed: Add Andrew Jeffery as maintainer
- sdhci-of-aspeed: Extend clock support for the AST2600 variant
- sdhci-pci-gli: Increase idle period for low power state for GL9763E
- sdhci-pci-o2micro: Make tuning for SDR104 HW more robust
- sdhci-sirf: Remove driver
- sdhci-xenon: Add support for the AP807 variant
- sunxi-mmc: Add support for the A100 variant
- sunxi-mmc: Ensure host is suspended during system sleep
- tmio: Add detection of data timeout errors
- tmio/renesas_sdhi: Extend support for retuning
- renesas_sdhi_internal_dmac: Add support for the ->pre|post_req() ops"

* tag 'mmc-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (86 commits)
mmc: sdhci-esdhc-imx: fix kernel panic when remove module
mmc: host: Retire MMC_GOLDFISH
mmc: cb710: Use new tasklet API
mmc: sdhci-pci-o2micro: Bug fix for SDR104 HW tuning failure
mmc: mmc_test: use erase_arg for mmc_erase command
mmc: wbsd: Use new tasklet API
mmc: via-sdmmc: Use new tasklet API
mmc: uniphier-sd: Use new tasklet API
mmc: tifm_sd: Use new tasklet API
mmc: s3cmci: Use new tasklet API
mmc: omap: Use new tasklet API
mmc: dw_mmc: Use new tasklet API
mmc: au1xmmc: Use new tasklet API
mmc: atmel-mci: Use new tasklet API
mmc: cavium: Replace spin_lock_irqsave with spin_lock in hard IRQ
mmc: queue: Remove unused define
mmc: core: Drop redundant bouncesz from struct mmc_card
mmc: core: Drop redundant member in struct mmc host
mmc: core: Use host instead of card argument to mmc_spi_send_csd()
mmc: core: Exclude unnecessary header file
...


6b09b4d3 21-Feb-2021 Jeffle Xu <jefflexu@linux.alibaba.com>

block: fix potential IO hang when turning off io_poll

The QUEUE_FLAG_POLL flag will be cleared when turning off 'io_poll', while
at that moment there may be uncompleted IOs stuck in the hw queue. The
following polling routine won't help reap these IOs, since blk_poll()
will return immediately because of the cleared QUEUE_FLAG_POLL flag. Thus
these IOs will hang until they finally time out. The hang can be observed
with 'fio --engine=io_uring iodepth=1' while turning off 'io_poll' at the
same time.

To fix this, freeze and flush the request queue first when turning off
'io_poll'.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b357e4a6 21-Feb-2021 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

block: get rid of the trace rq insert wrapper

Get rid of the wrapper for trace_block_rq_insert() and call the function
directly.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9fb40717 21-Feb-2021 Bart Van Assche <bvanassche@acm.org>

block: Remove unused blk_pm_*() function definitions

Commit a1ce35fa4985 ("block: remove dead elevator code") removed the last
callers of blk_pm_requeue_request(), blk_pm_add_request() and
blk_pm_put_request(). Hence remove the definitions of these functions.
Removing these functions removes all users of the struct request nr_pending
member. Hence also remove 'nr_pending'. Note: 'nr_pending' is no longer
used since commit 7cedffec8e75 ("block: Make blk_get_request() block for
non-PM requests while suspended").

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

582cd91f 21-Feb-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:

- Two series of BFQ improvements (Paolo, Jan, Jia)

- Block iov_iter improvements (Pavel)

- bsg error path fix (Pan)

- blk-mq scheduler improvements (Jan)

- -EBUSY discard fix (Jan)

- bvec allocation improvements (Ming, Christoph)

- bio allocation and init improvements (Christoph)

- Store bdev pointer in bio instead of gendisk + partno (Christoph)

- Block trace point cleanups (Christoph)

- hard read-only vs read-only split (Christoph)

- Block based swap cleanups (Christoph)

- Zoned write granularity support (Damien)

- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...


f9ab4918 23-Jan-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Use llist_head for blk_cpu_done

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU without additional locking.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise the softirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions.

blk_mq_raise_softirq() needs a preempt-disable section to ensure the
request is enqueued on the same CPU as the softirq is raised.
Some callers (USB-storage) invoke this path in preemptible context.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
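
A small sketch of the pattern the commit relies on (the wrapper name is
illustrative, and rq->ipi_list as the request's llist node is assumed from
the description): llist_add() reports whether the per-CPU list was empty,
so the completion softirq only needs to be raised for the first entry:

    static void blk_complete_on_list(struct request *rq,
                                     struct llist_head *cpu_list)
    {
        /* lockless add; true means the list was empty before this request */
        if (llist_add(&rq->ipi_list, cpu_list))
            raise_softirq(BLOCK_SOFTIRQ);
    }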

0a2efafb 23-Jan-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Always complete remote completions requests in softirq

Controllers with multiple queues have their IRQ handlers pinned to a
CPU. The core shouldn't need to complete the request on a remote CPU.

Remove this case and always raise the softirq to complete the request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9355a9eb 31-Jan-2021 Satya Tangirala <satyat@google.com>

dm: support key eviction from keyslot managers of underlying devices

Now that device mapper supports inline encryption, add the ability to
evict keys from all underlying devices. When an upper layer requests
a key eviction, we simply iterate through all underlying devices
and evict that key from each device.

Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d3b17a24 31-Jan-2021 Satya Tangirala <satyat@google.com>

block/keyslot-manager: Introduce functions for device mapper support

Introduce blk_ksm_update_capabilities() to update the capabilities of
a keyslot manager (ksm) in-place. The pointer to a ksm in a device's
request queue may not be easily replaced, because upper layers like
the filesystem might access it (e.g. for programming keys/checking
capabilities) at the same time the device wants to replace that
request queue's ksm (and free the old ksm's memory). This function
allows the device to update the capabilities of the ksm in its request
queue directly. Devices can safely update the ksm this way without any
synchronization with upper layers *only* if the updated (new) ksm
continues to support all the crypto capabilities that the old ksm did
(see description below for blk_ksm_is_superset() for why this is so).

Also introduce blk_ksm_is_superset() which checks whether one ksm's
capabilities are a (not necessarily strict) superset of another ksm's.
The blk-crypto framework requires that crypto capabilities that were
advertised when a bio was created continue to be supported by the
device until that bio is ended - in practice this probably means that
a device's advertised crypto capabilities can *never* "shrink" (since
there's no synchronization between bio creation and when a device may
want to change its advertised capabilities) - so a previously
advertised crypto capability must always continue to be supported.
This function can be used to check that a new ksm is a valid
replacement for an old ksm.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
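
A sketch of what the superset check amounts to, following the capability
fields referenced here (crypto_modes_supported[] bitmasks of data unit
sizes and max_dun_bytes_supported); the real function may differ in detail:

    bool blk_ksm_is_superset(struct blk_keyslot_manager *ksm_superset,
                             struct blk_keyslot_manager *ksm_subset)
    {
        int i;

        if (!ksm_subset)
            return true;
        if (!ksm_superset)
            return false;

        /* every mode/data-unit-size pair of the subset must be supported */
        for (i = 0; i < ARRAY_SIZE(ksm_superset->crypto_modes_supported); i++) {
            if (ksm_subset->crypto_modes_supported[i] &
                ~ksm_superset->crypto_modes_supported[i])
                return false;
        }

        /* the supported DUN size must not shrink either */
        if (ksm_subset->max_dun_bytes_supported >
            ksm_superset->max_dun_bytes_supported)
            return false;

        return true;
    }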

7bdcc48f 31-Jan-2021 Satya Tangirala <satyat@google.com>

block/keyslot-manager: Introduce passthrough keyslot manager

The device mapper may map over devices that have inline encryption
capabilities, and to make use of those capabilities, the DM device must
itself advertise those inline encryption capabilities. One way to do this
would be to have the DM device set up a keyslot manager with a
"sufficiently large" number of keyslots, but that would use a lot of
memory. Also, the DM device itself has no "keyslots", and it doesn't make
much sense to talk about "programming a key into a DM device's keyslot
manager", so all that extra memory used to represent those keyslots is just
wasted. All a DM device really needs to be able to do is advertise the
crypto capabilities of the underlying devices in a coherent manner and
expose a way to evict keys from the underlying devices.

There are also devices with inline encryption hardware that do not
have a limited number of keyslots. One can send a raw encryption key along
with a bio to these devices (as opposed to typical inline encryption
hardware that require users to first program a raw encryption key into a
keyslot, and send the index of that keyslot along with the bio). These
devices also only need the same things from the keyslot manager that DM
devices need - a way to advertise crypto capabilities and potentially a way
to expose a function to evict keys from hardware.

So we introduce a "passthrough" keyslot manager that provides a way to
represent a keyslot manager that doesn't have just a limited number of
keyslots, and which does not require keys to be programmed into keyslots.
DM devices can set up a passthrough keyslot manager in their request
queues, and advertise appropriate crypto capabilities based on those of the
underlying devices. Blk-crypto does not attempt to program keys into any
keyslots in the passthrough keyslot manager. Instead, if/when the bio is
resubmitted to the underlying device, blk-crypto will try to program the
key into the underlying device's keyslot manager.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
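
A sketch of the passthrough setup this describes, assuming it boils down to
a zero-keyslot manager whose lock is still initialized, so blk-crypto skips
keyslot programming entirely:

    void blk_ksm_init_passthrough(struct blk_keyslot_manager *ksm)
    {
        memset(ksm, 0, sizeof(*ksm));
        init_rwsem(&ksm->lock);
    }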

508aebb8 27-Jan-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: introduce blk_queue_clear_zone_settings()

Introduce the internal function blk_queue_clear_zone_settings() to
cleanup all limits and resources related to zoned block devices. This
new function is called from blk_queue_set_zoned() when a disk zoned
model is set to BLK_ZONED_NONE. This particular case can happen when a
partition is created on a host-aware scsi disk.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a805a4fa 27-Jan-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: introduce zone_write_granularity limit

Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that
all writes into sequential write required zones be aligned to the device
physical block size. However, NVMe ZNS does not have this constraint and
allows write operations into sequential zones to be aligned to the
device logical block size. This inconsistency does not help with
software portability across device types.

To solve this, introduce the zone_write_granularity queue limit to
indicate the alignment constraint, in bytes, of write operations into
zones of a zoned block device. This new limit is exported as a
read-only sysfs queue attribute and the helper
blk_queue_zone_write_granularity() introduced for drivers to set this
limit.

The function blk_queue_set_zoned() is modified to set this new limit to
the device logical block size by default. NVMe ZNS devices as well as
zoned nullb devices use this default value as is. The scsi disk driver
is modified to execute the blk_queue_zone_write_granularity() helper to
set the zone write granularity of host-managed SMR disks to the disk
physical block size.

The accessor functions queue_zone_write_granularity() and
bdev_zone_write_granularity() are also introduced.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
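
A sketch of the setter helper introduced here, with its shape assumed from
the description: the granularity is simply recorded in the queue limits of
a zoned device:

    void blk_queue_zone_write_granularity(struct request_queue *q,
                                          unsigned int size)
    {
        if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
            return;

        q->limits.zone_write_granularity = size;
    }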

eafc63a9 27-Jan-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: use blk_queue_set_zoned in add_partition()

When changing the zoned model of host-aware zoned block devices, use
blk_queue_set_zoned() instead of directly assigning the gendisk queue
zoned limit.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ae29333f 04-Feb-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

block: add bio_add_zone_append_page

Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7a800a20 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: use bi_max_vecs to find the bvec pool

Instead of encoding of the bvec pool using magic bio flags, just use
a helper to find the pool based on the max_vecs value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

977be012 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: mark the bio as cloned in bio_iov_bvec_set

bio_iov_bvec_set clones the bio_vecs from the iter, and thus should be
treated like a cloned bio in every respect. That also includes not
touching bi_max_vecs as that is a property of the bio allocation and not
its current payload.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed97ce5e 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: set BIO_NO_PAGE_REF in bio_iov_bvec_set

bio_iov_bvec_set assigns the foreign bvec, so setting the NO_PAGE_REF
directly there seems like the best fit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

86004515 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: remove a layer of indentation in bio_iov_iter_get_pages

Remove a pointless layer of indentation after a return statement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0f2e6ab8 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: turn the nr_iovecs argument to bio_alloc* into an unsigned short

The bi_max_vecs and bi_vcnt fields are defined as unsigned short, so
don't allow passing larger values in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

de76fd89 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: remove the 1 and 4 vec bvec_slabs entries

All bios with up to 4 bvecs use the inline bvecs in the bio itself, so
don't bother to define bvec_slabs entries for them. Also decruftify
the bvec_slabs definition and initialization while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f007a3d6 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: streamline bvec_alloc

Avoid the pointless goto by trying the slab allocation first and falling
through to the mempool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f2c3eb9b 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: factor out a bvec_alloc_gfp helper

Clean up bvec_alloc a little by factoring out a helper for the gfp_t
manipulations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6ac0b715 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: move struct biovec_slab to bio.c

struct biovec_slab is only used inside of bio.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dc0b8a57 02-Feb-2021 Christoph Hellwig <hch@lst.de>

block: reuse BIO_INLINE_VECS for integrity bvecs

bvec_alloc always uses biovec_slabs, and thus always needs to use the
same number of inline vecs. Share a single definition for the data
and integrity bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eec79181 06-Feb-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few small regression fixes:

- NVMe pull request from Christoph:
- more quirks for buggy devices (Thorsten Leemhuis, Claus Stovgaard)
- update the email address for Keith (Keith Busch)
- fix an out of bounds access in nvmet-tcp (Sagi Grimberg)

- Regression fix for BFQ shallow depth calculations introduced in
this merge window (Lin)"

* tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-block:
nvmet-tcp: fix out-of-bounds access when receiving multiple h2cdata PDUs
bfq-iosched: Revert "bfq: Fix computation of shallow depth"
update the email address for Keith Bush
nvme-pci: ignore the subsysem NQN on Phison E16
nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs


388c705b 02-Feb-2021 Lin Feng <linf@wangsu.com>

bfq-iosched: Revert "bfq: Fix computation of shallow depth"

This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.

bfq.limit_depth passes word_depths[] as shallow_depth down to the sbitmap
core's sbitmap_get_shallow(), which uses just that number to limit the scan
depth of each bitmap word, per the formula:

scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%

That means the percentages 50%, 75%, 18%, 37% in bfq's comments are correct.
But after the patch 'bfq: Fix computation of shallow depth', we use
sbitmap.depth instead; as an example, take the following case:

sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
The computed bfqd->word_depths[] values are {128, 192, 48, 96}, and
three of the numbers exceed the core driver's 'sbitmap_word.depth = 64',
so they limit nothing.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8358c28a 02-Feb-2021 Ming Lei <ming.lei@redhat.com>

block: fix memory leak of bvec

bio_init() clears bio instance, so the bvec index has to be set after
bio_init(), otherwise bio->bi_io_vec may be leaked.

Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset")
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5851d3b0 21-Jan-2021 Eric Biggers <ebiggers@google.com>

block/keyslot-manager: introduce devm_blk_ksm_init()

Add a resource-managed variant of blk_ksm_init() so that drivers don't
have to worry about calling blk_ksm_destroy().

Note that the implementation uses a custom devres action to call
blk_ksm_destroy() rather than switching the two allocations to be
directly devres-managed, e.g. with devm_kmalloc(). This is because we
need to keep zeroing the memory containing the keyslots when it is
freed, and also because we want to continue using kvmalloc() (and there
is no devm_kvmalloc()).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20210121082155.111333-2-ebiggers@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

2ba1c4d1 29-Jan-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"All over the place fixes for this release:

- blk-cgroup iteration teardown resched fix (Baolin)

- NVMe pull request from Christoph:
- add another Write Zeroes quirk (Chaitanya Kulkarni)
- handle a no path available corner case (Daniel Wagner)
- use the proper RCU aware list_add helper (Chao Leng)

- bcache regression fix (Coly)

- bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12,
but for now, we'll make it IRQ safe (Damien)

- null_blk zoned init fix (Damien)

- add_partition() error handling fix (Dinghao)

- s390 dasd kobject fix (Jan)

- nbd fix for freezing queue while adding connections (Josef)

- tag queueing regression fix (Ming)

- revert of a patch that inadvertently meant that we regressed write
performance on raid (Maxim)"

* tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
null_blk: cleanup zoned mode initialization
nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
nvme-multipath: Early exit if no path is available
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
block: fix bd_size_lock use
blk-cgroup: Use cond_resched() when destroy blkgs
Revert "block: simplify set_init_blocksize" to regain lost performance
nbd: freeze the queue while we're adding connections
s390/dasd: Fix inconsistent kobject removal
block: Fix an error handling in add_partition
blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue


f7bf5e24 28-Jan-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

block: drop removed argument from kernel-doc of blk_execute_rq()

Commit 684da7628d93 ("block: remove unnecessary argument from
blk_execute_rq") changes the signature of blk_execute_rq(), but misses
to adjust its kernel-doc.

Hence, make htmldocs warns on ./block/blk-exec.c:78:

warning: Excess function parameter 'q' description in 'blk_execute_rq'

Drop removed argument from kernel-doc of blk_execute_rq() as well.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Guoqing Jiang <Guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7f31bee3 28-Jan-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

block: remove typo in kernel-doc of set_disk_ro()

Commit 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk")
provides some kernel-doc for set_disk_ro(), but introduces a small typo.

Hence, make htmldocs warns on ./block/genhd.c:1441:

warning: Function parameter or member 'read_only' not described in 'set_disk_ro'
warning: Excess function parameter 'ready_only' description in 'set_disk_ro'

Remove that typo in the kernel-doc for set_disk_ro().

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6b4eeba3 28-Jan-2021 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-cgroup: Remove obsolete macro

Remove the obsolete 'MAX_KEY_LEN' macro.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0fe37724 27-Jan-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: fix bd_size_lock use

Some block device drivers, e.g. the skd driver, call set_capacity() with
IRQs disabled. This results in lockdep complaining about inconsistent
lock states ("inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage")
because set_capacity takes a block device bd_size_lock using the
functions spin_lock() and spin_unlock(). Ensure a consistent locking
state by replacing these calls with spin_lock_irqsave() and
spin_unlock_irqrestore(). The same applies to bdev_set_nr_sectors().
With this fix, all lockdep complaints are resolved.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6c635cae 27-Jan-2021 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-cgroup: Use cond_resched() when destroy blkgs

On a !PREEMPT kernel, we can get the softlockup below when doing stress
testing that creates and destroys block cgroups repeatedly. The
reason is that it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop iteration and, if true, release the locks and do
cond_resched() to avoid this issue, since blkcg_destroy_blkgs() is not
called from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
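
A sketch of the reworked teardown loop as described above (illustrative,
not the exact hunk): when rescheduling is needed or the queue lock is
contended, drop the blkcg lock, reschedule, and retry:

    static void blkcg_destroy_blkgs(struct blkcg *blkcg)
    {
        might_sleep();

        spin_lock_irq(&blkcg->lock);
        while (!hlist_empty(&blkcg->blkg_list)) {
            struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
                                                struct blkcg_gq, blkcg_node);
            struct request_queue *q = blkg->q;

            if (need_resched() || !spin_trylock(&q->queue_lock)) {
                spin_unlock_irq(&blkcg->lock);
                cond_resched();
                spin_lock_irq(&blkcg->lock);
                continue;
            }

            blkg_destroy(blkg);
            spin_unlock(&q->queue_lock);
        }
        spin_unlock_irq(&blkcg->lock);
    }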

c6bf3f0e 26-Jan-2021 Christoph Hellwig <hch@lst.de>

block: use an on-stack bio in blkdev_issue_flush

There is no point in allocating memory for a synchronous flush.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
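
A sketch of the resulting on-stack variant (the exact signature at this
point in time is an assumption): no allocation is needed for a synchronous
preflush:

    int blkdev_issue_flush(struct block_device *bdev)
    {
        struct bio bio;

        bio_init(&bio, NULL, 0);
        bio_set_dev(&bio, bdev);
        bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
        return submit_bio_wait(&bio);
    }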

3175199a 26-Jan-2021 Christoph Hellwig <hch@lst.de>

block: split bio_kmalloc from bio_alloc_bioset

bio_kmalloc shares almost no logic with the bio_set based fast path
in bio_alloc_bioset. Split it into an entirely separate implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4eb1d689 26-Jan-2021 Christoph Hellwig <hch@lst.de>

blk-crypto: use bio_kmalloc in blk_crypto_clone_bio

Use bio_kmalloc instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7684fbde 05-Jun-2020 Jan Kara <jack@suse.cz>

bfq: Use only idle IO periods for think time calculations

Currently, whenever a bfq queue has a request queued, we add now -
last_completion_time to the think time statistics. This is however
misleading in case the process is able to submit several requests in
parallel, because e.g. if the queue has a request completed at time T0 and
then queues new requests at times T1, T2, then we will add T1-T0 and
T2-T0 to the think time statistics, which just doesn't make any sense (the
queue's think time is penalized by the queue being able to submit more
IO). So add to the think time statistics only time intervals when the
queue had no IO pending.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
[axboe: fix whitespace on empty line]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28c6def0 05-Jun-2020 Jan Kara <jack@suse.cz>

bfq: Use 'ttime' local variable

Use local variable 'ttime' instead of dereferencing bfqq.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

41e76c85 05-Jun-2020 Jan Kara <jack@suse.cz>

bfq: Avoid false bfq queue merging

bfq_setup_cooperator() uses bfqd->in_serv_last_pos to detect whether it
makes sense to merge the current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:

[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read

[reader]
numjobs=4
directory=/mnt

where the 4 processes will end up in one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with the slice_idle=1ms setting).

Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.

Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

49d1822b 24-Jan-2021 Chunguang Xu <brookxu@tencent.com>

blkcg: delete redundant get/put operations for queue

When calling blkcg_schedule_throttle(), for the same queue,
redundant get/put operations can be removed.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

482e302a 25-Jan-2021 Lei Chen <lennychen@tencent.com>

blk: wbt: remove unused parameter from wbt_should_throttle

The first parameter rwb is not used for this function.
So just remove it.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

46bbf653 26-Jan-2021 Christoph Hellwig <hch@lst.de>

block: inherit BIO_REMAPPED when cloning bios

Cloned bios can be used on the same device, in which case we need
to inherit the BIO_REMAPPED flag to avoid a double partition remap. When
the cloned bios are used on another device, bio_set_dev will clear the flag.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
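
A sketch of the clone-time rule described above (editorial illustration, as
it would sit in the bio clone fast path; variable names are illustrative):

    /* A clone initially targets the same bdev, so keep the remap state. */
    if (bio_flagged(bio_src, BIO_REMAPPED))
            bio_set_flag(bio_clone, BIO_REMAPPED);
    /* If the caller later retargets the clone, bio_set_dev() clears the flag. */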

a5bf0a92 25-Jan-2021 Jens Axboe <axboe@kernel.dk>

bfq: bfq_check_waker() should be static

It's only used in the same file, so mark it appropriately as static.

Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

71217df3 25-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: make waker-queue detection more robust

In the presence of many parallel I/O flows, the detection of waker
bfq_queues suffers from false positives. This commit addresses this
issue by making the filtering of actual wakers more selective. In more
detail, a candidate waker must be found to meet waker requirements
three times before being promoted to actual waker.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a5436b9 25-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: save also injection state on queue merging

To prevent injection information from being lost on bfq_queue merging,
the amount of service that a bfq_queue receives must also be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e673914d 25-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: save also weight-raised service on queue merging

To prevent weight-raising information from being lost on bfq_queue merging,
the amount of service that a bfq_queue receives must also be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d1f600fa 25-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: fix switch back from soft-rt weight-raising

A bfq_queue may happen to be deemed as soft real-time while it is
still enjoying interactive weight-raising. If this happens because of
a false positive, then the bfq_queue is likely to lose its soft
real-time status soon. Upon losing such a status, the bfq_queue must
get back its interactive weight-raising, if its interactive period is
not over yet. But this case is not handled. This commit corrects this
error.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7f1995c2 25-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: re-evaluate convenience of I/O plugging on rq arrivals

Upon an I/O-dispatch attempt, BFQ may detect that it was better to
plug I/O dispatch, and to wait for a new request to arrive for the
currently in-service queue. But the arrival of a new request for an
empty bfq_queue, and thus the switch from idle to busy of the
bfq_queue, may cause the scenario to change, and make plugging no
longer needed for service guarantees, or more convenient for
throughput. In this case, keeping I/O-dispatch plugged would certainly
lower throughput.

To address this issue, this commit performs such a check upon request
arrival, and stops plugging I/O when plugging is no longer beneficial.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eb2fd80f 25-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: replace mechanism for evaluating I/O intensity

Some BFQ mechanisms base their decisions on a bfq_queue also on
whether the bfq_queue is I/O bound. In this respect, the current logic
for evaluating whether a bfq_queue is I/O bound is rather rough. This
commit replaces that logic with a more effective one.

The new logic measures the percentage of time during which a bfq_queue
is active, and marks the bfq_queue as I/O bound if this percentage is
above a fixed threshold.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a905c37 25-Jan-2021 Christoph Hellwig <hch@lst.de>

block: skip bio_check_eod for partition-remapped bios

When an already remapped bio is resubmitted (e.g. by blk_queue_split),
bio_check_eod will compare the remapped bi_sector against the size
of the partition, leading to spurious I/O failures.

Skip the EOD check in this case.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c42bca92 09-Jan-2021 Pavel Begunkov <asml.silence@gmail.com>

bio: don't copy bvec for direct IO

The block layer spends quite a while in blkdev_direct_IO() copying and
initialising the bio's bvec. However, if we've already got a bvec in the
input iterator it might be reused in some cases, i.e. when the new
ITER_BVEC_FLAG_FIXED flag is set. Simple tests show a considerable
performance boost, and it also reduces the memory footprint.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0cf41e5e 09-Jan-2021 Pavel Begunkov <asml.silence@gmail.com>

block/psi: remove PSI annotations from direct IO

Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.

The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them clear out the BIO_WORKINGSET flag. Do the same for
dio_bio_submit(), because fs/direct-io constructs bios by hand, directly
calling bio_add_page().

Reported-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
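
A sketch of the flag clearing described above (editorial illustration; the
wrapper name is hypothetical, while bio_clear_flag() and BIO_WORKINGSET are
the real kernel identifiers):

    static void submit_dio_bio_sketch(struct bio *bio)
    {
            /* Direct IO is not a memory stall: drop any PSI annotation. */
            bio_clear_flag(bio, BIO_WORKINGSET);
            submit_bio(bio);
    }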

684da762 24-Jan-2021 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

block: remove unnecessary argument from blk_execute_rq

We can remove 'q' from blk_execute_rq as well after the previous change
in blk_execute_rq_nowait.

And more importantly, it never really was needed to start with, given
that we can trivially derive it from struct request.

Cc: linux-scsi@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-nfs@vger.kernel.org
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
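
A before/after sketch of the interface change described above (editorial
illustration; call signatures as they looked in this era, simplified):

    /* before: the queue was passed explicitly */
    blk_execute_rq(q, bd_disk, rq, at_head);

    /* after: the queue is derived internally from rq->q */
    blk_execute_rq(bd_disk, rq, at_head);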

8eeed0b5 24-Jan-2021 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

block: remove unnecessary argument from blk_execute_rq_nowait

The 'q' is not used since commit a1ce35fa4985 ("block: remove dead
elevator code"), also update the comment of the function.

And more importantly, it never really was needed to start with, given
that we can trivially derive it from struct request.

Cc: target-devel@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0f7b4bc6 19-Jan-2021 Pan Bian <bianpan2016@163.com>

bsg: free the request before return error code

Free the request rq before returning error code.

Fixes: 972248e9111e ("scsi: bsg-lib: handle bidi requests without block layer help")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ef49d40b 17-Jan-2021 Dinghao Liu <dinghao.liu@zju.edu.cn>

block: Fix an error handling in add_partition

Once we have called device_initialize(), we should use put_device() to
give up the reference on error, just like what we have done on failure
of device_add().

Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
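
A sketch of the error-handling rule described above (editorial illustration;
setup_partition() is a hypothetical placeholder for the work done between
the two device core calls, and pdev is the partition's struct device):

    device_initialize(pdev);

    err = setup_partition(pdev);    /* hypothetical intermediate step */
    if (err)
            goto out_put;

    err = device_add(pdev);
    if (err)
            goto out_put;
    return 0;

    out_put:
            /* After device_initialize(), drop the reference with put_device(),
             * which invokes the release callback that frees the object. */
            put_device(pdev);
            return err;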

2569063c 27-Dec-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue

In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against
q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE.

So fix it.

Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: f1b49fdc1c64 ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
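
A sketch of the corrected test described above (editorial illustration of how
it would look inside hctx_may_queue(); surrounding code omitted):

    if (blk_mq_is_sbitmap_shared(hctx->flags)) {
            struct request_queue *q = hctx->queue;

            /* Per-queue flag, so test q->queue_flags, not the hctx state bits. */
            if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
                    return true;
    }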

eec716a1 10-Jan-2021 Ming Lei <ming.lei@redhat.com>

block: move three bvec helpers' declarations into a private header

bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block
layer core functions, so there is no need to declare them in a public header.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

baa2c7c9 10-Jan-2021 Ming Lei <ming.lei@redhat.com>

block: set .bi_max_vecs as actual allocated vector number

bvec_alloc() may allocate more bio vectors than requested, so set
.bi_max_vecs to the actual allocated vector number instead of the requested
number. This helps filesystems build bigger bios, because a new bio often
won't be allocated until the current one becomes full.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9f180e31 10-Jan-2021 Ming Lei <ming.lei@redhat.com>

block: don't allocate inline bvecs if this bioset needn't bvecs

The inline bvecs won't be used if the user doesn't need bvecs (signalled by
not passing BIOSET_NEED_BVECS), so don't allocate bvecs in this situation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c495a176 10-Jan-2021 Ming Lei <ming.lei@redhat.com>

block: don't pass BIOSET_NEED_BVECS for q->bio_split

q->bio_split is only used by bio_split() for fast-cloning bios, which does
not need to allocate bvecs, so remove this flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

49d1ec85 10-Jan-2021 Ming Lei <ming.lei@redhat.com>

block: manage bio slab cache by xarray

Manage the bio slab cache via an xarray, using the slab cache size as the
xarray index and storing the 'struct bio_slab' instance as the entry.

This simplifies the code a lot, and at the same time makes it more readable
than before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
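
A sketch of the lookup scheme described above (editorial illustration; names
and error handling are simplified, and the real code also reference-counts
each bio_slab):

    static DEFINE_XARRAY(bio_slabs_sketch);

    static struct bio_slab *find_or_create_bio_slab_sketch(unsigned int size)
    {
            /* The slab size doubles as the xarray index. */
            struct bio_slab *bslab = xa_load(&bio_slabs_sketch, size);

            if (bslab)
                    return bslab;

            bslab = kzalloc(sizeof(*bslab), GFP_KERNEL);
            if (!bslab)
                    return NULL;
            /* ... create the kmem_cache for this size ... */
            if (xa_insert(&bio_slabs_sketch, size, bslab, GFP_KERNEL)) {
                    kfree(bslab);
                    return NULL;
            }
            return bslab;
    }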

1a23e06c 25-Dec-2020 huhai <huhai@tj.kylinos.cn>

bfq: don't duplicate code for different paths

As we can see, the function returns parent_sched_may_change whether
sd->next_in_service changes or not, so remove this redundant check.

Signed-off-by: huhai <huhai@tj.kylinos.cn>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b6e68ee8 11-Jan-2021 Jan Kara <jack@suse.cz>

blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues

Currently, when a non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance is rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

clients       MQ                    SQ                 MQ-Patched
Amean  1      39.12 (0.00%)      43.29 * -10.67%*      36.09 *   7.74%*
Amean  2     128.58 (0.00%)     101.30 *  21.22%*      96.14 *  25.23%*
Amean  4     577.42 (0.00%)     494.47 *  14.37%*     508.49 *  11.94%*
Amean  8     610.95 (0.00%)     363.86 *  40.44%*     362.12 *  40.73%*
Amean  16    391.78 (0.00%)     261.49 *  33.25%*     282.94 *  27.78%*
Amean  32    324.64 (0.00%)     267.71 *  17.54%*     233.00 *  28.23%*
Amean  64    295.04 (0.00%)     253.02 *  14.24%*     242.37 *  17.85%*
Amean  512 10281.61 (0.00%)   10211.16 *   0.69%*   10447.53 *  -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ac83c64 11-Jan-2021 Jan Kara <jack@suse.cz>

Revert "blk-mq, elevator: Count requests per hctx to improve performance"

This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.

Since both mq-deadline and BFQ completely ignore the hctx they are passed to
their dispatch function and dispatch whatever request they deem fit,
checking whether any request for a particular hctx is queued is just
pointless, since we'll very likely get a request from a different hctx
anyway. In the following commit we'll deal with lock contention in these
IO schedulers in presence of multiple HW queues in a different way.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2391d13e 22-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: do not expire a queue when it is the only busy one

This commit preserves I/O-dispatch plugging for a special symmetric
case that may suddenly turn into asymmetric: the case where only one
bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not
cause any harm to any other queues in terms of service guarantees. In
contrast, it avoids the following unlucky sequence of events: (1) bfqq
is expired, (2) a new queue with a lower weight than bfqq becomes busy
(or more queues), (3) the new queue is served until a new request
arrives for bfqq, (4) when bfqq is finally served, there are so many
requests of the new queue in the drive that the pending requests for
bfqq take a lot of time to be served. In particular, event (2) may
cause even already dispatched requests of bfqq to be delayed, inside
the drive. So, to avoid this series of events, the scenario is
preventively declared as asymmetric also if bfqq is the only busy
queue. By doing so, I/O-dispatch plugging is performed for bfqq.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3c337690 22-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: avoid spurious switches to soft_rt of interactive queues

BFQ tags some bfq_queues as interactive or soft_rt if it deems that
these bfq_queues contain the I/O of, respectively, interactive or soft
real-time applications. BFQ privileges both these special types of
bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ
mainly raises the weight of the bfq_queue. In particular, soft_rt
bfq_queues get a higher weight than interactive bfq_queues.

A bfq_queue may turn from interactive to soft_rt. And this leads to a
tricky issue. Soft real-time applications usually start with an
I/O-bound, interactive phase, in which they load themselves into main
memory. BFQ correctly detects this phase, and keeps the bfq_queues
associated with the application in interactive mode for a
while. Problems arise when the I/O pattern of the application finally
switches to soft real-time. One of the conditions for a bfq_queue to
be deemed as soft_rt is that the bfq_queue does not consume too much
bandwidth. But the bfq_queues associated with a soft real-time
application consume as much bandwidth as they can in the loading phase
of the application. So, after the application becomes truly soft
real-time, a lot of time should pass before the average bandwidth
consumed by its bfq_queues finally drops to a value acceptable for
soft_rt bfq_queues. As a consequence, there might be a time gap during
which the application is not privileged at all, because its bfq_queues
are not interactive any longer, but cannot be deemed as soft_rt yet.

To avoid this problem, BFQ pretends that an interactive bfq_queue
consumes zero bandwidth, and allows an interactive bfq_queue to switch
to soft_rt. Yet, this fake zero-bandwidth consumption easily causes
the bfq_queue to often switch to soft_rt deceptively, during its
loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth
correctly computed, and therefore soon switches back to
interactive. Then it switches again to soft_rt, and so on. These
spurious fluctuations usually cause losses of throughput, because they
deceive BFQ's mechanisms for boosting throughput (injection,
I/O-plugging avoidance, ...).

This commit addresses this issue as follows:
1) It does compute actual bandwidth consumption also for interactive
bfq_queues. This avoids the above false positives.
2) When a bfq_queue switches from interactive to normal mode, the
consumed bandwidth is reset (forgotten). This allows the
bfq_queue to enjoy soft_rt very quickly. In particular, two
alternatives are possible in this switch:
- the bfq_queue still has backlog, and therefore there is a budget
already scheduled to serve the bfq_queue; in this case, the
scheduling of the current budget of the bfq_queue is not
hindered, because only the scheduling of the next budget will
be affected by the weight drop. After that, if the bfq_queue is
actually in a soft_rt phase, and becomes empty during the
service of its current budget, which is the natural behavior of
a soft_rt bfq_queue, then the bfq_queue will be considered as
soft_rt when its next I/O arrives. If, in contrast, the
bfq_queue remains constantly non-empty, then its next budget
will be scheduled with a low weight, which is the natural
treatment for an I/O-bound (non soft_rt) bfq_queue.
- the bfq_queue is empty; in this case, the bfq_queue may be
considered unjustly soft_rt when its new I/O arrives. Yet
the problem is now much smaller than before, because it is
unlikely that more than one spurious fluctuation occurs.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91b896f6 22-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: do not raise non-default weights

BFQ heuristics try to detect interactive I/O, and raise the weight of
the queues containing such I/O. Yet, if the user also changes the
weight of a queue (i.e., the user changes the ioprio of the process
associated with that queue), then it is most likely better to prevent
BFQ heuristics from silently changing that same weight.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ab1fb47e 22-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: increase time window for waker detection

Tests on slower machines showed the current window to be way too
small. This commit increases it.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d4fc3640 22-Jan-2021 Jia Cheng Hu <jia.jiachenghu@gmail.com>

block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

Since commit c5089591c3ba ("block, bfq: detect wakers and
unconditionally inject their I/O"), when the in-service bfq_queue, say
Q, is temporarily empty, BFQ checks whether there are I/O requests to
inject (also) from the waker bfq_queue for Q. To this goal, the value
pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the
current implementation mistakenly looks at bfqq->next_rq, which
instead points to the next request of the currently served queue.

This mistake evidently causes losses of throughput in scenarios with
waker bfq_queues.

This commit corrects this mistake.

Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O")
Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5f74eca 22-Jan-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: use half slice_idle as a threshold to check short ttime

The value of the I/O plugging (idling) timeout is used also as the
think-time threshold to decide whether a process has a short think
time. In this respect, a good value of this timeout for rotational
drives is on the order of several ms. Yet, this is often too long a
time interval to be effective as a think-time threshold. This commit
mitigates this problem (by a lot, according to tests), by halving the
threshold.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a33df75c 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: use an xarray for disk->part_tbl

Now that no fast path lookups in the partition table are left, there is
no point in micro-optimizing the data structure for it. Just use a bog
standard xarray.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0470dd9d 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: remove DISK_PITER_REVERSE

There is no good reason to iterate backwards when deleting all partitions in
del_gendisk, just like we don't in blk_drop_partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bc359d03 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: add a disk_uevent helper

Add a helper to call kobject_uevent for the disk and all partitions, and
unexport the disk_part_iter_* helpers that are now only used in the core
block code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b6e522c 24-Jan-2021 Christoph Hellwig <hch@lst.de>

blk-mq: use ->bi_bdev for I/O accounting

Remove the reverse map from a sector to a partition for I/O accounting by
simply using ->bi_bdev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

99dfc43e 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: use ->bi_bdev for bio based I/O accounting

Rework the I/O accounting for bio based drivers to use ->bi_bdev. This
means all drivers can now simply use bio_start_io_acct to start
accounting, and it will take partitions into account automatically. To
end I/O account either bio_end_io_acct can be used if the driver never
remaps I/O to a different device, or bio_end_io_acct_remapped if the
driver did remap the I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
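
A sketch of the accounting flow described above for a bio based driver that
never remaps (editorial illustration; the two accounting helpers are the real
API, while the submit function is a hypothetical driver hook):

    static void my_driver_handle_bio_sketch(struct bio *bio)
    {
            /* Partition-aware via ->bi_bdev; returns the start time. */
            unsigned long start_time = bio_start_io_acct(bio);

            /* ... process the bio ... */

            bio_end_io_acct(bio, start_time);
    }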

30c5d345 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: do not reassign ->bi_bdev when partition remapping

There is no good reason to reassign ->bi_bdev when remapping the
partition-relative block number to the device wide one, as all the
information required by the drivers comes from the gendisk anyway.

Keeping the original ->bi_bdev alive will allow us to greatly simplify
the partition-aware I/O accounting.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2f9f6221 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: simplify submit_bio_checks a bit

Merge a few checks for whole devices vs partitions to streamline the
sanity checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

309dca30 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

947139bf 09-Jan-2021 Christoph Hellwig <hch@lst.de>

block: propagate BLKROSET on the whole device to all partitions

Change the policy so that a BLKROSET on the whole device also affects
partitions. To quote Martin K. Petersen:

It's very common for database folks to twiddle the read-only state of
block devices and partitions. I know that our users will find it very
counter-intuitive that setting /dev/sda read-only won't prevent writes
to /dev/sda1.

The existing behavior is inconsistent in the sense that doing:

# blockdev --setro /dev/sda
# echo foo > /dev/sda1

permits writes. But:

# blockdev --setro /dev/sda
<something triggers revalidate>
# echo foo > /dev/sda1

doesn't.

And a subsequent:

# blockdev --setrw /dev/sda
# echo foo > /dev/sda1

doesn't work either since sda1's read-only policy has been inherited
from the whole-disk device.

You need to do:

# blockdev --rereadpt

after setting the whole-disk device rw to effectuate the same change on
the partitions, otherwise they are stuck being read-only indefinitely.

However, setting the read-only policy on a partition does *not* require
the revalidate step. As a matter of fact, doing the revalidate will blow
away the policy setting you just made.

So the user needs to take different actions depending on whether they
are trying to read-protect a whole-disk device or a partition. Despite
using the same ioctl. That is really confusing.

I have lost count how many times our customers have had data clobbered
because of ambiguity of the existing whole-disk device policy. The
current behavior violates the principle of least surprise by letting the
user think they write protected the whole disk when they actually
didn't.

Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

52f019d4 09-Jan-2021 Christoph Hellwig <hch@lst.de>

block: add a hard-readonly flag to struct gendisk

Commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading
partition") addressed a long-standing problem with user read-only
policy being overridden as a result of a device-initiated revalidate.
The commit has since been reverted due to a regression that left some
USB devices read-only indefinitely.

To fix the underlying problems with revalidate we need to keep track
of hardware state and user policy separately.

The gendisk has been updated to reflect the current hardware state set
by the device driver. This is done to allow returning the device to
the hardware state once the user clears the BLKROSET flag.

The resulting semantics are as follows:

- If BLKROSET sets a given partition read-only, that partition will
remain read-only even if the underlying storage stack initiates a
revalidate. However, the BLKRRPART ioctl will cause the partition
table to be dropped and any user policy on partitions will be lost.

- If BLKROSET has not been set, both the whole disk device and any
partitions will reflect the current write-protect state of the
underlying device.

Based on a patch from Martin K. Petersen <martin.petersen@oracle.com>.

Reported-by: Oleksii Kurochko <olkuroch@cisco.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6f0d9689 09-Jan-2021 Christoph Hellwig <hch@lst.de>

block: remove the NULL bdev check in bdev_read_only

Only a single caller can end up in bdev_read_only, so move the check
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed41fd07 10-Jan-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Missing CRC32 selections (Arnd)

- Fix for a merge window regression with bdev inode init (Christoph)

- bcache fixes

- rnbd fixes

- NVMe pull request from Christoph:
- fix a race in the nvme-tcp send code (Sagi Grimberg)
- fix a list corruption in an nvme-rdma error path (Israel Rukshin)
- avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar)
- add the subsystem NQN quirk for a Samsung drive (Gopal Tiwari)
- fix two compiler warnings in nvme-fcloop (James Smart)
- don't call sleeping functions from irq context in nvme-fc (James Smart)
- remove an unused argument (Max Gurtovoy)
- remove unused exports (Minwoo Im)

- Use-after-free fix for partition iteration (Ming)

- Missing blk-mq debugfs flag annotation (John)

- Bdev freeze regression fix (Satya)

- blk-iocost NULL pointer deref fix (Tejun)

* tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block: (26 commits)
bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket
bcache: check unsupported feature sets for bcache register
bcache: fix typo from SUUP to SUPP in features.h
bcache: set pdev_set_uuid before second loop iteration
blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
block/rnbd-clt: avoid module unload race with close confirmation
block/rnbd: Adding name to the Contributors List
block/rnbd-clt: Fix sg table use after free
block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close
block/rnbd: Select SG_POOL for RNBD_CLIENT
block: pre-initialize struct block_device in bdev_alloc_inode
fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb
nvme: remove the unused status argument from nvme_trace_bio_complete
nvmet-rdma: Fix list_del corruption on queue establishment failure
nvme: unexport functions with no external caller
nvme: avoid possible double fetch in handling CQE
nvme-tcp: Fix possible race of io_work and direct send
nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings
...


02f938e9 08-Jan-2021 John Garry <john.garry@huawei.com>

blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED

Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
something like:

root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

Add the decoding for that flag.

Fixes: 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aebf5db9 20-Dec-2020 Ming Lei <ming.lei@redhat.com>

block: fix use-after-free in disk_part_iter_next

Make sure that bdgrab() is done on the 'block_device' instance before
referring to it, to avoid a use-after-free.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6d4d2735 10-Dec-2020 Jan Kara <jack@suse.cz>

bfq: Fix computation of shallow depth

BFQ computes the number of tags it allows to be allocated for each request
type based on the tag bitmap. However, it uses 1 << bitmap.shift as the
number of available tags, which is wrong. 'shift' is just an internal bitmap
value containing the logarithm of how many bits the bitmap uses in each
bitmap word. Thus the number of tags allowed for some request types can be
far too low. Use the proper bitmap.depth, which holds the number of tags,
instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
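
A sketch of the fix described above (editorial illustration; bt points at the
sbitmap_queue used for tag allocation, and only the expression changes):

    /* wrong: sb.shift is the per-word bit-count exponent, not the tag count */
    depth = 1U << bt->sb.shift;

    /* right: sb.depth is the actual number of available tags */
    depth = bt->sb.depth;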

d16baa3f 04-Jan-2021 Tejun Heo <tj@kernel.org>

blk-iocost: fix NULL iocg deref from racing against initialization

When initializing iocost for a queue, its rqos should be registered before
the blkcg policy is activated to allow policy data initialization to look up
the associated ioc. This unfortunately means that the rqos methods can be
called on bios before iocgs are attached to all existing blkgs.

While the race is theoretically possible on ioc_rqos_throttle(), it mostly
happened in ioc_rqos_merge() due to the difference in how they lookup ioc.
The former determines it from the passed in @rqos and then bails before
dereferencing iocg if the looked up ioc is disabled, which most likely is
the case if initialization is still in progress. The latter looked up ioc by
dereferencing the possibly NULL iocg making it a lot more prone to actually
triggering the bug.

* Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
up ioc for consistency.

* Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
dereferencing it.

* Explain the danger of NULL iocgs in blk_iocost_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eda809ae 01-Jan-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).

The big core two fixes are for power management ("block: Do not accept
any requests while suspended" and "block: Fix a race in the runtime
power management code") which finally sorts out the resume problems
we've occasionally been having.

To make the resume fix, there are seven necessary precursors which
effectively renames REQ_PREEMPT to REQ_PM, so every "special" request
in block is automatically a power management exempt one.

All of the non-PM preempt cases are removed except for the one in the
SCSI Parallel Interface (spi) domain validation which is a genuine
case where we have to run requests at high priority to validate the
bus so this becomes an autopm get/put protected request"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
scsi: cxgb4i: Fix TLS dependency
scsi: ufs: Un-inline ufshcd_vops_device_reset function
scsi: ufs: Re-enable WriteBooster after device reset
scsi: ufs-mediatek: Use correct path to fix compile error
scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
scsi: block: Do not accept any requests while suspended
scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
scsi: block: Introduce BLK_MQ_REQ_PM
scsi: block: Fix a race in the runtime power management code
scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
scsi: ufs-pci: Fix restore from S4 for Intel controllers
scsi: ufs-mediatek: Keep VCC always-on for specific devices
scsi: ufs: Allow regulators being always-on
scsi: ufs: Clear UAC for RPMB after ufshcd resets
...


dc304326 28-Dec-2020 Andres Freund <andres@anarazel.de>

block: add debugfs stanza for QUEUE_FLAG_NOWAIT

This was missed in 021a24460dc2 and leads to the numeric value of
QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
/sys/kernel/debug/block/*/state.

Fixes: 021a24460dc28e7412aecfae89f60e1847e685c0
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
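
A sketch of the debugfs addition described above (editorial illustration; an
excerpt of the blk_queue_flag_name[] table in block/blk-mq-debugfs.c with the
neighbouring entries abbreviated):

    static const char *const blk_queue_flag_name[] = {
            /* ... existing QUEUE_FLAG_NAME() entries ... */
            QUEUE_FLAG_NAME(NOWAIT),        /* the previously missing entry */
    };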

7b51e703 10-Dec-2020 Christoph Hellwig <hch@lst.de>

block: update some copyrights

Update copyrights for files that have gotten some major rewrites lately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

71425189 04-Dec-2020 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

blk-mq: Don't complete on a remote CPU in force threaded mode

With force threaded interrupts enabled, raising softirq from an SMP
function call will always result in waking the ksoftirqd thread. This is
not optimal given that the thread runs at SCHED_OTHER priority.

Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
the force threaded mode) is bad because the completion handler may
acquire sleeping locks which violate the locking context.

Disable request completing on a remote CPU in force threaded mode.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

76efc1c7 10-Dec-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-iocost: Add iocg idle state tracepoint

It will be helpful to trace the iocg's whole state, including active and
idle state. And we can easily expand the original iocost_iocg_activate
trace event to support a state trace class, including active and idle
state tracing.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e6582cb5 30-Nov-2020 Daniel Wagner <dwagner@suse.de>

blk-mq: Remove 'running from the wrong CPU' warning

It's guaranteed that no request is in flight when a hctx is going
offline. This warning is only triggered when the wq's CPU is hot
plugged and the blk-mq is not synced up yet.

As this state is temporary and the request is still processed
correctly, better remove the warning as this is the fast path.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

60f7c503 16-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This consists of the usual driver updates (ufs, qla2xxx, smartpqi,
target, zfcp, fnic, mpt3sas, ibmvfc) plus a load of cleanups, a major
power management rework and a load of assorted minor updates.

There are a few core updates (formatting fixes being the big one) but
nothing major this cycle"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
scsi: mpt3sas: Update driver version to 36.100.00.00
scsi: mpt3sas: Handle trigger page after firmware update
scsi: mpt3sas: Add persistent MPI trigger page
scsi: mpt3sas: Add persistent SCSI sense trigger page
scsi: mpt3sas: Add persistent Event trigger page
scsi: mpt3sas: Add persistent Master trigger page
scsi: mpt3sas: Add persistent trigger pages support
scsi: mpt3sas: Sync time periodically between driver and firmware
scsi: qla2xxx: Update version to 10.02.00.104-k
scsi: qla2xxx: Fix device loss on 4G and older HBAs
scsi: qla2xxx: If fcport is undergoing deletion complete I/O with retry
scsi: qla2xxx: Fix the call trace for flush workqueue
scsi: qla2xxx: Fix flash update in 28XX adapters on big endian machines
scsi: qla2xxx: Handle aborts correctly for port undergoing deletion
scsi: qla2xxx: Fix N2N and NVMe connect retry failure
scsi: qla2xxx: Fix FW initialization error on big endian machines
scsi: qla2xxx: Fix crash during driver load on big endian machines
scsi: qla2xxx: Fix compilation issue in PPC systems
scsi: qla2xxx: Don't check for fw_started while posting NVMe command
scsi: qla2xxx: Tear down session if FW say it is down
...


69f637c3 16-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"Nothing major in here:

- NVMe pull request from Christoph:
- nvmet passthrough improvements (Chaitanya Kulkarni)
- fcloop error injection support (James Smart)
- read-only support for zoned namespaces without Zone Append
(Javier González)
- improve some error message (Minwoo Im)
- reject I/O to offline fabrics namespaces (Victor Gladkov)
- PCI queue allocation cleanups (Niklas Schnelle)
- remove an unused allocation in nvmet (Amit Engel)
- a Kconfig spelling fix (Colin Ian King)
- nvme_req_qid simplication (Baolin Wang)

- MD pull request from Song:
- Fix race condition in md_ioctl() (Dae R. Jeong)
- Initialize read_slot properly for raid10 (Kevin Vigor)
- Code cleanup (Pankaj Gupta)
- md-cluster resync/reshape fix (Zhao Heming)

- Move null_blk into its own directory (Damien Le Moal)

- null_blk zone and discard improvements (Damien Le Moal)

- bcache race fix (Dongsheng Yang)

- Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
Lutz Pogrell, Md Haris Iqbal)

- lightnvm NULL pointer deref fix (tangzhenhao)

- sr in_interrupt() removal (Sebastian Andrzej Siewior)

- FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
as it made it easier for them to funnel the feature through the
block driver tree.

- Follow up fixes (Colin Ian King)"

* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
block: drop dead assignments in loop_init()
sr: Remove in_interrupt() usage in sr_init_command().
sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
cdrom: Reset sector_size back it is not 2048.
drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
null_blk: Move driver into its own directory
null_blk: Allow controlling max_hw_sectors limit
null_blk: discard zones on reset
null_blk: cleanup discard handling
null_blk: Improve implicit zone close
null_blk: improve zone locking
block: Align max_hw_sectors to logical blocksize
null_blk: Fail zone append to conventional zones
null_blk: Fix zone size initialization
bcache: fix race between setting bdev state to none and new write request direct to backing
block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
block/rnbd: call kobject_put in the failure path
Documentation/ABI/rnbd-srv: add document for force_close
block/rnbd-srv: close a mapped device from server side.
...


ac7ac461 16-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
"Another series of killing more code than what is being added, again
thanks to Christoph's relentless cleanups and tech debt tackling.

This contains:

- blk-iocost improvements (Baolin Wang)

- part0 iostat fix (Jeffle Xu)

- Disable iopoll for split bios (Jeffle Xu)

- block tracepoint cleanups (Christoph Hellwig)

- Merging of struct block_device and hd_struct (Christoph Hellwig)

- Rework/cleanup of how block device sizes are updated (Christoph
Hellwig)

- Simplification of gendisk lookup and removal of block device
aliasing (Christoph Hellwig)

- Block device ioctl cleanups (Christoph Hellwig)

- Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

- Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

- sbitmap improvements (Pavel Begunkov)

- Hybrid polling fix (Pavel Begunkov)

- bvec iteration improvements (Pavel Begunkov)

- Zone revalidation fixes (Damien Le Moal)

- blk-throttle limit fix (Yu Kuai)

- Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
blk-mq: fix msec comment from micro to milli seconds
blk-mq: update arg in comment of blk_mq_map_queue
blk-mq: add helper allocating tagset->tags
Revert "block: Fix a lockdep complaint triggered by request queue flushing"
nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
block: disable iopoll for split bio
block: Improve blk_revalidate_disk_zones() checks
sbitmap: simplify wrap check
sbitmap: replace CAS with atomic and
sbitmap: remove swap_lock
sbitmap: optimise sbitmap_deferred_clear()
blk-mq: skip hybrid polling if iopoll doesn't spin
blk-iocost: Factor out the base vrate change into a separate function
blk-iocost: Factor out the active iocgs' state check into a separate function
blk-iocost: Move the usage ratio calculation to the correct place
blk-iocost: Remove unnecessary advance declaration
blk-iocost: Fix some typos in comments
blktrace: fix up a kerneldoc comment
block: remove the request_queue to argument request based tracepoints
...


adb35e8d 14-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

- migrate_disable/enable() support which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
which aims to replace kmap_atomic().

- A fair amount of topology and NUMA related improvements

- Improvements for the frequency invariant calculations

- Enhanced robustness for the global CPU priority tracking and decision
making

- The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...


fa94ba8a 04-Dec-2020 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: fix msec comment from micro to milli seconds

The delay to wait before running the queue is in millisecond units; it is
passed to delayed work via msecs_to_jiffies(), which converts milliseconds
to jiffies.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d220a214 04-Dec-2020 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: update arg in comment of blk_mq_map_queue

Update the mis-named argument description of blk_mq_map_queue(). This patch
also updates the description of that argument to refer to the software queue
percpu context.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91cdf265 04-Dec-2020 Minwoo Im <minwoo.im.dev@gmail.com>

blk-mq: add helper allocating tagset->tags

tagset->tags is allocated in blk_mq_alloc_tag_set() rather than being
reallocated. This patch adds a helper to make its meaning explicit: it
allocates rather than reallocates.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

52abca64 08-Dec-2020 Alan Stern <stern@rowland.harvard.edu>

scsi: block: Do not accept any requests while suspended

blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
power management state. Now that SCSI domain validation no longer depends
on this behavior, modify the behavior of blk_queue_enter() as follows:

- Do not accept any requests while suspended.

- Only process power management requests while suspending or resuming.

Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
causes runtime-suspended devices not to resume as they should. The request
which should cause a runtime resume instead gets issued directly, without
resuming the device first. Of course the device can't handle it properly,
the I/O fails, and the device remains suspended.

The problem is fixed by checking that the queue's runtime-PM status isn't
RPM_SUSPENDED before allowing a request to be issued, and queuing a
runtime-resume request if it is. In particular, the inline
blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
code is unified by merging the surrounding checks into the routine. If the
queue isn't set up for runtime PM, or there currently is no restriction on
allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
is queued and the request is blocked until conditions are more suitable.

[ bvanassche: modified commit message and removed Cc: stable because
without the previous patches from this series this patch would break
parallel SCSI domain validation + introduced queue_rpm_status() ]

Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-and-tested-by: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

a4d34da7 08-Dec-2020 Bart Van Assche <bvanassche@acm.org>

scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT

Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
used by any kernel code.

Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

0854bcdc 08-Dec-2020 Bart Van Assche <bvanassche@acm.org>

scsi: block: Introduce BLK_MQ_REQ_PM

Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
functions set RQF_PM. This is the first step towards removing
BLK_MQ_REQ_PREEMPT.

Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Can Guo <cang@codeaurora.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

fa4d0f19 08-Dec-2020 Bart Van Assche <bvanassche@acm.org>

scsi: block: Fix a race in the runtime power management code

With the current implementation the following race can happen:

* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
blk_mq_unfreeze_queue().

* blk_queue_enter() calls blk_queue_pm_only() and that function returns
true.

* blk_queue_enter() calls blk_pm_request_resume() and that function does
not call pm_request_resume() because the queue runtime status is
RPM_ACTIVE.

* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.

Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Stanley Chu <stanley.chu@mediatek.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

7aa390ec 02-Dec-2020 Ming Lei <ming.lei@redhat.com>

Revert "block: Fix a lockdep complaint triggered by request queue flushing"

This reverts commit b3c6a59975415bde29cfd76ff1ab008edbf614a9.

Now that nvme-loop's own lock class avoids the lockdep warning of 'possible
recursive locking', there is no need to apply a dynamically allocated lock
class key, so revert commit b3c6a5997541 ("block: Fix a lockdep complaint
triggered by request queue flushing").

This fixes the horrible SCSI probe delay issue on megaraid_sas, where the
whole probe was reported to take more than half an hour.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fb01a293 02-Dec-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: add new API of blk_mq_hctx_set_fq_lock_class

flush_end_io() may be called recursively from some driver, such as
nvme-loop, so lockdep may complain 'possible recursive locking'.
Commit b3c6a5997541("block: Fix a lockdep complaint triggered by
request queue flushing") tried to address this issue by assigning
dynamically allocated per-flush-queue lock class. This solution
adds synchronize_rcu() for each hctx's release handler, and causes
horrible SCSI MQ probe delay(more than half an hour on megaraid sas).

Add new API of blk_mq_hctx_set_fq_lock_class() for these drivers, so
we just need to use driver specific lock class for avoiding the
lockdep warning of 'possible recursive locking'.
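
A hypothetical driver-side usage sketch; the hctx init callback and key
names below are illustrative, not taken from a specific driver:

	static struct lock_class_key my_drv_fq_lock_key;

	static int my_drv_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
				    unsigned int hctx_idx)
	{
		/* give this driver's flush queues their own lockdep class so
		 * recursive flush_end_io() calls are not flagged as deadlocks */
		blk_mq_hctx_set_fq_lock_class(hctx, &my_drv_fq_lock_key);
		return 0;
	}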

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cc29e1bf 26-Nov-2020 Jeffle Xu <jefflexu@linux.alibaba.com>

block: disable iopoll for split bio

iopoll was initially intended for small, latency-sensitive IO. It doesn't
work well for big IO, especially when it needs to be split into multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.

Besides, bio splitting may cause more trouble for iopoll, which isn't
supposed to be used for big IO in the first place.

iopoll for a split bio may cause a potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on the timeout
mechanism, thus causing a potential hang.

iopoll for a split bio may also cause a hang for sync polling. Currently
both the blkdev and the iomap-based fs (ext4/xfs, etc) support sync polling
in the direct IO routine. These routines will submit a bio without the
REQ_NOWAIT flag set, and then start sync polling in the current process
context. The process may hang in blk_mq_get_tag() if the submitted bio has
to be split into multiple bios and can rapidly exhaust the queue depth. The
process is then waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, and thus a
deadlock occurs.

To avoid the subtle troubles described above, just disable iopoll for a
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. That should be acceptable
since the returned cookie is never used for non-HIPRI IO.
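
Conceptually the change boils down to something like the sketch below;
bio_was_split and cookie stand in for whatever condition and return value
the splitting path actually uses, so treat this as illustrative only:

	if (bio_was_split) {
		/* polling only the last split bio's cookie cannot complete
		 * the whole IO, so drop the polling hint entirely ... */
		bio->bi_opf &= ~REQ_HIPRI;
		/* ... and report a cookie that callers must not poll on */
		cookie = BLK_QC_T_NONE;
	}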

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

817046ec 19-Nov-2020 Damien Le Moal <damien.lemoal@wdc.com>

block: Align max_hw_sectors to logical blocksize

Block device drivers do not have to call blk_queue_max_hw_sectors() to
set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
is acceptable. However, this limit (255 sectors) may not be aligned to
the device logical block size, in which case it cannot be used as-is as
the maximum request size. This is the case for the null_blk device driver.

Modify blk_queue_max_hw_sectors() to make sure that the request size
limits specified by the max_hw_sectors and max_sectors queue limits
are always aligned to the device logical block size. Additionally, to
avoid introducing a dependence on the execution order of this function
with blk_queue_logical_block_size(), also modify
blk_queue_logical_block_size() to perform the same alignment when the
logical block size is set after max_hw_sectors.
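
The core of the change can be sketched like this (field names as commonly
seen in blk-settings.c; treat it as illustrative rather than the exact
diff):

	/* round both request size limits down to the logical block size */
	max_hw_sectors = round_down(max_hw_sectors,
				    limits->logical_block_size >> SECTOR_SHIFT);
	max_sectors = round_down(max_sectors,
				 limits->logical_block_size >> SECTOR_SHIFT);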

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2afdeb23 11-Nov-2020 Damien Le Moal <damien.lemoal@wdc.com>

block: Improve blk_revalidate_disk_zones() checks

Improves the checks on the zones of a zoned block device done in
blk_revalidate_disk_zones() by making sure that the device report_zones
method did report at least one zone and that the zones reported exactly
cover the entire disk capacity, that is, that there are no missing zones
at the end of the disk sector range.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f6f371f7 06-Dec-2020 Pavel Begunkov <asml.silence@gmail.com>

blk-mq: skip hybrid polling if iopoll doesn't spin

If blk_poll() is not going to spin (i.e. @spin=false), it also must not
sleep in hybrid polling, otherwise it might be pretty surprising for
users trying to do a quick check and expecting no-wait behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

926f75f6 26-Nov-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-iocost: Factor out the base vrate change into a separate function

Factor out the base vrate change code into a separate function
to simplify ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2474787a 26-Nov-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-iocost: Factor out the active iocgs' state check into a separate function

Factor out the iocgs' state check into a separate function to
simplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c09245f6 26-Nov-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-iocost: Move the usage ratio calculation to the correct place

We only use the hweight based usage ratio to calculate the new
hweight_inuse of the iocg to decide if this iocg can donate some
surplus vtime.

Thus move the usage ratio calculation to the correct place to
avoid unnecessary calculation for some vtime shortage iocgs.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

647c9f03 26-Nov-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-iocost: Remove unnecessary advance declaration

Remove unnecessary advance declaration of struct ioc_gq.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ba1add2 26-Nov-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-iocost: Fix some typos in comments

Fix some typos in comments.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

be1515ba 05-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
"Single fix for an issue with chunk_sectors and stacked devices"

* tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
block: use gcd() to fix chunk_sectors limit stacking


b3298500 04-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

- Fix DM's bio splitting changes that were made during v5.9. This
restores splitting in terms of varied per-target ti->max_io_len
rather than use block core's single stacked 'chunk_sectors' limit.

- Like DM crypt, update DM integrity to not use crypto drivers that
have CRYPTO_ALG_ALLOCATES_MEMORY set.

- Fix DM writecache target's argument parsing and status display.

- Remove needless BUG() from dm writecache's persistent_memory_claim()

- Remove old gcc workaround in DM cache target's block_div() for ARM
link errors now that gcc >= 4.9 is required.

- Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.

- Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
dm_table_event().

- Remove invalid sparse annotations from dm_prepare_ioctl() and
dm_unprepare_ioctl().

* tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: remove invalid sparse __acquires and __releases annotations
dm: fix double RCU unlock in dm_dax_zero_page_range() error path
dm: fix IO splitting
dm writecache: remove BUG() and fail gracefully instead
dm table: Remove BUG_ON(in_interrupt())
dm: fix bug with RCU locking in dm_blk_report_zones
Revert "dm cache: fix arm link errors with inline"
dm writecache: fix the maximum number of arguments
dm writecache: advance the number of arguments when reporting max_age
dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY


3ee16db3 30-Nov-2020 Mike Snitzer <snitzer@redhat.com>

dm: fix IO splitting

Commit 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account
for target-specific splitting") caused a couple regressions:
1) Using lcm_not_zero() when stacking chunk_sectors was a bug because
chunk_sectors must reflect the most limited of all devices in the
IO stack.
2) DM targets that set max_io_len but that do _not_ provide an
.iterate_devices method no longer had their IO split properly.

And commit 5091cdec56fa ("dm: change max_io_len() to use
blk_max_size_offset()") also caused a regression where DM no longer
supported varied (per target) IO splitting. The implication being the
potential for severely reduced performance for IO stacks that use a DM
target like dm-cache to hide performance limitations of a slower
device (e.g. one that requires 4K IO splitting).

Coming full circle: fix all these issues by discontinuing the stacking of
chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(),
adding an optional chunk_sectors override argument to blk_max_size_offset(),
and updating DM's max_io_len() to pass ti->max_io_len to its
blk_max_size_offset() call.

Passing in an optional chunk_sectors override to blk_max_size_offset()
allows for code reuse of block's centralized calculation for max IO
size based on provided offset and split boundary.
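
A simplified sketch of the helper with the optional override, assuming a
power-of-2 chunk_sectors for brevity (the real helper also handles the
non-power-of-2 case):

	static inline unsigned int blk_max_size_offset(struct request_queue *q,
						       sector_t offset,
						       unsigned int chunk_sectors)
	{
		/* chunk_sectors == 0 means "use the queue limit, if any" */
		if (!chunk_sectors) {
			if (q->limits.chunk_sectors)
				chunk_sectors = q->limits.chunk_sectors;
			else
				return q->limits.max_sectors;
		}
		/* sectors left until the next boundary after 'offset' */
		chunk_sectors -= offset & (chunk_sectors - 1);
		return min(q->limits.max_sectors, chunk_sectors);
	}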

Fixes: 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
Fixes: 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>

a54895fa 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue to argument request based tracepoints

The request_queue can trivially be derived from the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1c02fca6 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue argument to the block_bio_remap tracepoint

The request_queue can trivially be derived from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eb6f7f7c 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue argument to the block_split tracepoint

The request_queue can trivially be derived from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e8a676d6 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: simplify and extend the block_bio_merge tracepoint class

The block_bio_merge tracepoint class can be reused for most bio-based
tracepoints. For that it just needs to lose the superfluous q and rq
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

acaf523a 25-Nov-2020 Yu Kuai <yukuai3@huawei.com>

blk-throttle: don't check whether or not lower limit is valid if CONFIG_BLK_DEV_THROTTLING_LOW is off

blk_throtl_update_limit_valid() will search for descendants to see if
'LIMIT_LOW' of bps/iops and READ/WRITE is nonzero. However, they're always
zero if CONFIG_BLK_DEV_THROTTLING_LOW is not set; furthermore, a lot of
time is wasted iterating the descendants.

Thus do nothing in blk_throtl_update_limit_valid() in that situation.
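
One way to sketch this, assuming the usual pattern for config-dependent
helpers (the real patch may differ in detail):

	#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
	static void blk_throtl_update_limit_valid(struct throtl_data *td)
	{
		/* existing behaviour: walk descendants, check LIMIT_LOW bps/iops */
	}
	#else
	static inline void blk_throtl_update_limit_valid(struct throtl_data *td)
	{
		/* LIMIT_LOW is always zero when THROTTLING_LOW is off: nothing to do */
	}
	#endif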

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b0d97557 02-Dec-2020 Jeffle Xu <jefflexu@linux.alibaba.com>

block: fix inflight statistics of part0

The inflight count of partition 0 doesn't include inflight IOs to all
sub-partitions, since currently mq calculates the inflight of a specific
partition by simply comparing the value of the partition pointer.

Thus the following case is possible:

$ cat /sys/block/vda/inflight
       0        0
$ cat /sys/block/vda/vda1/inflight
       0      128

While a single-queue device (on a previous kernel version, e.g. v3.10)
does not have this issue:

$cat /sys/block/sda/sda3/inflight
0 33
$cat /sys/block/sda/inflight
0 33

Partition 0 should be specially handled since it represents the whole
disk. This issue is introduced since commit bf0ddaba65dd ("blk-mq: fix
sysfs inflight counter").

Besides, this patch also fixes the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
didn't include those of the sub-partitions. (I have marked the 'inflight'
field with asterisks.)

$cat /proc/diskstats
259 0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
259 2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0

This is introduced since commit f299b7c7a9de ("blk-mq: provide internal
in-flight variant").

Fixes: bf0ddaba65dd ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22b56c29 24-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

bio: optimise bvec iteration

__bio_for_each_bvec(), __bio_for_each_segment() and bio_copy_data_iter()
fall under conditions of bvec_iter_advance_single(), which is a faster
and slimmer version of bvec_iter_advance(). Add
bio_advance_iter_single() and convert them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

977115c0 26-Nov-2020 Christoph Hellwig <hch@lst.de>

block: stop using bdget_disk for partition 0

We can just dereference the pointer in struct gendisk instead. Also
remove the now unused export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0d02129e 27-Nov-2020 Christoph Hellwig <hch@lst.de>

block: merge struct block_device and struct hd_struct

Instead of having two structures that represent each block device with
different life time rules, merge them into a single one. This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ad1eaa53 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: switch disk_part_iter_* to use a struct block_device

Switch the partition iter infrastructure to iterate over block_device
references instead of hd_struct ones mostly used to get at the
block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

71773cf7 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: pass a block_device to invalidate_partition

Pass the block_device actually needed instead of looking it up using
bdget_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9fc995a6 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: pass a block_device to blk_alloc_devt

Pass the block_device actually needed instead of the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

41e5c819 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove the partno field from struct hd_struct

Just use the bd_partno field in struct block_device everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8446fe92 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: switch partition lookup to use struct block_device

Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cb8432d6 26-Nov-2020 Christoph Hellwig <hch@lst.de>

block: allocate struct hd_struct as part of struct bdev_inode

Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.

Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct. The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

83950d35 23-Nov-2020 Christoph Hellwig <hch@lst.de>

block: move the policy field to struct block_device

Move the policy field to struct block_device and rename it to the
more descriptive bd_read_only. Also turn the field into a bool as it
is used as such.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b309e993 23-Nov-2020 Christoph Hellwig <hch@lst.de>

block: move make_it_fail to struct block_device

Move the make_it_fail flag to struct block_device and turn it into a bool
in preparation of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1bdd5ae0 23-Nov-2020 Christoph Hellwig <hch@lst.de>

block: move holder_dir to struct block_device

Move the holder_dir field to struct block_device in preparation for
killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

231926db 23-Nov-2020 Christoph Hellwig <hch@lst.de>

block: move the partition_meta_info to struct block_device

Move the partition_meta_info to struct block_device in preparation for
killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29ff57c6 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: move the start_sect field to struct block_device

Move the start_sect field to struct block_device in preparation
of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

15e3d2c5 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: move disk stat accounting to struct block_device

Move the dkstats and stamp field to struct block_device in preparation
of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a782483c 26-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove the nr_sects field in struct hd_struct

Now that the hd_struct always has a block device attached to it, there is
no need for having two size fields that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes. By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22ae8ce8 26-Nov-2020 Christoph Hellwig <hch@lst.de>

block: simplify bdev/disk lookup in blkdev_get

To simplify block device lookup and a few other upcoming areas, make sure
that we always have a struct block_device available for each disk and
each partition, and only find existing block devices in bdget. The only
downside of this is that each device and partition uses a little more
memory. The upside will be that a lot of code can be simplified.

With that all we need to look up the block device is to lookup the inode
and do a few sanity checks on the gendisk, instead of the separate lookup
for the gendisk. For blk-cgroup which wants to access a gendisk without
opening it, a new blkdev_{get,put}_no_open low-level interface is added
to replace the previous get_gendisk use.

Note that the change to look up the block device directly instead of the
two-step lookup using struct gendisk causes a subtle change in behavior:
accessing a non-existing partition on an existing block device can now
cause a call to request_module. That call is harmless, and in practice
no recent system will access these nodes as they aren't created by udev
and static /dev/ setups are unusual.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4e7b5671 23-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove i_bdev

Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case). This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

efdc41c8 09-Nov-2020 Christoph Hellwig <hch@lst.de>

block: use put_device in put_disk

Use put_device to put the device instead of poking into the internals
and using kobject_put.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e79319af 09-Nov-2020 Christoph Hellwig <hch@lst.de>

block: use disk_part_iter_exit in disk_part_iter_next

Call disk_part_iter_exit in disk_part_iter_next instead of duplicating
the functionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3f50b95e 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove a superflous check in blkpg_do_ioctl

sector_t is now always a u64, so this check is not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

53ffabfd 09-Nov-2020 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

block: move blk_rq_bio_prep() to linux/blk-mq.h

This is a preparation patch to have minimal block layer request bio
append functionality in the context of the NVMeOF Passthru driver which
falls in the fast path and doesn't need calls from blk_rq_append_bio().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

7e7986f9 01-Dec-2020 Mike Snitzer <snitzer@redhat.com>

block: use gcd() to fix chunk_sectors limit stacking

commit 22ada802ede8 ("block: use lcm_not_zero() when stacking
chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must
reflect the most limited of all devices in the IO stack.

Otherwise malformed IO may result. E.g.: prior to this fix,
->chunk_sectors = lcm_not_zero(8, 128) would result in
blk_max_size_offset() splitting IO at 128 sectors rather than the
required more restrictive 8 sectors.

And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be
non-power-of-2") care must be taken to properly stack chunk_sectors to
be compatible with the possibility that a non-power-of-2 chunk_sectors
may be stacked. This is why gcd() is used instead of reverting back
to using min_not_zero().
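
A sketch of the stacking rule, roughly as it would read in
blk_stack_limits(); with the example above, gcd(8, 128) = 8 keeps the more
restrictive boundary, whereas lcm_not_zero(8, 128) = 128 lost it:

	/* stack chunk_sectors with gcd() so the combined boundary is
	 * compatible with every device below, even non-power-of-2 ones */
	if (b->chunk_sectors)
		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);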

Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
Fixes: 07d098e6bbad ("block: allow 'chunk_sectors' to be non-power-of-2")
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a20d073 29-Nov-2020 Lei Chen <lennychen@tencent.com>

block: wbt: Remove unnecessary invoking of wbt_update_limits in wbt_init

It's unnecessary to call wbt_update_limits explicitly within wbt_init,
because it will be called in the following function wbt_queue_depth_changed.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a787bdaf 27-Nov-2020 Ingo Molnar <mingo@kernel.org>

Merge branch 'linus' into sched/core, to resolve semantic conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>


545b8c8d 15-Jun-2020 Peter Zijlstra <peterz@infradead.org>

smp: Cleanup smp_call_function*()

Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

47a84653 11-Nov-2020 Eric Biggers <ebiggers@google.com>

block/keyslot-manager: prevent crash when num_slots=1

If there is only one keyslot, then blk_ksm_init() computes
slot_hashtable_size=1 and log_slot_ht_size=0. This causes
blk_ksm_find_keyslot() to crash later because it uses
hash_ptr(key, log_slot_ht_size) to find the hash bucket containing the
key, and hash_ptr() doesn't support the bits == 0 case.

Fix this by making the hash table always have at least 2 buckets.

Tested by running:

kvm-xfstests -c ext4 -g encrypt -m inlinecrypt \
-o blk-crypto-fallback.num_keyslots=1
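
The sizing fix can be sketched as follows (field names assumed from the
description; the actual keyslot-manager code may differ slightly):

	/* hash_ptr() needs a nonzero bit count, so never use fewer than
	 * 2 buckets even when there is only a single keyslot */
	slot_hashtable_size = roundup_pow_of_two(num_slots);
	if (slot_hashtable_size < 2)
		slot_hashtable_size = 2;
	ksm->log_slot_ht_size = ilog2(slot_hashtable_size);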

Fixes: 1b2628397058 ("block: Keyslot Manager for Inline Encryption")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

449f4ec9 16-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove the update_bdev parameter to set_capacity_revalidate_and_notify

The update_bdev argument is always set to true, so remove it. Also
rename the function to the slightly less verbose set_capacity_and_notify,
as propagating the disk size to the block device isn't really
revalidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e2b6b301 14-Nov-2020 Christoph Hellwig <hch@lst.de>

block: fix the kerneldoc comment for __register_blkdev

Switch the comment to talk about __register_blkdev instead of
register_blkdev and document the new probe parameter.

Fixes: 3da1a61e7046 ("block: add an optional probe callback to major_names")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e418de3a 29-Oct-2020 Christoph Hellwig <hch@lst.de>

block: switch gendisk lookup to a simple xarray

Now that bdev_map is only used for finding gendisks, we can use
a simple xarray instead of the regions tracking structure for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a160c615 29-Oct-2020 Christoph Hellwig <hch@lst.de>

block: add an optional probe callback to major_names

Add a callback to the major_names array that allows a driver to override
how to probe for dev_t that doesn't currently have a gendisk registered.
This will help separate the lookup of the gendisk by dev_t from the probe
action for a dev_t that is not currently registered.
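
A hypothetical usage sketch of the new hook; the driver name, MYDRV_MAJOR
and the probe body are illustrative, not taken from a real driver:

	static void mydrv_probe(dev_t devt)
	{
		/* create and add the gendisk for MINOR(devt) on first lookup
		 * of a dev_t that has no gendisk registered yet */
	}

	static int __init mydrv_init(void)
	{
		/* register_blkdev(major, name) becomes __register_blkdev(major,
		 * name, NULL); passing a probe callback opts into the new hook */
		return __register_blkdev(MYDRV_MAJOR, "mydrv", mydrv_probe);
	}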

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd8eff3b 29-Oct-2020 Christoph Hellwig <hch@lst.de>

block: rework requesting modules for unclaimed devices

Instead of reusing the ranges in bdev_map, add a new helper that is
called if no range was found. This is a first step to unpeel and
eventually remove the complex ranges structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e49fbbbf 29-Oct-2020 Christoph Hellwig <hch@lst.de>

block: split block_class_lock

Split the block_class_lock mutex into one each to protect bdev_map
and major_names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

62b508f8 29-Oct-2020 Christoph Hellwig <hch@lst.de>

block: open code kobj_map into in block/genhd.c

Copy and paste the kobj_map functionality into the block code in preparation
for completely rewriting it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6b3ba976 29-Oct-2020 Christoph Hellwig <hch@lst.de>

block: cleanup del_gendisk a bit

Merge three hidden gendisk checks into one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a7cb3d2f 03-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove __blkdev_driver_ioctl

Just open code it in the few callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

98f49b63 03-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove set_device_ro

Fold set_device_ro into its only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

732e12d8 03-Nov-2020 Christoph Hellwig <hch@lst.de>

block: don't call into the driver for BLKROSET

Now that all drivers that want to hook into setting or clearing the
read-only flag use the set_read_only method, this code can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e00adcad 03-Nov-2020 Christoph Hellwig <hch@lst.de>

block: add a new set_read_only method

Add a new method to allow for driver-specific processing when setting or
clearing the block device read-only state. This allows replacing the
cumbersome and error-prone override of the whole ioctl implementation.
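
A sketch of what the hook looks like and how the BLKROSET path might use it;
the ioctl-side handling is simplified and the bd_read_only field name is
borrowed from the related hd_struct removal series, so treat it as
illustrative:

	struct block_device_operations {
		/* ... existing methods ... */
		int (*set_read_only)(struct gendisk *disk, bool ro);
	};

	/* simplified BLKROSET handling: let the driver veto or react */
	if (disk->fops->set_read_only) {
		ret = disk->fops->set_read_only(disk, ro);
		if (ret)
			return ret;
	}
	bdev->bd_read_only = ro;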

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4a9d6d66 03-Nov-2020 Christoph Hellwig <hch@lst.de>

block: don't call into the driver for BLKFLSBUF

BLKFLSBUF is entirely contained in the block core, and there is no
good reason to give the driver a hook into processing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b7131ee0 14-Nov-2020 Christoph Hellwig <hch@lst.de>

blk-cgroup: fix a hd_struct leak in blkcg_fill_root_iostats

disk_get_part needs to be paired with a disk_put_part.

Cc: stable@vger.kernel.org
Fixes: ef45fe470e1 ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9f16a667 13-Nov-2020 Ming Lei <ming.lei@redhat.com>

block: mark flush request as IDLE when it is really finished

To avoid a use-after-free on the flush request, we call its .end_io() from
both the timeout code path and __blk_mq_end_request().

When the flush request's ref hasn't dropped to zero, it is still in use and
we can't mark it as IDLE; fix this by marking it IDLE only when its
refcount really drops to zero.

Fixes: 65ff5cd04551 ("blk-mq: mark flush request as IDLE in flush_end_io()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7e890c37 12-Nov-2020 Christoph Hellwig <hch@lst.de>

block: add a return value to set_capacity_revalidate_and_notify

Return if the function ended up sending an uevent or not.

Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e21ee5a6 30-Sep-2020 Hannes Reinecke <hare@suse.de>

scsi: block: Return status code in blk_mq_end_request()

blk_mq_end_request() will use the block status returned from queue_rq() as
argument, except in one instance in blk_mq_dispatch_rq_list(), where the
generic BLK_STS_IOERR is used.

Link: https://lore.kernel.org/r/20200930080256.90964-2-hare@suse.de
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

65ff5cd0 29-Oct-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: mark flush request as IDLE in flush_end_io()

Mark flush request as IDLE in its .end_io(), aligning it with how normal
requests behave. The flush request stays in in-flight tags if we're not
using an IO scheduler, so we need to change its state into IDLE.
Otherwise, we will hang in blk_mq_tagset_wait_completed_request() during
error recovery because the flush request's state is kept as COMPLETED.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4977d121 28-Oct-2020 Naohiro Aota <naohiro.aota@wdc.com>

block: advance iov_iter on bio_add_hw_page failure

When the bio's size reaches max_append_sectors, bio_add_hw_page returns
0 and __bio_iov_append_get_pages then returns -EINVAL. This is an expected
result of building a bio small enough not to be split in the IO path.
However, iov_iter is not advanced in this case, causing the same pages
to be filled into the bio again and again.

Fix the case by properly advancing the iov_iter for already processed
pages.

Fixes: 0512a75b98f8 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f255c19b 22-Oct-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

blk-cgroup: Pre-allocate tree node on blkg_conf_prep

Similarly to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq
outside request queue spinlock"), blkg_create can also trigger
occasional -ENOMEM failures at the radix insertion because any
allocation inside blkg_create has to be non-blocking, making it more
likely to fail. This causes trouble for userspace tools that try to
configure IO weights and need to deal with this condition.

This patch reduces the occurrence of -ENOMEMs on this path by preloading
the radix tree element on a GFP_KERNEL context, such that we guarantee
the later non-blocking insertion won't fail.

A similar solution exists in blkcg_init_queue for the same situation.
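
The shape of the fix, sketched with the locking details elided (the queue
lock name is assumed; treat this as illustrative):

	/* preload a radix tree node while we may still sleep ... */
	ret = radix_tree_preload(GFP_KERNEL);
	if (ret)
		return ret;

	spin_lock_irq(&q->queue_lock);
	/* ... so that the non-blocking radix_tree_insert() done inside
	 * blkg_create() can take its node from the per-CPU preload pool
	 * instead of failing with -ENOMEM */
	spin_unlock_irq(&q->queue_lock);

	radix_tree_preload_end();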

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

52abfcbd 22-Oct-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

blk-cgroup: Fix memleak on error path

If new_blkg allocation raced with blk_policy change and
blkg_lookup_check fails, new_blkg is leaked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d7691390 24-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Christoph
- rdma error handling fixes (Chao Leng)
- fc error handling and reconnect fixes (James Smart)
- fix the qid displace when tracing ioctl command (Keith Busch)
- don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
- fix MTDT for passthru (Logan Gunthorpe)
- blacklist Write Same on more devices (Kai-Heng Feng)
- fix an uninitialized work struct (zhenwei pi)"

- lightnvm out-of-bounds fix (Colin)

- SG allocation leak fix (Doug)

- rnbd fixes (Gioh, Guoqing, Jack)

- zone error translation fixes (Keith)

- kerneldoc markup fix (Mauro)

- zram lockdep fix (Peter)

- Kill unused io_context members (Yufen)

- NUMA memory allocation cleanup (Xianting)

- NBD config wakeup fix (Xiubo)

* tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
block: blk-mq: fix a kernel-doc markup
nvme-fc: shorten reconnect delay if possible for FC
nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
nvme-fc: fix error loop in create_hw_io_queues
nvme-fc: fix io timeout to abort I/O
null_blk: use zone status for max active/open
nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
nvmet: cleanup nvmet_passthru_map_sg()
nvmet: limit passthru MTDS by BIO_MAX_PAGES
nvmet: fix uninitialized work for zero kato
nvme-pci: disable Write Zeroes on Sandisk Skyhawk
nvme: use queuedata for nvme_req_qid
nvme-rdma: fix crash due to incorrect cqe
nvme-rdma: fix crash when connect rejected
block: remove unused members for io_context
blk-mq: remove the calling of local_memory_node()
zram: Fix __zram_bvec_{read,write}() locking order
skd_main: remove unused including <linux/version.h>
sgl_alloc_order: fix memory leak
lightnvm: fix out-of-bounds write to array devices->info[]
...


24f7bb88 23-Oct-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

block: blk-mq: fix a kernel-doc markup

Fix a typo:
blk_mq_run_hw_queue -> blk_mq_run_hw_queues

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

576e85c5 19-Oct-2020 Xianting Tian <tian.xianting@h3c.com>

blk-mq: remove the calling of local_memory_node()

We don't need to check whether the node is a memoryless NUMA node before
calling the allocator interface. SLUB (and SLAB, SLOB) relies on the page
allocator to pick a node. The page allocator deals with memoryless nodes
just fine: it has zonelists constructed for each possible node and will
automatically fall back to the node closest to the requested one, as long
as __GFP_THISNODE is not enforced, of course.

The code comments of kmem_cache_alloc_node() of SLAB also showed this:
* Fallback to other node is possible if __GFP_THISNODE is not set.

blk-mq code doesn't set __GFP_THISNODE, so we can remove the calling
of local_memory_node().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5cd3ddc1 09-Sep-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

docs: bio: fix a kerneldoc markup

Fix this warning:

./block/bio.c:1098: WARNING: Inline emphasis start-string without end-string.

The thing is that *iter is not a valid markup.

That seems to be a typo:
*iter -> @iter

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

5b874af6 26-Aug-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

block: bio: fix a warning at the kernel-doc markups

Using "@bio's parent" causes the following waring:
./block/bio.c:10: WARNING: Inline emphasis start-string without end-string.

The main problem here is that this would be converted into:

**bio**'s parent

by kernel-doc, which is not valid notation. It would be
possible to use, instead, this kernel-doc markup:

``bio's`` parent

Yet, here, it is probably simpler to just use alternative wording:

the parent of @bio

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

4815519e 14-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- Improve DM core's bio splitting to use blk_max_size_offset(). Also
fix bio splitting for bios that were deferred to the worker thread
due to a DM device being suspended.

- Remove DM core's special handling of NVMe devices now that block core
has internalized efficiencies drivers previously needed to be
concerned about (via now removed direct_make_request).

- Fix request-based DM to not bounce through indirect dm_submit_bio;
instead have block core make direct call to blk_mq_submit_bio().

- Various DM core cleanups to simplify and improve code.

- Update DM cryot to not use drivers that set
CRYPTO_ALG_ALLOCATES_MEMORY.

- Fix DM raid's raid1 and raid10 discard limits for the purposes of
linux-stable. But then remove DM raid's discard limits settings now
that MD raid can efficiently handle large discards.

- A couple small cleanups across various targets.

* tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix request-based DM to not bounce through indirect dm_submit_bio
dm: remove special-casing of bio-based immutable singleton target on NVMe
dm: export dm_copy_name_and_uuid
dm: fix comment in __dm_suspend()
dm: fold dm_process_bio() into dm_submit_bio()
dm: fix missing imposition of queue_limits from dm_wq_work() thread
dm snap persistent: simplify area_io()
dm thin metadata: Remove unused local variable when create thin and snap
dm raid: remove unnecessary discard limits for raid10
dm raid: fix discard limits for raid1 and raid10
dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
dm: use dm_table_get_device_name() where appropriate in targets
dm table: make 'struct dm_table' definition accessible to all of DM core
dm: eliminate need for start_io_acct() forward declaration
dm: simplify __process_abnormal_io()
dm: push use of on-stack flush_bio down to __send_empty_flush()
dm: optimize max_io_len() by inlining max_io_len_target_boundary()
dm: push md->immutable_target optimization down to __process_bio()
dm: change max_io_len() to use blk_max_size_offset()
dm table: stack 'chunk_sectors' limit to account for target-specific splitting


3b481d91 24-Sep-2020 Keith Busch <kbusch@kernel.org>

block: add zone specific block statuses

A zoned device with limited resources to open or activate zones may
return an error when the host exceeds those limits. The same command may
be successful if retried later, but the host needs to wait for specific
zone states before it should expect a retry to succeed. Have the block
layer provide an appropriate status for these conditions so applications
can distinguish this error for special handling.

Cc: linux-api@vger.kernel.org
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7cd4ecd9 13-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"Here are the driver updates for 5.10.

A few SCSI updates in here too, in coordination with Martin as they
depend on core block changes for the shared tag bitmap.

This contains:

- NVMe pull requests via Christoph:
- fix keep alive timer modification (Amit Engel)
- order the PCI ID list more sensibly (Andy Shevchenko)
- cleanup the open by controller helper (Chaitanya Kulkarni)
- use an xarray for the CSE log lookup (Chaitanya Kulkarni)
- support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
- fix nvme_ns_report_zones (Christoph Hellwig)
- add a sanity check to nvmet-fc (James Smart)
- fix interrupt allocation when too many polled queues are
specified (Jeffle Xu)
- small nvmet-tcp optimization (Mark Wunderlich)
- fix a controller refcount leak on init failure (Chaitanya
Kulkarni)
- misc cleanups (Chaitanya Kulkarni)
- major refactoring of the scanning code (Christoph Hellwig)

- MD updates via Song:
- Bug fixes in bitmap code, from Zhao Heming
- Fix a work queue check, from Guoqing Jiang
- Fix raid5 oops with reshape, from Song Liu
- Clean up unused code, from Jason Yan
- Discard improvements, from Xiao Ni
- raid5/6 page offset support, from Yufen Yu

- Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap,
Hannes)

- null_blk open/active zone limit support (Niklas)

- Set of bcache updates (Coly, Dongsheng, Qinglang)"

* tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
md/raid5: fix oops during stripe resizing
md/bitmap: fix memory leak of temporary bitmap
md: fix the checking of wrong work queue
md/bitmap: md_bitmap_get_counter returns wrong blocks
md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
md/raid0: remove unused function is_io_in_chunk_boundary()
nvme-core: remove extra condition for vwc
nvme-core: remove extra variable
nvme: remove nvme_identify_ns_list
nvme: refactor nvme_validate_ns
nvme: move nvme_validate_ns
nvme: query namespace identifiers before adding the namespace
nvme: revalidate zone bitmaps in nvme_update_ns_info
nvme: remove nvme_update_formats
nvme: update the known admin effects
nvme: set the queue limits in nvme_update_ns_info
nvme: remove the 0 lba_shift check in nvme_update_ns_info
nvme: clean up the check for too large logic block sizes
nvme: freeze the queue over ->lba_shift updates
nvme: factor out a nvme_configure_metadata helper
...


3ad11d7a 13-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

- Series of merge handling cleanups (Baolin, Christoph)

- Series of blk-throttle fixes and cleanups (Baolin)

- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)

- Removal of bdget() as a generic API (Christoph)

- Removal of blkdev_get() as a generic API (Christoph)

- Cleanup of is-partition checks (Christoph)

- Series reworking disk revalidation (Christoph)

- Series cleaning up bio flags (Christoph)

- bio crypt fixes (Eric)

- IO stats inflight tweak (Gabriel)

- blk-mq tags fixes (Hannes)

- Buffer invalidation fixes (Jan)

- Allow soft limits for zone append (Johannes)

- Shared tag set improvements (John, Kashyap)

- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

- DM no-wait support (Mike, Konstantin)

- Request allocation improvements (Ming)

- Allow md/dm/bcache to use IO stat helpers (Song)

- Series improving blk-iocost (Tejun)

- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...


85ed13e7 12-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull compat iovec cleanups from Al Viro:
"Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
security/keys: remove compat_keyctl_instantiate_key_iov
mm: remove compat_process_vm_{readv,writev}
fs: remove compat_sys_vmsplice
fs: remove the compat readv/writev syscalls
fs: remove various compat readv/writev helpers
iov_iter: transparently handle compat iovecs in import_iovec
iov_iter: refactor rw_copy_check_uvector and import_iovec
iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
compat.h: fix a spelling error in <linux/compat.h>


47ce030b 09-Oct-2020 Yang Yang <yang.yang@vivo.com>

blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue

blk_exit_queue will free elevator_data, while blk_mq_run_work_fn
will access it. Move cancel of hctx->run_work to the front of
blk_exit_queue to avoid use-after-free.

Fixes: 1b97871b501f ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c7281524 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

blk-mq: get rid of the dead flush handle code path

After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
for flush rq"), blk_mq_submit_bio() will call blk_insert_flush()
directly to handle flush request rather than blk_mq_sched_insert_request()
in the case of elevator.

Then, all flush requests either have the RQF_FLUSH_SEQ flag set when
blk_mq_sched_insert_request() is called, or have been inserted into
hctx->dispatch. So, remove the dead code path.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0546858c 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: get rid of unnecessary local variable

Since the whole elevator registration is protected by sysfs_lock, we
don't need the extra 'has_elevator'. Just use q->elevator directly.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f0c6ae09 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: fix comment and add lockdep assert

After commit b89f625e28d4 ("block: don't release queue's sysfs
lock during switching elevator"), whole elevator register and
unregister function are covered by sysfs_lock. So, remove wrong
comment and add lockdep assert.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0841031a 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

blk-mq: use helper function to test hw stopped

We have introduced the helper function blk_mq_hctx_stopped() to test
BLK_MQ_S_STOPPED, so use it instead of open coding the flag test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
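
For reference, a minimal sketch of the helper this switches to; the real
definition lives in the block layer's private blk-mq.h and details may differ:

	static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
	{
		/* true once the hardware queue has been stopped */
		return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
	}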

75e6c00f 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: use helper function to test queue register

We have defined the common interface blk_queue_registered() to
test QUEUE_FLAG_REGISTERED. Just use it.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
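
A hedged sketch of the interface referred to above; the actual macro in
include/linux/blkdev.h may differ slightly:

	#define blk_queue_registered(q)	\
		test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)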

6251b754 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: remove redundant mq check

elv_support_iosched() already checks queue_is_mq() for us. So, remove
the redundant check to clean up the code.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dd1c372d 08-Oct-2020 Yufen Yu <yuyufen@huawei.com>

block: invoke blk_mq_exit_sched no matter whether have .exit_sched

We will register debugfs for the scheduler whether or not it defines an
.exit_sched callback. So, blk_mq_exit_sched() is always needed to
unregister debugfs. Also, q->elevator should be set to NULL after
exiting the scheduler.

For now, since all registered schedulers define .exit_sched, this does
not cause any actual problem, but it is more reasonable to make this
change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

583090b1 08-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few fixes that should go into this release:

- NVMe controller error path reference fix (Chaitanya)

- Fix regression with IBM partitions on non-dasd devices (Christoph)

- Fix a missing clear in the compat CDROM packet structure (Peilin)"

* tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block:
partitions/ibm: fix non-DASD devices
nvme-core: put ctrl ref when module ref get fail
block/scsi-ioctl: Fix kernel-infoleak in scsi_put_cdrom_generic_arg()


f4ac712e 08-Oct-2020 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

block: ratelimit handle_bad_sector() message

syzbot is reporting an unkillable task [1], because the caller fails to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While we need to fix the caller, flooding the
console with handle_bad_sector() messages is unlikely to be useful.

[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
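
A minimal sketch of what rate limiting this message amounts to; the function
body and format string are illustrative rather than the exact patch:

	static void handle_bad_sector(struct bio *bio, sector_t maxsector)
	{
		char b[BDEVNAME_SIZE];

		/* ratelimited so a corrupted image cannot flood the console */
		pr_info_ratelimited("attempt to access beyond end of device\n"
				    "%s: rw=%d, want=%llu, limit=%llu\n",
				    bio_devname(bio, b), bio->bi_opf,
				    (unsigned long long)bio_end_sector(bio),
				    (unsigned long long)maxsector);
	}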

1da30f95 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Re-use the throtl_set_slice_end()

Re-use throtl_set_slice_end() to remove duplicate code.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29379674 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Open code __throtl_de/enqueue_tg()

The __throtl_de/enqueue_tg() functions are only called by
throtl_de/enqueue_tg(), thus we can just open code them to
make the code more readable.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2397611a 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Move service tree validation out of the throtl_rb_first()

The throtl_schedule_next_dispatch() will validate if the service queue
is empty before calling update_min_dispatch_time(), and the
update_min_dispatch_time() will call throtl_rb_first(), which will
validate service queue again.

Thus we can move the service queue validation out of the
throtl_rb_first() to remove the redundant validation in the fast path.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b7b609de 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Move the list operation after list validation

We should move the list operation after validation.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5b7048b8 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Fix IO hang for a corner case

The limit cannot scale up in throtl_adjusted_limit() if bps or iops is
set to 1, which will cause an IO hang when the low limit is enabled.
Thus we should treat 1 as an illegal value to avoid this issue.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b185efa7 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Avoid tracking latency if low limit is invalid

The IO latency tracking is only for LOW limit, so we should add a
validation to avoid redundant latency tracking if the LOW limit
is not valid.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7901601a 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Avoid getting the current time if tg->last_finish_time is 0

We only update tg->last_finish_time when the low limitation is
enabled, so we can move the tg->last_finish_time validation a little
forward to avoid getting an unnecessary current time stamp if the
low limitation is not enabled.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4247d9c8 07-Oct-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()

The throtl_downgrade_state() is always used to change to LIMIT_LOW
limitation, thus remove the latter meaningless parameter which
indicates the limitation index.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fa1c3eaf 27-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Remove redundant 'return' statement

Remove redundant 'return' statement for 'void' functions.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

681cc5e8 07-Oct-2020 Mike Snitzer <snitzer@redhat.com>

dm: fix request-based DM to not bounce through indirect dm_submit_bio

It is unnecessary to force request-based DM to call into bio-based
dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then
call blk_mq_submit_bio().

Fix this by establishing a request-based DM block_device_operations
(dm_rq_blk_dops, which doesn't have .submit_bio) and update
dm_setup_md_queue() to set md->disk->fops to it for
DM_TYPE_REQUEST_BASED.

Remove DM_TYPE_REQUEST_BASED conditional in dm_submit_bio and unexport
blk_mq_submit_bio.

Fixes: c62b37d96b6eb ("block: move ->make_request_fn to struct block_device_operations")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7370997d 07-Oct-2020 Christoph Hellwig <hch@lst.de>

partitions/ibm: fix non-DASD devices

Don't error out if the dasd_biodasdinfo symbol is not available.

Cc: stable@vger.kernel.org
Fixes: 26d7e28e3820 ("s390/dasd: remove ioctl_by_bdev calls")
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a926c7af 06-Oct-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

block: Consider only dispatched requests for inflight statistic

According to Documentation/block/stat.rst, inflight should not include
I/O requests that are in the queue but not yet dispatched to the device,
but blk-mq identifies as inflight any request that has a tag allocated,
which, for queues without elevator, happens at request allocation time
and before it is queued in the ctx (default case in blk_mq_submit_bio).

In addition, current behavior is different for queues with elevator from
queues without it, since for the former the driver tag is allocated at
dispatch time. A more precise approach would be to only consider
requests with state MQ_RQ_IN_FLIGHT.

This effectively reverts commit 6131837b1de6 ("blk-mq: count allocated
but not started requests in iostats inflight") to consolidate blk-mq
behavior with itself (elevator case) and with original documentation,
but it differs from the behavior used by the legacy path.

This version differs from v1 by using blk_mq_rq_state to access the
state attribute. Avoid using blk_mq_request_started, which was
suggested, since we don't want to include MQ_RQ_COMPLETE.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
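
A hedged sketch of the resulting accounting check, using the blk_mq_rq_state()
helper mentioned above; struct and field names follow block/blk-mq.c but may
differ in detail:

	static bool blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
					  struct request *rq, void *priv,
					  bool reserved)
	{
		struct mq_inflight *mi = priv;

		/* only count requests actually dispatched to the device */
		if (rq->part == mi->part &&
		    blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
			mi->inflight[rq_data_dir(rq)]++;

		return true;
	}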

eda5cc99 06-Oct-2020 Christoph Hellwig <hch@lst.de>

block: move blk_mq_sched_try_merge to blk-merge.c

Move blk_mq_sched_try_merge to blk-merge.c, which allows marking
a lot of the merge infrastructure static there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d59da419 06-Oct-2020 Christoph Hellwig <hch@lst.de>

block: remove the unused blk_integrity_merge_bio export

Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

92cf2fd1 06-Oct-2020 Christoph Hellwig <hch@lst.de>

block: remove the unused blk_integrity_merge_rq export

Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cf785af1 15-Sep-2020 Eric Biggers <ebiggers@google.com>

block: warn if !__GFP_DIRECT_RECLAIM in bio_crypt_set_ctx()

bio_crypt_set_ctx() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

For now this assumption is still fine, since no callers violate it.
Making bio_crypt_set_ctx() able to fail would add unneeded complexity.

However, if a caller didn't use __GFP_DIRECT_RECLAIM, it would be very
hard to notice the bug. Make it easier by adding a WARN_ON_ONCE().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

93f221ae 15-Sep-2020 Eric Biggers <ebiggers@google.com>

block: make blk_crypto_rq_bio_prep() able to fail

blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c.

This case isn't currently reachable with a bio that actually has an
encryption context. However, it's fragile to rely on this. Just make
blk_crypto_rq_bio_prep() able to fail.

Suggested-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07560151 15-Sep-2020 Eric Biggers <ebiggers@google.com>

block: make bio_crypt_clone() able to fail

bio_crypt_clone() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, bio_crypt_clone() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
kcryptd_io_read() in drivers/md/dm-crypt.c.

Neither case is currently reachable with a bio that actually has an
encryption context. However, it's fragile to rely on this. Just make
bio_crypt_clone() able to fail, analogous to bio_integrity_clone().

Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

10ed1666 25-Sep-2020 Christoph Hellwig <hch@lst.de>

block: add a bdget_part helper

All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct. Add
a helper just for that and then mark bdget static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
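
A minimal sketch of such a helper, assuming it simply wraps bdget() with the
partition's dev_t:

	struct block_device *bdget_part(struct hd_struct *part)
	{
		/* look up (or create) the block_device for this partition */
		return bdget(part_devt(part));
	}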

89cd35c5 24-Sep-2020 Christoph Hellwig <hch@lst.de>

iov_iter: transparently handle compat iovecs in import_iovec

Use import_iovec() in compat syscalls to import either native or compat
iovecs, and remove the now superfluous compat_import_iovec().

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

f5ace5ef 02-Oct-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

block: scsi_ioctl: Avoid the use of one-element arrays

One-element arrays are being deprecated[1]. Replace the one-element array
with a simple object of type compat_caddr_t: 'compat_caddr_t unused'[2],
since it seems this field is actually never used.

Also, update struct cdrom_generic_command in the UAPI by adding an
anonymous union to avoid using the one-element array _reserved_.

[1] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays
[2] https://github.com/KSPP/linux/issues/86

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/lkml/5f76f5d0.qJ4t%2FHWuRzSW7bTa%25lkp@intel.com/
Build-tested-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f016a540 02-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.9-2020-10-02' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
"Single fix for a ->commit_rqs failure case"

* tag 'block-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
blk-mq: call commit_rqs while list empty but error happen


6d53a9fe 02-Oct-2020 Peilin Ye <yepeilin.cs@gmail.com>

block/scsi-ioctl: Fix kernel-infoleak in scsi_put_cdrom_generic_arg()

scsi_put_cdrom_generic_arg() is copying uninitialized stack memory to
userspace, since the compiler may leave a 3-byte hole in the middle of
`cgc32`. Fix it by adding a padding field to `struct
compat_cdrom_generic_command`.

Cc: stable@vger.kernel.org
Fixes: f3ee6e63a9df ("compat_ioctl: move CDROM_SEND_PACKET handling into scsi")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: syzbot+85433a479a646a064ab3@syzkaller.appspotmail.com
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1471308f 29-Sep-2020 Mike Snitzer <snitzer@redhat.com>

Merge remote-tracking branch 'jens/for-5.10/block' into dm-5.10

DM depends on these block 5.10 commits:

22ada802ede8 block: use lcm_not_zero() when stacking chunk_sectors
07d098e6bbad block: allow 'chunk_sectors' to be non-power-of-2
021a24460dc2 block: add QUEUE_FLAG_NOWAIT
6abc49468eea dm: add support for REQ_NOWAIT and enable it for linear target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


76cffccd 18-Sep-2020 yangerkun <yangerkun@huawei.com>

block-mq: fix comments in blk_mq_queue_tag_busy_iter

Commit f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter") introduced a bug where we may sleep inside
an RCU read-side critical section. Then commit 530ca2c9bd69 ("blk-mq:
Allow blocking queue tag iter callbacks") fixed it by taking a reference
on the request_queue. And commit a9a808084d6a ("block: Remove the
synchronize_rcu() call from __blk_mq_update_nr_hw_queues()") removed the
synchronize_rcu() in __blk_mq_update_nr_hw_queues(). Update the now
confusing comments in blk_mq_queue_tag_busy_iter() accordingly.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

632bfb63 05-Sep-2020 yangerkun <yangerkun@huawei.com>

blk-mq: call commit_rqs while list empty but error happen

Blk-mq should call commit_rqs once 'bd.last != true' and no more
requests will come (so that virtscsi can kick the virtqueue, for example).
We already do that in blk_mq_dispatch_rq_list/blk_mq_try_issue_list_directly
while the list is not empty and 'queued > 0'. However, the same situation
can occur when the last request in the list calls queue_rq and returns an
error like BLK_STS_IOERR, which does not requeue the request; the list ends
up empty but commit_rqs still needs to be called (otherwise the virtscsi
request will stall and time out until another request kicks the virtqueue).

We found this problem by running an fsstress test while repeatedly
offlining/onlining a virtscsi device.

Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8229cca8 25-Sep-2020 Xianting Tian <tian.xianting@h3c.com>

blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

We found that blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and analysis are as below.

Debug code,
1, blk_mq_alloc_rq_maps():
	u64 start, end;
	depth = set->queue_depth;
	start = ktime_get_ns();
	pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
	       current->pid, current->comm, current->nvcsw, current->nivcsw,
	       set->queue_depth, set->nr_hw_queues);
	do {
		err = __blk_mq_alloc_rq_maps(set);
		if (!err)
			break;

		set->queue_depth >>= 1;
		if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
			err = -ENOMEM;
			break;
		}
	} while (set->queue_depth);
	end = ktime_get_ns();
	pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
	       current->pid, current->comm,
	       current->nvcsw, current->nivcsw, end - start);

2, __blk_mq_alloc_rq_maps():
	u64 start, end;
	for (i = 0; i < set->nr_hw_queues; i++) {
		start = ktime_get_ns();
		if (!__blk_mq_alloc_rq_map(set, i))
			goto out_unwind;
		end = ktime_get_ns();
		pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
	}

Testing nvme hot-plugging with the above debug code, we found it totally costs
more than 3ms in kernel space, without being scheduled out, when allocating
rqs for all 16 hw queues with depth 1023; each hw queue costs about 140-250us.
The cost increases as the hw queue number and queue depth grow. And in an
extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will retry with
"queue_depth >>= 1", consuming even more time.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1

So use this patch to trigger a reschedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
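
A minimal sketch of the resulting loop; the unwind path is abbreviated and may
differ from the actual function:

	static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
	{
		int i;

		for (i = 0; i < set->nr_hw_queues; i++) {
			if (!__blk_mq_alloc_rq_map(set, i))
				goto out_unwind;
			/* yield between hw queue inits so other threads are not starved */
			cond_resched();
		}

		return 0;

	out_unwind:
		while (--i >= 0)
			blk_mq_free_rq_map(set->tags[i]);

		return -ENOMEM;
	}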

a1bffa48 26-Sep-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Three fixes: one in drivers (lpfc) and two for zoned block devices.

The latter also impinges on the block layer but only to introduce a
new block API for setting the zone model rather than fiddling with the
queue directly in the zoned block driver"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: sd_zbc: Fix ZBC disk initialization
scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported


bec02dbb 18-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: consider iocgs with active delays for debt forgiveness

An iocg may have 0 debt but non-zero delay. The current debt forgiveness
logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
iocg with a little bit of debt will have its delay canceled through debt
forgiveness, but one without any debt but with an active delay will have to
wait until its delay decays away.

This patch updates the debt handling logic so that it treats delays the same
as debts. If either debt or delay is active, debt forgiveness logic kicks in
and acts on both the same way.

Also, avoid turning the debt and delay directly to zero as that can confuse
state transitions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c5a6561b 17-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: add iocg_forgive_debt tracepoint

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c7af2a00 17-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: reimplement debt forgiveness using average usage

Debt forgiveness logic was counting the number of consecutive !busy periods
as the trigger condition. While this usually works, it can easily be thrown
off by temporary fluctuations especially on configurations w/ short periods.

This patch reimplements debt forgiveness so that:

* Use the average usage over the forgiveness period instead of counting
consecutive periods.

* Debt is reduced at around the target rate (1/2 every 100ms) regardless of
ioc period duration.

* Usage threshold is raised to 50%. Combined with the preceding changes and
the switch to average usage, this makes debt forgiveness a lot more
effective at reducing the amount of unnecessary idleness.

* Constants are renamed with DFGV_ prefix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d9517841 17-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: recalculate delay after debt reduction

Debt sets the initial delay duration which is decayed over time. The current
debt reduction halved the debt but didn't change the delay. It prevented
future debts from increasing delay but didn't do anything to lower the
existing delay, limiting the mechanism's ability to reduce unnecessary
idling.

Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
recalculates new delay value based on the reduced debt amount.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

33a1fe6d 17-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level one

Debt reduction was blocked if any iocg was short on budget in the past
period to avoid reducing debts while some iocgs are saturated. However, this
ends up unnecessarily blocking debt reduction due to temporary local
imbalances when the device is generally being underutilized, while also
failing to block when the underlying device is overwhelmed and the usage
becomes low from high latency.

Given that debt accumulation mostly happens with swapout bursts which can
significantly deteriorate the underlying device's latency response, the
current logic is not great.

Let's replace it with ioc->busy_level based condition so that we block debt
reduction when the underlying device is being saturated. ioc_forgive_debts()
call is moved after busy_level determination.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ab8df828 17-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: factor out ioc_forgive_debts()

Debt reduction logic is going to be improved and expanded. Factor it out
into ioc_forgive_debts() and generalize the comment a bit. No functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

021a2446 23-Sep-2020 Mike Snitzer <snitzer@redhat.com>

block: add QUEUE_FLAG_NOWAIT

Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.

Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.

Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
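
A hedged usage sketch for a bio-based driver whose submit path never blocks;
'disk' here is the driver's gendisk:

	/* advertise REQ_NOWAIT support on the driver's request queue */
	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);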

8a63a86e 02-Sep-2020 Christoph Hellwig <hch@lst.de>

block: use bd_partno in bdevname

No need to go through the hd_struct to find the partition number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fa01b1e9 02-Sep-2020 Christoph Hellwig <hch@lst.de>

block: add a bdev_is_partition helper

Add a little helper to make the somewhat arcane bd_contains checks a
little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
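
A minimal sketch of the helper, assuming it keys off bd_partno just as the
bdevname change above does:

	static inline bool bdev_is_partition(struct block_device *bdev)
	{
		/* partitions carry a non-zero partition number */
		return bdev->bd_partno != 0;
	}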

ac8f7a02 24-Sep-2020 Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.10/block' into for-5.10/drivers

* for-5.10/block: (140 commits)
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
bdi: invert BDI_CAP_NO_ACCT_WB
bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
mm: use SWP_SYNCHRONOUS_IO more intelligently
bdi: remove BDI_CAP_SYNCHRONOUS_IO
bdi: remove BDI_CAP_CGROUP_WRITEBACK
block: lift setting the readahead size into the block layer
md: update the optimal I/O size on reshape
bdi: initialize ->ra_pages and ->io_pages in bdi_init
aoe: set an optimal I/O size
bcache: inherit the optimal I/O size
drbd: remove dead code in device_to_statistics
fs: remove the unused SB_I_MULTIROOT flag
block: mark blkdev_get static
PM: mm: cleanup swsusp_swap_check
mm: split swap_type_of
PM: rewrite is_hibernate_resume_dev to not require an inode
mm: cleanup claim_swapfile
ocfs2: cleanup o2hb_region_dev_store
dasd: cleanup dasd_scan_partitions
...


1cb039f3 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag

The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we can't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed7b6b4f 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: remove BDI_CAP_CGROUP_WRITEBACK

Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2e4cd57 24-Sep-2020 Christoph Hellwig <hch@lst.de>

block: lift setting the readahead size into the block layer

Drivers shouldn't really mess with the readahead size, as that is a VM
concept. Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk. Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors. To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

55b2598e 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: initialize ->ra_pages and ->io_pages in bdi_init

Set up a readahead size by default, as very few users have a good
reason to change it. This means coda, ecryptfs, and orangefs now
set up the values they were previously missing, while ubifs, mtd
and vboxsf manually set it to 0 to avoid readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

47816282 21-Sep-2020 Christoph Hellwig <hch@lst.de>

block: cleanup blkdev_bszset

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9301fe73 21-Sep-2020 Christoph Hellwig <hch@lst.de>

block: cleanup partition scanning in register_disk

Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

38430f08 21-Sep-2020 Christoph Hellwig <hch@lst.de>

block: move the NEED_PART_SCAN flag to struct gendisk

We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07d098e6 21-Sep-2020 Mike Snitzer <snitzer@redhat.com>

block: allow 'chunk_sectors' to be non-power-of-2

It is possible, albeit unlikely, for a block device to have a non
power-of-2 chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K). This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches the non power-of-2
io_opt of the underlying RAID6 -- resulting in the stacked device's
chunk_sectors being a non power-of-2.

Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22ada802 21-Sep-2020 Mike Snitzer <snitzer@redhat.com>

block: use lcm_not_zero() when stacking chunk_sectors

Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.

Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
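
A hedged sketch of the stacking rule described above, as it might look inside
blk_stack_limits(); 't' and 'b' are the top and bottom limits, and the
misaligned handling is illustrative:

	t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);

	/* a chunk size that is not a multiple of the physical block size
	 * indicates a driver bug, so flag the stacked device as misaligned */
	if ((t->chunk_sectors << 9) & (t->physical_block_size - 1)) {
		t->chunk_sectors = 0;
		t->misaligned = 1;
		ret = -1;
	}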

03859717 23-Sep-2020 Christoph Hellwig <hch@lst.de>

block: fix bmd->is_null_mapped initialization

bmd is allocated using kmalloc in bio_alloc_map_data, so make sure
is_null_mapped is properly initialized to false for the !null_mapped
case.

Fixes: f3256075ba49 ("block: remove the BIO_NULL_MAPPED flag")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f952eefe 20-Sep-2020 Julia Lawall <Julia.Lawall@inria.fr>

block: drop double zeroing

sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.

The semantic patch that makes this change is as follows
(http://coccinelle.lip6.fr/):

// <smpl>
@@
expression x;
@@

x =
- kzalloc
+ kmalloc
(...)
...
sg_init_table(x,...)
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
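
In C terms, the transformation the semantic patch performs looks roughly like
this; the sgl/nents names are illustrative:

	/* before: the table is zeroed twice, by kzalloc() and by sg_init_table() */
	sgl = kzalloc(nents * sizeof(*sgl), GFP_KERNEL);
	if (!sgl)
		return -ENOMEM;
	sg_init_table(sgl, nents);

	/* after: kmalloc() suffices because sg_init_table() already zeroes the table */
	sgl = kmalloc(nents * sizeof(*sgl), GFP_KERNEL);
	if (!sgl)
		return -ENOMEM;
	sg_init_table(sgl, nents);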

27ba3e8f 15-Sep-2020 Damien Le Moal <damien.lemoal@wdc.com>

scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks

When CONFIG_BLK_DEV_ZONED is disabled, allow using host-aware ZBC disks as
regular disks. In this case, ensure that command completion is correctly
executed by changing sd_zbc_complete() to return good_bytes instead of 0,
which was causing a hang during device probe (endless retries).

When CONFIG_BLK_DEV_ZONED is enabled and a host-aware disk is detected to
have partitions, it will be used as a regular disk. In this case, make sure
to not do anything in sd_zbc_revalidate_zones() as that triggers warnings.

Since all these different cases result in subtle settings of the disk queue
zoned model, introduce the block layer helper function
blk_queue_set_zoned() to generically implement setting up the effective
zoned model according to the disk type, the presence of partitions on the
disk and CONFIG_BLK_DEV_ZONED configuration.

Link: https://lore.kernel.org/r/20200915073347.832424-2-damien.lemoal@wdc.com
Fixes: b72053072c0b ("block: allow partitions on host aware zone devices")
Cc: <stable@vger.kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

87fbeb88 07-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited

There is no need to check the bps or iops limitation if bps or iops is unlimited.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
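
A hedged sketch of the early return this adds, assuming tg_bps_limit() and
tg_iops_limit() report U64_MAX/UINT_MAX when no limit is configured:

	u64 bps_limit = tg_bps_limit(tg, rw);
	unsigned int iops_limit = tg_iops_limit(tg, rw);

	/* nothing to account or wait for when both limits are unlimited */
	if (bps_limit == U64_MAX && iops_limit == UINT_MAX) {
		if (wait)
			*wait = 0;
		return true;
	}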

4599ea49 07-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Avoid calculating bps/iops limitation repeatedly

The tg_may_dispatch() will call tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check if we can dispatch a bio or
not, which will calculate bps/iops limitation multiple times.
But tg_may_dispatch() is always called under queue lock, which
means the bps/iops limitation will not change in tg_may_dispatch().

So we can calculate the bps/iops limitation only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating bps/iops limitation repeatedly.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e675df2a 07-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Define readable macros instead of static variables

The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, so it is better to use readable macros instead of static
variables, which also saves some space in the .bss area.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
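
A minimal sketch of the change, assuming the historical values of 8 and 32:

	/* was: static int throtl_grp_quantum = 8; static int throtl_quantum = 32; */
	#define THROTL_GRP_QUANTUM	8
	#define THROTL_QUANTUM		32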

ff8b22c0 07-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Use readable READ/WRITE macros

Use readable READ/WRITE macros instead of magic numbers.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b53b072c 07-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-throttle: Fix some comments' typos

Fix some comments' typos.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aa67db24 14-Sep-2020 Tejun Heo <tj@kernel.org>

iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()

adjust_inuse_and_calc_cost() is responsible for reducing the amount of
donated weights dynamically in period as the budget runs low. Because we
don't want to do full donation calculation in period, we keep latching up
inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
resulting hweight_inuse is satisfactory.

Unfortunately, the adj_step calculation was reading the active weight before
acquiring ioc->lock. Because the current thread could have lost race to
activate the iocg to another thread before entering this function, it may
read the active weight as zero before acquiring ioc->lock. When this
happens, the adj_step is calculated as zero and the incremental adjustment
loop becomes an infinite one.

Fix it by fetching the active weight after acquiring ioc->lock.

Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

769b628d 11-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: fix divide-by-zero in transfer_surpluses()

Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but
all hweight calculations round up and thus it may end up >= WEIGHT_ONE
triggering divide-by-zero and other issues. Bound the value to avoid
surprises.

Fixes: e08d02aa5fc9 ("blk-iocost: implement Andy's method for donation weight updates")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b26410b 31-Aug-2020 Song Liu <songliubraving@fb.com>

block: introduce part_[begin|end]_io_acct

These functions can be used to enable iostat for partitions on devices
like md, bcache.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b8731d9 11-Sep-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Fix a regression in bdev partition locking (Christoph)

- NVMe pull request from Christoph:
- cancel async events before freeing them (David Milburn)
- revert a broken race fix (James Smart)
- fix command processing during resets (Sagi Grimberg)

- Fix a kyber crash with requeued flushes (Omar)

- Fix __bio_try_merge_page() same_page error for no merging (Ritesh)

* tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
block: Set same_page to false in __bio_try_merge_page if ret is false
nvme-fabrics: allow to queue requests for live queues
block: only call sched requeue_request() for scheduled requests
nvme-tcp: cancel async events before freeing event struct
nvme-rdma: cancel async events before freeing event struct
nvme-fc: cancel async events before freeing event struct
nvme: Revert: Fix controller creation races with teardown flow
block: restore a specific error code in bdev_del_partition


28500850 11-Sep-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: always allow reserved allocation in hctx_may_queue

NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.

Tags can be reserved in these tagsets. Before error recovery, there are
often lots of in-flight requests which can't be completed, and a new
reserved request may be needed in the error recovery path. However,
hctx_may_queue() can always return false because there are too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

84ed2573 11-Sep-2020 Tian Tao <tiantao6@hisilicon.com>

block: remove duplicate include statement in scsi_ioctl.c

scsi/sg.h is included more than once; remove the include that isn't
necessary.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

192f1c6b 10-Sep-2020 Xianting Tian <tian.xianting@h3c.com>

blkcg: add plugging support for punt bio

The test and the explanation of the patch are as below.

Before the test we added more debug code in blkg_async_bio_workfn():
	int count = 0;
	if (bios.head && bios.head->bi_next) {
		need_plug = true;
		blk_start_plug(&plug);
	}
	while ((bio = bio_list_pop(&bios))) {
		/* io_punt is a sysctl user interface to control the print */
		if (io_punt) {
			printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
			       current->comm, current->pid, bio->bi_iter.bi_sector,
			       (bio->bi_iter.bi_size)>>9, count++, need_plug);
		}
		submit_bio(bio);
	}
	if (need_plug)
		blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:
mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/cg3
echo "+io" > /cgroup2/cgroup.subtree_control
echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
echo $$ > /cgroup2/cg3/cgroup.procs

Then use dd command to test btrfs PUNT io in current shell:
dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:
[root@localhost btrfs]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel

With above debug code, test command and test environment, I did the
tests under 3 different system loads, which are triggered by stress:
1, Run 64 threads by command "stress -c 64 &"
[53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
[53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
[53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
[53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
[53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
[53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
... ...
[53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
[53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
[53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
[53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
[53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
[53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
[53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"
[50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
[50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
[50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
[50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
[50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
[50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
... ...
[50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
[50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
[50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
[50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
[50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
[50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
[50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
[50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Don't run thread by stress
[50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
[50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
[50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
[50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
[50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
[50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
[50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
[50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
[50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
[50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
[50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
[50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
[50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
[50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
[50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
[50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
[50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
[50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
[50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
[50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
[50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of the above 3 test results with different system loads:
From the above tests, we can see that more and more continuous bios can be
plugged as the system load increases. When running "stress -c 64 &", 310
continuous bios are plugged; when running "stress -c 32 &", 260 continuous
bios are plugged; when stress is not run, at most only 2 continuous bios are
plugged, and in most cases bio_list contains only a single bio.

How to explain the above phenomenon:
We know that in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will
queue a work item on the blkcg_punt_bio_wq workqueue. But when that
workqueue gets scheduled depends on the system load. When the system load
is low, the workqueue will be scheduled quickly, and the bio in bio_list
will be quickly processed in blkg_async_bio_workfn(), so there is less
chance that the same io submit thread can add multiple continuous bios to
bio_list before the workqueue is scheduled to run. This analysis aligns
with test "3" above.
When the system load is high, there is some delay before the workqueue can
be scheduled to run; the higher the system load, the greater the delay.
So there is more chance that the same io submit thread can add multiple
continuous bios to bio_list. Then when the workqueue is scheduled to run,
there are more continuous bios in bio_list, which will be processed in
blkg_async_bio_workfn(). This analysis aligns with tests "1" and "2" above.

According to the tests, IO performance is improved with this patch,
especially when the system load is higher. Another optimization is to use
the plug only when bio_list contains at least 2 bios.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

95f6f3a4 08-Sep-2020 Christoph Hellwig <hch@lst.de>

block: add a bdev_check_media_change helper

Like check_disk_changed, except that it does not call ->revalidate_disk
but leaves that to the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
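
A hedged usage sketch from a driver's ->open() path; my_revalidate_disk() is
a hypothetical driver callback:

	if (bdev_check_media_change(bdev)) {
		/* media changed: the caller decides how to revalidate */
		my_revalidate_disk(bdev->bd_disk);
	}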

2cd896a5 08-Sep-2020 Ritesh Harjani <riteshh@linux.ibm.com>

block: Set same_page to false in __bio_try_merge_page if ret is false

If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and so we are anyway
not merging this page into this bio, then it makes sense to also set
same_page to false before returning.

Without this patch, we hit the below WARNING in iomap.
This mostly happens on very large memory systems and/or after tweaking the
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
Call Trace:
__remove_mapping+0x154/0x320 (unreliable)
iomap_releasepage+0x80/0x180
try_to_release_page+0x94/0xe0
invalidate_inode_page+0xc8/0x110
invalidate_mapping_pages+0x1dc/0x540
generic_fadvise+0x3c8/0x450
xfs_file_fadvise+0x2c/0xe0 [xfs]
vfs_fadvise+0x3c/0x60
ksys_fadvise64_64+0x68/0xe0
sys_fadvise64+0x28/0x40
system_call_exception+0xf8/0x1c0
system_call_common+0xf0/0x278

Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e8a8a185 08-Sep-2020 Omar Sandoval <osandov@fb.com>

block: only call sched requeue_request() for scheduled requests

Yang Yang reported the following crash caused by requeueing a flush
request in Kyber:

[ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
...
[ 2.517468] pc : clear_bit+0x18/0x2c
[ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
[ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
...
[ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
[ 2.517602] Call trace:
[ 2.517606] clear_bit+0x18/0x2c
[ 2.517619] kyber_finish_request+0x74/0x80
[ 2.517627] blk_mq_requeue_request+0x3c/0xc0
[ 2.517637] __scsi_queue_insert+0x11c/0x148
[ 2.517640] scsi_softirq_done+0x114/0x130
[ 2.517643] blk_done_softirq+0x7c/0xb0
[ 2.517651] __do_softirq+0x208/0x3bc
[ 2.517657] run_ksoftirqd+0x34/0x60
[ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
[ 2.517667] kthread+0x110/0x120
[ 2.517669] ret_from_fork+0x10/0x18

This happens because Kyber doesn't track flush requests, so
kyber_finish_request() reads a garbage domain token. Only call the
scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
the finish_request() hook in blk_mq_free_request()). Now that we're
handling it in blk-mq, also remove the check from BFQ.

Reported-by: Yang Yang <yang.yang@vivo.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
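
A hedged sketch of the guarded requeue hook described above; the exact shape
of blk_mq_sched_requeue_request() may differ:

	static inline void blk_mq_sched_requeue_request(struct request *rq)
	{
		struct request_queue *q = rq->q;
		struct elevator_queue *e = q->elevator;

		/* flush requests bypass the scheduler, so never hand them back to it */
		if ((rq->rq_flags & RQF_ELVPRIV) &&
		    e && e->type->ops.requeue_request)
			e->type->ops.requeue_request(rq);
	}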

fc93fe14 03-Sep-2020 Christoph Hellwig <hch@lst.de>

block: make QUEUE_SYSFS_BIT_FNS more useful

Switch to the naming used by the other entries so that we can use the
QUEUE_RW_ENTRY helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

35626147 03-Sep-2020 Christoph Hellwig <hch@lst.de>

block: add helper macros for queue sysfs entries

Add two helpers macros to avoid boilerplate code for the queue sysfs
entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

88ce2a53 08-Sep-2020 Christoph Hellwig <hch@lst.de>

block: restore a specific error code in bdev_del_partition

mdadm relies on the fact that deleting an invalid partition returns
-ENXIO or -ENOTTY to detect if a block device is a partition or a
whole device.

Fixes: 08fc1ab6d748 ("block: fix locking in bdev_del_partition")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ddfb8b0b 07-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Remove unused blk_mq_sched_free_hctx_data()

Now we usually free hctx->sched_data via e->type->ops.exit_hctx(),
and nothing uses the blk_mq_sched_free_hctx_data() function anymore.
Remove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

384d87ef 04-Sep-2020 Jan Kara <jack@suse.cz>

block: Do not discard buffers under a mounted filesystem

Discarding blocks and buffers under a mounted filesystem is hardly
anything an admin wants to do. Usually it will confuse the filesystem and
sometimes the loss of buffer_head state (including the b_private field) can
even cause crashes like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
__jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
kjournald2+0xbd/0x270 [jbd2]

So if we don't have block device open with O_EXCL already, claim the
block device while we truncate buffer cache. This makes sure any
exclusive block device user (such as filesystem) cannot operate on the
device while we are discarding buffer cache.

Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8075fc3b 04-Sep-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A bit larger than usual this week, mostly due to the NVMe fixes
arriving late for -rc3 and hence didn't make last weeks pull request.

- NVMe:
- instance leak and io boundary fixes from Keith
- fc locking fix from Christophe
- various tcp/rdma reset during traffic fixes from Sagi
- pci use-after-free fix from Tong
- tcp target null deref fix from Ziye

- Locking fix for partition removal (Christoph)

- Ensure bdi->io_pages is always set (me)

- Fixup for hd struct reference (Ming)

- Fix for zero length bvecs (Ming)

- Two small blk-iocost fixes (Tejun)"

* tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
block: allow for_each_bvec to support zero len bvec
blk-stat: make q->stats->lock irqsafe
blk-iocost: ioc_pd_free() shouldn't assume irq disabled
block: fix locking in bdev_del_partition
block: release disk reference in hd_struct_free_work
block: ensure bdi->io_pages is always initialized
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu


b445547e 19-Aug-2020 Kashyap Desai <kashyap.desai@broadcom.com>

blk-mq, elevator: Count requests per hctx to improve performance

High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.

It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.

The elevator callback .has_work for the mq-deadline and bfq schedulers
considers there to be pending work if there are any IOs on the request
queue, but it does not account for the hctx context.

Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.
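
For illustration, a minimal sketch of the counting idea (not the actual
patch; structure and function names are simplified):

#include <linux/atomic.h>

struct sketch_hctx_sched_data {
        atomic_t elevator_queued;       /* requests queued on this hctx */
};

static void sketch_insert_request(struct sketch_hctx_sched_data *sd)
{
        /* bump the per-hctx count when a request enters the scheduler */
        atomic_inc(&sd->elevator_queued);
}

static void sketch_dispatch_request(struct sketch_hctx_sched_data *sd)
{
        /* drop the count once a request leaves the scheduler */
        atomic_dec(&sd->elevator_queued);
}

static bool sketch_has_work(struct sketch_hctx_sched_data *sd)
{
        /* ->has_work can now bail out cheaply for an idle hctx */
        return atomic_read(&sd->elevator_queued) > 0;
}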

[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f1b49fdc 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap

When using a shared sbitmap, the number of active request queues per hctx
should no longer be relied on when judging how to share the tag bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgement based on that.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bccf5e26 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Record nr_active_requests per queue for when using shared sbitmap

The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue for when using a shared sbitmap, as it does
not consider that the tags are shared tags over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.

Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a0235d23 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Relocate hctx_may_queue()

blk-mq.h and blk-mq-tag.h include on each other, which is less than ideal.

Relocate hctx_may_queue() to blk-mq.h, as it is not really tag specific code.

In this way, we can drop the blk-mq-tag.h include of blk-mq.h

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

32bc15af 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Facilitate a shared sbitmap per tagset

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
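
As a hypothetical driver-side sketch (fields such as .ops and .numa_node,
and error handling, are omitted), opting in would look roughly like:

#include <linux/blk-mq.h>

static struct blk_mq_tag_set sketch_tag_set;

static int sketch_init_tag_set(unsigned int nr_hw_queues,
                               unsigned int hostwide_depth)
{
        sketch_tag_set.nr_hw_queues = nr_hw_queues;
        sketch_tag_set.queue_depth  = hostwide_depth;
        /* request one tag space shared across all hctx's */
        sketch_tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;
        return blk_mq_alloc_tag_set(&sketch_tag_set);
}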

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

222a5ae0 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Use pointers for blk_mq_tags bitmap tags

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1c0706a7 19-Aug-2020 John Garry <john.garry@huawei.com>

blk-mq: Pass flags for tag init/free

Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.

For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d063237 19-Aug-2020 Hannes Reinecke <hare@suse.de>

blk-mq: Free tags in blk_mq_init_tags() upon error

Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free them in that same function upon error, rather than in the callee
which inits the bitmap tags (blk_mq_init_bitmap_tags()).

[jpg: Split from an earlier patch with a new commit message]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

655ac300 19-Aug-2020 Hannes Reinecke <hare@suse.de>

blk-mq: Rename blk_mq_update_tag_set_depth()

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.

So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

[jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

51db1c37 19-Aug-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

BLK_MQ_F_TAG_SHARED actually means that tags are shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make the point explicitly.

[jpg: rebase a few times, add rnbd-clt.c change]

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b8086d3f 01-Sep-2020 Christoph Hellwig <hch@lst.de>

block: use revalidate_disk_size in set_capacity_revalidate_and_notify

Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method. So switch
to the helper that just updates the size instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f4ad06f2 01-Sep-2020 Christoph Hellwig <hch@lst.de>

block: rename bd_invalidated

Replace bd_invalidated with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

265600b7 01-Sep-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Remove a duplicative condition

Remove a duplicative condition to remove below cppcheck warnings:

"warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is
equivalent to '!A || B' [redundantCondition]"
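
A small userspace illustration (hypothetical variable names) that the two
forms agree for every combination of A and B:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        for (int a = 0; a <= 1; a++) {
                for (int b = 0; b <= 1; b++) {
                        bool redundant  = !a || (a && b);       /* '!A || (A && B)' */
                        bool simplified = !a || b;              /* '!A || B'        */
                        printf("A=%d B=%d -> %d %d\n", a, b, redundant, simplified);
                }
        }
        return 0;
}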

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8327cce5 01-Sep-2020 Ritika Srivastava <ritika.srivastava@oracle.com>

block: better deal with the delayed not supported case in blk_cloned_rq_check_limits

If a WRITE_ZERO/WRITE_SAME operation is not supported by the storage,
blk_cloned_rq_check_limits() will return an IO error, which will cause
device-mapper to fail the paths.

Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP.
BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the
paths.
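
A rough sketch of the intended check (simplified, not the exact hunk;
assumes the blk_queue_get_max_sectors() helper):

#include <linux/blkdev.h>

static blk_status_t sketch_check_limits(struct request_queue *q,
                                        struct request *rq)
{
        unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));

        if (blk_rq_sectors(rq) > max_sectors) {
                /* a zero limit means the operation isn't supported at all */
                if (max_sectors == 0)
                        return BLK_STS_NOTSUPP;  /* ignored by device-mapper */
                return BLK_STS_IOERR;
        }
        return BLK_STS_OK;
}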

Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

143d2600 01-Sep-2020 Ritika Srivastava <ritika.srivastava@oracle.com>

block: Return blk_status_t instead of errno codes

Replace returning legacy errno codes with blk_status_t in
blk_cloned_rq_check_limits().

Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9d3a39a5 24-Aug-2020 Khazhismel Kumykov <khazhy@google.com>

block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE

CAP_SYS_ADMIN is too broad, and ionice fits into CAP_SYS_NICE's grouping.

Retain CAP_SYS_ADMIN permission for backwards compatibility.
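
Sketched permission check implied by the description (not the exact patch):

#include <linux/capability.h>
#include <linux/errno.h>

static int sketch_may_set_rt_ioprio(void)
{
        /* CAP_SYS_NICE is now sufficient; CAP_SYS_ADMIN still works */
        if (capable(CAP_SYS_NICE) || capable(CAP_SYS_ADMIN))
                return 0;
        return -EPERM;
}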

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f0bf84a5 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: add three debug stat - cost.wait, indebt and indelay

These are really cheap to collect and can be useful in debugging iocost
behavior. Add them as debug stats for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

04603755 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: restore inuse update tracepoints

Update and restore the inuse update tracepoints.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ac33e91e 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: implement vtime loss compensation

When an iocg accumulates too much vtime or gets deactivated, we throw away
some vtime, which lowers the overall device utilization. As the exact amount
which is being thrown away is known, we can compensate by accelerating the
vrate accordingly so that the extra vtime generated in the current period
matches what got lost.

This significantly improves work conservation when involving high weight
cgroups with intermittent and bursty IO patterns.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dda1315f 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: halve debts if device stays idle

A low weight iocg can amass a large amount of debt, for example, when
anonymous memory gets reclaimed aggressively. If the system has a lot of
memory paired with a slow IO device, the debt can span multiple seconds or
more. If there are no other subsequent IO issuers, the in-debt iocg may end
up blocked paying its debt while the IO device is idle.

This patch implements a mechanism to protect against such pathological
cases. If the device has been sufficiently idle for a substantial amount of
time, the debts are halved. The criteria are on the conservative side as we
want to resolve the rare extreme cases without impacting regular operation
by forgiving debts too readily.
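
A minimal sketch of the mechanism (the threshold and names here are
invented purely for illustration):

#include <linux/types.h>
#include <linux/time64.h>

#define SKETCH_IDLE_PERIOD_US   (5 * USEC_PER_SEC)      /* hypothetical */

static void sketch_maybe_forgive_debt(u64 now_us, u64 last_active_us,
                                      u64 *abs_vdebt)
{
        /* device idle long enough: halve the accumulated debt */
        if (now_us - last_active_us >= SKETCH_IDLE_PERIOD_US)
                *abs_vdebt /= 2;
}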

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5160a5a5 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: implement delay adjustment hysteresis

Currently, iocost syncs the delay duration to the outstanding debt amount,
which seemed enough to protect the system from anon memory hogs. However,
that was mostly because the delay calculation was using hweight_inuse which
quickly converges towards zero under debt for delay duration calculation,
often punishing debtors overly harshly for longer than deserved.

The previous patch fixed the delay calculation and now the protection against
anonymous memory hogs isn't enough because the effect of delay is indirect
and non-linear and a huge amount of future debt can accumulate abruptly
while unthrottled.

This patch implements delay hysteresis so that delay is decayed
exponentially over time instead of getting cleared immediately as debt is
paid off. While the overall behavior is similar to the blk-cgroup
implementation used by blk-iolatency, a lot of the details are different and
due to the empirical nature of the mechanism, it's challenging to adapt the
mechanism for one controller without negatively impacting the other.

As the delay is gradually decayed now, there's no point in running it from
its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
the dedicated hrtimer is removed.
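
Rough sketch of exponential decay from the periodic timer (the decay
factor is chosen arbitrarily for illustration):

#include <linux/types.h>

static u64 sketch_decay_delay(u64 delay_ns)
{
        /* shed a quarter of the remaining delay on each period */
        return delay_ns - (delay_ns >> 2);
}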

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c421a3eb 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: revamp debt handling

Debt handling had several issues.

* How much inuse a debtor carries wasn't clearly defined. inuse would be
driven down over time from not issuing IOs but it'd be better to clamp it
to minimum immediately once in debt.

* How much can be paid off was determined by hweight_inuse. As inuse was
driven down, the payment amount would fall together regardless of the
debtor's active weight. This means that the debtors were punished harshly.

* ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
iocg_kick_delay().

This patch revamps debt handling so that

* Debt handling owns inuse for iocgs in debt and keeps them at zero.

* Payment amount is determined by hweight_active. This is more deterministic
and safer than hweight_inuse but still far from ideal in that it doesn't
factor in possible donations from other iocgs for debt payments. This
likely needs further improvements in the future.

* ioc_rqos_merge() now calls blkcg_schedule_throttle() as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b0853ab4 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: revamp in-period donation snapbacks

When the margin drops below the minimum on a donating iocg, donation is
immediately canceled in full. There are a couple shortcomings with the
current behavior.

* It's abrupt. A small temporary budget deficit can lead to a wide swing in
weight allocation and a large surplus.

* It's open coded in the issue path but not implemented for the merge path.
A series of merges at a low inuse can make the iocg incur debts and stall
incorrectly.

This patch reimplements in-period donation snapbacks so that

* inuse adjustment and cost calculations are factored into
adjust_inuse_and_calc_cost() which is called from both the issue and merge
paths.

* Snapbacks are more gradual. They occur in quarter steps.

* A snapback triggers if the margin goes below the low threshold and is
lower than the budget at the time of the last adjustment.

* For the above, __propagate_weights() stores the margin in
iocg->saved_margin. Move iocg->last_inuse storing together into
__propagate_weights() for consistency.

* Full snapback is guaranteed when there are waiters.

* With precise donation and gradual snapbacks, inuse adjustments are now a
lot more effective and the value of scaling inuse on weight changes isn't
clear. Removed inuse scaling from weight_update().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f1de2439 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: revamp donation amount determination

iocost has various safety nets to combat inuse adjustment calculation
inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
adjustment calculations are now accurate and we can make donation amount
determinations accurate too.

* Stop keeping track of past usage history and using the maximum. Act on the
immediate usage information.

* Remove donation constraints defined by SURPLUS_* constants. Donate
whatever isn't used.

* Determine the donation amount so that the iocg will end up with
MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
usage as the previous period. TARGET is set at 50% of period, which is the
previous maximum. This provides smooth convergence for most repetitive IO
patterns.

* Apply donation logic early at 20% budget. There's no risk in doing so as
the calculation is based on the delta between the current budget and the
target budget at the end of the coming period.

* Remove preemptive iocg activation for zero cost IOs. As donation can reach
near zero now, the mere activation doesn't provide any protection anymore.
In the unlikely case that this becomes a problem, the right solution is
assigning appropriate costs for such IOs.

This significantly improves the donation determination logic while also
simplifying it. Now all donations are immediate, exact and smooth.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e08d02aa 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: implement Andy's method for donation weight updates

iocost implements work conservation by reducing iocg->inuse and propagating
the adjustment upwards proportionally. However, while I knew the target
absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
out how to determine the iocg->inuse adjustment to achieve that and
approximated the adjustment by scaling iocg->inuse using the proportion of
the needed hweight_inuse changes.

When nested, these scalings aren't accurate even when adjusting a single
node as the donating node also receives the benefit of the donated portion.
When multiple nodes are donating as they often do, they can be wildly wrong.

iocost employed various safety nets to combat the inaccuracies. There are
ample buffers in determining how much to donate, the adjustments are
conservative and gradual. While it can achieve a reasonable level of work
conservation in simple scenarios, the inaccuracies can easily add up leading
to significant loss of total work. This in turn makes it difficult to
closely cap vrate as vrate adjustment is needed to compensate for the loss
of work. The combination of inaccurate donation calculations and vrate
adjustments can lead to wide fluctuations and clunky overall behaviors.

Andy Newell devised a method to calculate the needed ->inuse updates to
achieve the target hweight_inuse's. The method is compatible with the
proportional inuse adjustment propagation which allows all hot path
operations to be local to each iocg.

To roughly summarize, Andy's method divides the tree into donating and
non-donating parts, calculates global donation rate which is used to
determine the target hweight_inuse for each node, and then derives per-level
proportions. There's a non-trivial amount of math involved. Please refer to
the following pdfs for detailed descriptions.

https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN

This patch implements Andy's method in transfer_surpluses(). This makes the
donation calculations accurate per cycle and enables further improvements in
other parts of the donation logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

93f7d2db 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: restructure surplus donation logic

The way the surplus donation logic is structured isn't great. There are two
separate paths for starting/increasing donations and decreasing them, making
the logic harder to follow and prone to unnecessary behavior differences.

In preparation for improved donation handling, this patch restructures the
code so that

* All donors - new, increasing and decreasing - are funneled through the
same code path.

* The target donation calculation is factored into hweight_after_donation()
which is called once from the same spot for all possible donors.

* Actual inuse adjustment is factored into transfer_surpluses().

This change introduces a few behavior differences - e.g. donation amount
reduction now uses the max usage of the recent three periods just like new
and increasing donations, and inuse now gets adjusted upwards the same way
it gets downwards. These differences are unlikely to have severely negative
implications and the whole logic will be revamped soon.

This patch also removes two tracepoints. The existing TPs don't quite fit
the new implementation. A later patch will update and reinstate them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

065655c8 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: decouple vrate adjustment from surplus transfers

Budget donations are inaccurate and could take multiple periods to converge.
To prevent triggering vrate adjustments while surplus transfers were
catching up, vrate adjustment was suppressed if donations were increasing,
which was indicated by non-zero nr_surpluses.

This entangling won't be necessary with the scheduled rewrite of donation
mechanism which will make it precise and immediate. Let's decouple the two
in preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8692d2db 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: replace iocg->has_surplus with ->surplus_list

Instead of marking iocgs with surplus with a flag and filtering for them
while walking all active iocgs, build a surpluses list. This doesn't make
much difference now but will help implementing improved donation logic which
will iterate iocgs with surplus multiple times.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1aa50d02 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us

Currently, iocg->usages[] which are used to guide inuse adjustments are
calculated from vtime deltas. This, however, assumes that the hierarchical
inuse weight at the time of calculation held for the entire period, which
often isn't true and can lead to significant errors.

Now that we have absolute usage information collected, we can derive
iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
decisions are made based on actual absolute usage. The calculated usage is
clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
saturation regardless of the current hierarchical inuse weight.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

97eb1975 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: add absolute usage stat

Currently, iocost doesn't collect or expose any statistics punting off all
monitoring duties to drgn based iocost_monitor.py. While it works for some
scenarios, there are some usability and data availability challenges. For
example, accurate per-cgroup usage information can't be tracked by vtime
progression at all and the numbers available in iocg->usages[] are really
short-term snapshots used for control heuristics with possibly significant
errors.

This patch implements per-cgroup absolute usage stat counter and exposes it
through io.stat along with the current vrate. Usage stat collection and
flushing employ the same method as cgroup rstat on the active iocg's and the
only hot path overhead is preemption toggling and adding to a percpu
counter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da437b95 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: grab ioc->lock for debt handling

Currently, debt handling requires only iocg->waitq.lock. In the future, we
want to adjust and propagate inuse changes depending on debt status. Let's
grab ioc->lock in debt handling paths in preparation.

* Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
ioc->lock needs to be made before entering the critical sections.

* Add and use iocg_[un]lock() which handles the conditional double locking.

* Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
the caller grabbed both locks.

This patch is preparatory and the comments contain references to future
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7ca5b2e6 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: streamline vtime margin and timer slack handling

The margin handling was pretty inconsistent.

* ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
thresholds. However, the two are in different units with the former
requiring conversion to vtime on use.

* iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a
quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
different values for the two.

This patch cleans up margin and timer slack handling:

* vtime margins are now recorded in ioc->margins.{min, max} on period
duration changes and used consistently.

* Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
used consistently for iocg_kick_waitq() and iocg_kick_delay().

The only functional change is shortening of timer slack. No meaningful
visible change is expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ce95570a 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: make ioc_now->now and ioc->period_at 64bit

They are in microseconds and wrap in around 1.2 hours with u32. While
unlikely, confusions from wraparounds are still possible. We aren't saving
anything meaningful by keeping these u32. Let's make them u64.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd0adb91 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: use WEIGHT_ONE based fixed point number for weights

To improve weight donations, we want to be able to scale inuse with greater
accuracy and down below 1. Let's make non-hierarchical weights use
WEIGHT_ONE based fixed point numbers too, like hierarchical ones.

This doesn't cause any behavior changes yet.
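
Illustration of the fixed point representation (the shift value here is a
guess for illustration only):

#include <linux/types.h>

#define SKETCH_WEIGHT_ONE       (1U << 16)      /* 1.0 in fixed point */

static u32 sketch_to_fixed(u32 cfg_weight)
{
        return cfg_weight * SKETCH_WEIGHT_ONE;  /* e.g. weight 100 */
}

static u32 sketch_quarter_inuse(void)
{
        return SKETCH_WEIGHT_ONE / 4;           /* 0.25, now representable */
}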

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe20cdb5 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/g

We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
WEIGHT_ONE.

Pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b84b49e 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debt

iocg_kick_waitq() is the function which pays debt and iocg_kick_delay()
updates the actual delay status accordingly. If iocg_kick_delay() is not
called after iocg_kick_waitq() has updated the debt, unnecessarily large
delays can be applied temporarily.

Let's make sure such conditions don't occur by making iocg_kick_waitq()
always call iocg_kick_delay() after paying debt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6ef20f78 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: move iocg_kick_delay() above iocg_kick_waitq()

We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
preparation. This is pure code reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db84a72a 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: clamp inuse and skip noops in __propagate_weights()

__propagate_weights() currently expects the callers to clamp inuse within
[1, active], which is needlessly fragile. The inuse adjustment logic is
going to be revamped, in preparation, let's make __propagate_weights() clamp
inuse on entry.

Also, make it avoid weight updates altogether if neither active nor inuse
has changed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

00410f1b 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: rename propagate_active_weights() to propagate_weights()

It already propagates two weights - active and inuse - and there will be
another soon. Let's drop the confusing misnomers. Rename
[__]propagate_active_weights() to [__]propagate_weights() and
commit_active_weights() to commit_weights().

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5e124f74 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: use local[64]_t for percpu stat

blk-iocost has been reading percpu stat counters from remote cpus which on
some archs can lead to torn reads in really rare occasions. Use local[64]_t
for those counters.
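
Sketch of the pattern (simplified): local[64]_t lets the owning CPU update
cheaply while remote readers never observe a torn value.

#include <linux/percpu.h>
#include <asm/local64.h>

struct sketch_pcpu_stat {
        local64_t usage_us;
};

/* writer: runs pinned on the owning CPU (e.g. with preemption disabled) */
static void sketch_account(struct sketch_pcpu_stat __percpu *stat, u64 us)
{
        local64_add(us, &this_cpu_ptr(stat)->usage_us);
}

/* reader: may run on any CPU when flushing the stats */
static u64 sketch_read(struct sketch_pcpu_stat __percpu *stat, int cpu)
{
        return local64_read(&per_cpu_ptr(stat, cpu)->usage_us);
}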

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1f06959b 31-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove the unused q argument to part_in_flight and part_in_flight_rw

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8328eb28 31-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove the disk argument to delete_partition

We can trivially derive the gendisk from the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f93af2a4 31-Aug-2020 Christoph Hellwig <hch@lst.de>

block: cleanup __alloc_disk_node

Use early returns and goto-based unwinding to simplify the flow a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7cf34d97 31-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove the discard_alignment field from struct hd_struct

The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b8917f5 31-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove the alignment_offset field from struct hd_struct

The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e44a6a23 27-Aug-2020 Xianting Tian <tian.xianting@h3c.com>

blk-mq: use BLK_MQ_NO_TAG for no tag

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cdfcef9e 27-Aug-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Remove blk_mq_attempt_merge() function

The small blk_mq_attempt_merge() function is only called by
__blk_mq_sched_bio_merge(), so just open code it there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7d7ca7c5 27-Aug-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Add a new helper to attempt to merge a bio

There is a lot of duplicated code when trying to merge a bio from the
plug list and the sw queue. Introduce a new helper to attempt to merge
a bio, which simplifies blk_bio_list_merge() and
blk_attempt_plug_merge().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bdc6a287 27-Aug-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Move blk_mq_bio_list_merge() into blk-merge.c

Move blk_mq_bio_list_merge() into blk-merge.c and rename it to a more
generic name.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8e756373 27-Aug-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

block: Move bio merge related functions into blk-merge.c

It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

339b5a25 29-Aug-2020 Danny Lin <danny@kdrag0n.dev>

blk-wbt: Remove obsolete multiqueue I/O scheduling comment

This comment was added before the multiqueue I/O scheduler framework
was introduced; multiqueue has support for I/O scheduling now, so this
obsolete comment can be removed.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3310eeba 27-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove the BIO_USER_MAPPED flag

Just check if there is private data, in which case the bio must have
originated from bio_copy_user_iov.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7589ad67 27-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove __blk_rq_map_user_iov

Just duplicate a small amount of code in the low-level map-into-the-bio
and copy-to-the-bio routines, leading to code that is much easier to
follow and maintain, and to better shared error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b63c052 27-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove __blk_rq_unmap_user

Open code __blk_rq_unmap_user in the two callers. Both never pass a NULL
bio, and one of them can use an existing local variable instead of the bio
flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f3256075 27-Aug-2020 Christoph Hellwig <hch@lst.de>

block: remove the BIO_NULL_MAPPED flag

We can simply use a boolean flag in the bio_map_data data structure
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2b4bb8c 23-Aug-2020 Christoph Hellwig <hch@lst.de>

block: fix locking for struct block_device size updates

Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers. In addition
one of the locks, bd_mutex is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.

Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.

This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.
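
Minimal sketch of the approach (lock and helper names simplified):

#include <linux/spinlock.h>
#include <linux/types.h>

struct sketch_bdev {
        spinlock_t      size_lock;      /* protects only size updates */
        loff_t          size;
};

static void sketch_set_size(struct sketch_bdev *bdev, loff_t bytes)
{
        spin_lock(&bdev->size_lock);
        bdev->size = bytes;
        spin_unlock(&bdev->size_lock);
}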

Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a98278ec 01-Sep-2020 Jens Axboe <axboe@kernel.dk>

Merge branch 'block-5.9' into for-5.10/block

* block-5.9:
blk-stat: make q->stats->lock irqsafe
blk-iocost: ioc_pd_free() shouldn't assume irq disabled
block: fix locking in bdev_del_partition
block: release disk reference in hd_struct_free_work
block: ensure bdi->io_pages is always initialized
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu


e11d80a8 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-stat: make q->stats->lock irqsafe

blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock
which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's
make it irqsafe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cd006509b0a9 ("blk-iocost: account for IO size when testing latencies")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5aeac7c4 01-Sep-2020 Tejun Heo <tj@kernel.org>

blk-iocost: ioc_pd_free() shouldn't assume irq disabled

ioc_pd_free() grabs the irq-safe ioc->lock without ensuring that irqs are
disabled, even though it can be called with irqs either disabled or enabled.
This has a small chance of causing A-A deadlocks and triggers lockdep
splats. Use irqsave operations instead.
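
Illustration of the pattern (schematic, not the exact hunk): a function
that can be called with irqs in either state must save and restore the irq
state itself.

#include <linux/spinlock.h>

static void sketch_pd_free(spinlock_t *ioc_lock)
{
        unsigned long flags;

        spin_lock_irqsave(ioc_lock, flags);     /* safe from either context */
        /* ... tear down the per-cgroup iocost state ... */
        spin_unlock_irqrestore(ioc_lock, flags);
}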

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

08fc1ab6 01-Sep-2020 Christoph Hellwig <hch@lst.de>

block: fix locking in bdev_del_partition

We need to hold the whole device bd_mutex to protect against another
thread concurrently deleting our partition before we get to it, and thus
causing a use after free.

Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cafe01ef 01-Sep-2020 Ming Lei <ming.lei@redhat.com>

block: release disk reference in hd_struct_free_work

Commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
stopped releasing the request queue from wq context because that commit
assumed all blk_put_queue() calls happen in a context which is allowed
to sleep. However, this assumption isn't true, because we release the disk's
reference in the partition's percpu_ref ->release(), which isn't allowed
to sleep since ->release() is run via call_rcu().

Fix this issue by moving the disk reference put into hd_struct_free_work().

Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

de1b0ee4 31-Aug-2020 Jens Axboe <axboe@kernel.dk>

block: ensure bdi->io_pages is always initialized

If a driver leaves the limit settings as the defaults, then we don't
initialize bdi->io_pages. This means that file systems may need to
work around bdi->io_pages == 0, which is somewhat messy.

Initialize the default value just like we do for ->ra_pages.

Cc: stable@vger.kernel.org
Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c41c3ec4 24-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Sagi:
- nvme completion rework from Christoph and Chao that mostly came
from a bit of divergence of how we classify errors related to
pathing/retry etc.
- nvmet passthru fixes from Chaitanya
- minor nvmet fixes from Amit and I
- mpath round-robin path selection fix from Martin
- ignore noiob for zoned devices from Keith
- minor nvme-fc fix from Tianjia"

- BFQ cgroup leak fix (Dmitry)

- block layer MAINTAINERS addition (Geert)

- fix null_blk FUA checking (Hou)

- get_max_io_size() size fix (Keith)

- fix block page_is_mergeable() for compound pages (Matthew)

- discard granularity fixes (Ming)

- IO scheduler ordering fix (Ming)

- misc fixes

* tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits)
null_blk: fix passing of REQ_FUA flag in null_handle_rq
nvmet: Disable keep-alive timer when kato is cleared to 0h
nvme: redirect commands on dying queue
nvme: just check the status code type in nvme_is_path_error
nvme: refactor command completion
nvme: rename and document nvme_end_request
nvme: skip noiob for zoned devices
nvme-pci: fix PRP pool size
nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
nvme: Use spin_lock_irq() when taking the ctrl->lock
nvmet: call blk_mq_free_request() directly
nvmet: fix oops in pt cmd execution
nvmet: add ns tear down label for pt-cmd handling
nvme: multipath: round-robin: eliminate "fallback" variable
nvme: multipath: round-robin: fix single non-optimized path case
nvme-fc: Fix wrong return value in __nvme_fc_init_request()
nvmet-passthru: Reject commands with non-sgl flags set
nvmet: fix a memory leak
blkcg: fix memleak for iolatency
MAINTAINERS: Add missing header files to BLOCK LAYER section
...


df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
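
A schematic before/after of the conversion (hypothetical case values and
helpers):

#include <linux/compiler_attributes.h>
#include <linux/printk.h>

/* before: implicit fall-through documented only by a comment */
static void sketch_old(int cmd)
{
        switch (cmd) {
        case 0:
                pr_debug("prepare\n");
                /* fall through */
        case 1:
                pr_debug("run\n");
                break;
        }
}

/* after: the fallthrough pseudo-keyword, which the compiler can verify */
static void sketch_new(int cmd)
{
        switch (cmd) {
        case 0:
                pr_debug("prepare\n");
                fallthrough;
        case 1:
                pr_debug("run\n");
                break;
        }
}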

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

27029b4b 10-Aug-2020 Yufen Yu <yuyufen@huawei.com>

blkcg: fix memleak for iolatency

Normally, blkcg_iolatency_exit() will free the related iolatency memory
when the queue is cleaned up. But if blk_throtl_init() returns an error and
queue init fails, blkcg_iolatency_exit() will not do that for us, which
causes a memory leak.

Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e4b469c6 06-Aug-2020 Keith Busch <kbusch@kernel.org>

block: fix get_max_io_size()

A previous commit aligning splits to physical block sizes inadvertently
modified one return case such that it now returns 0 length splits
when the number of sectors doesn't exceed the physical offset. This
later hits a BUG in bio_split(). Restore the previous working behavior.

Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <eric.deal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db03f88f 18-Aug-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: insert request not through ->queue_rq into sw/scheduler queue

c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") supposed
to add request which has been through ->queue_rq() to the hw queue dispatch
list, however it adds request running out of budget or driver tag to hw queue
too. This way basically bypasses request merge, and causes too many request
dispatched to LLD, and system% is unnecessary increased.

Fixes this issue by adding request not through ->queue_rq into sw/scheduler
queue, and this way is safe because no ->queue_rq is called on this request
yet.

High %system can be observed on Azure storvsc device, and even soft lock
is observed. This patch reduces %system during heavy sequential IO,
meantime decreases soft lockup risk.

Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2de791ab 11-Aug-2020 Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>

bfq: fix blkio cgroup leakage v4

Changes from v1:
- update commit description with proper ref-accounting justification

commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
introduced a leak for bfq_group and blkcg_gq objects because of a get/put
imbalance.
In fact the whole idea of the original commit is wrong, because the bfq_group
entity cannot disappear under us: it is referenced by child bfq_queue's
entities from here:
-> bfq_init_entity()
->bfqg_and_blkg_get(bfqg);
->entity->parent = bfqg->my_entity

-> bfq_put_queue(bfqq)
FINAL_PUT
->bfqg_and_blkg_put(bfqq_group(bfqq))
->kmem_cache_free(bfq_pool, bfqq);

So the parent entity cannot disappear while a child entity is in the tree,
and child entities already have proper protection.
This patch reverts commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree").

bfq_group leak trace caused by bad commit:
-> blkg_alloc
-> bfq_pq_alloc
-> bfqg_get (+1)
->bfq_activate_bfqq
->bfq_activate_requeue_entity
-> __bfq_activate_entity
->bfq_get_entity
->bfqg_and_blkg_get (+1) <==== : Note1
->bfq_del_bfqq_busy
->bfq_deactivate_entity+0x53/0xc0 [bfq]
->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
-> bfq_forget_entity(is_in_service = true)
entity->on_st_or_in_serv = false <=== :Note2
if (is_in_service)
return; ==> do not touch reference
-> blkcg_css_offline
-> blkcg_destroy_blkgs
-> blkg_destroy
-> bfq_pd_offline
-> __bfq_deactivate_entity
if (!entity->on_st_or_in_serv) /* true, because (Note2)
return false;
-> bfq_pd_free
-> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq will leak forever, see test-case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
mkdir -p /sys/fs/cgroup/blkio/a
echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
rmdir /sys/fs/cgroup/blkio/a
grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d8166519 17-Aug-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

block: Fix page_is_mergeable() for compound pages

If we pass in an offset which is larger than PAGE_SIZE, then
page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
leading to a large number of bio_vecs being used. Use a slightly more
obvious test that the two pages are compatible with each other.

Fixes: 52d52d1c98a9 ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

943b40c8 17-Aug-2020 Ming Lei <ming.lei@redhat.com>

block: respect queue limit of max discard segment

When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
return false for a discard request, and then normal request merging is
applied. However, only queue_max_segments() is checked, so the max discard
segment limit isn't respected.

Check the max discard segment limit in the request merge code to fix
the issue.

This fixes discard request failures on virtio_blk.
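
Rough sketch of the added check (simplified; not the exact hunk):

#include <linux/blkdev.h>

static bool sketch_req_fits_segments(struct request_queue *q,
                                     struct request *rq,
                                     unsigned int extra_segs)
{
        unsigned int max_segs = req_op(rq) == REQ_OP_DISCARD ?
                queue_max_discard_segments(q) : queue_max_segments(q);

        /* respect the discard-specific segment limit when merging */
        return rq->nr_phys_segments + extra_segs <= max_segs;
}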

Fixes: 69840466086d ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d7d8535f 17-Aug-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART

The SCHED_RESTART code path is relied on to re-run the queue for dispatching
requests in hctx->dispatch. Meanwhile the SCHED_RESTART flag is checked when
adding requests to hctx->dispatch.

Memory barriers have to be used for ordering the following two pairs of OPs:

1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()

2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().

Without the added memory barrier, either:

1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not run the queue
on the dispatch side

or

2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not run the
queue on the dispatch side, while the check for requests in hctx->dispatch
from blk_mq_sched_restart() is missed.

IO hang in ltp/fs_fill test is reported by kernel test robot:

https://lkml.org/lkml/2020/7/26/77

It turns out to be caused by the above out-of-order OPs, and the IO hang
can't be observed any more after applying this patch.
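
Schematic sketch of the barrier pairing (simplified; not the exact patch):

/* dispatch side, e.g. blk_mq_dispatch_rq_list() */
list_splice_tail_init(&rq_list, &hctx->dispatch);       /* 1a: add requests  */
smp_mb();                                               /* order 1a vs 1b    */
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))    /* 1b: check restart */
        blk_mq_run_hw_queue(hctx, true);

/* restart side, e.g. blk_mq_sched_restart() */
clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);        /* 2a: clear restart */
smp_mb__after_atomic();                                 /* order 2a vs 2b    */
if (!list_empty_careful(&hctx->dispatch))               /* 2b: check dispatch */
        blk_mq_run_hw_queue(hctx, true);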

Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

03ef5941 16-Aug-2020 Xu Wang <vulab@iscas.ac.cn>

bsg-lib: convert comma to semicolon

Replace a comma between expression statements by a semicolon.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

26bfeb26 16-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

block: blk-mq.c: fix @at_head kernel-doc warning

Fix a kernel-doc warning in block/blk-mq.c:

../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'

Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: André Almeida <andrealmeid@collabora.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4b6c093e 15-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few fixes on the block side of things:

- Discard granularity fix (Coly)

- rnbd cleanups (Guoqing)

- md error handling fix (Dan)

- md sysfs fix (Junxiao)

- Fix flush request accounting, which caused an IO slowdown for some
configurations (Ming)

- Properly propagate loop flag for partition scanning (Lennart)"

* tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
block: fix double account of flush request's driver tag
loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
rnbd: no need to set bi_end_io in rnbd_bio_map_kern
rnbd: remove rnbd_dev_submit_io
md-cluster: Fix potential error pointer dereference in resize_bitmaps()
block: check queue's limits.discard_granularity in __blkdev_issue_discard()
md: get sysfs entry after redundancy attr group create


c1e2b842 09-Aug-2020 Ming Lei <ming.lei@redhat.com>

block: fix double account of flush request's driver tag

In the case of the none scheduler, we share the data request's driver tag
for the flush request, so we have to mark the flush request as INFLIGHT to
avoid double accounting of this driver tag.

Fixes: 568f27006577 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

97d052ea 10-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:

- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.

- The seqcount associated lock debug addons which were blocked by the
above fallout.

seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.

This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.

Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.

Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.

While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemptible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.

- Conversion of seqcount usage sites to associated types and
initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...


dfdf16ec 06-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This consists of the usual driver updates (ufs, qla2xxx, tcmu, lpfc,
hpsa, zfcp, scsi_debug) and minor bug fixes.

We also have a huge docbook fix update like most other subsystems and
no major update to the core (the few non trivial updates are either
minor fixes or removing an unused feature [scsi_sdb_cache])"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (307 commits)
scsi: scsi_transport_srp: Sanitize scsi_target_block/unblock sequences
scsi: ufs-mediatek: Apply DELAY_AFTER_LPM quirk to Micron devices
scsi: ufs: Introduce device quirk "DELAY_AFTER_LPM"
scsi: virtio-scsi: Correctly handle the case where all LUNs are unplugged
scsi: scsi_debug: Implement tur_ms_to_ready parameter
scsi: scsi_debug: Fix request sense
scsi: lpfc: Fix typo in comment for ULP
scsi: ufs-mediatek: Prevent LPM operation on undeclared VCC
scsi: iscsi: Do not put host in iscsi_set_flashnode_param()
scsi: hpsa: Correct ctrl queue depth
scsi: target: tcmu: Make TMR notification optional
scsi: target: tcmu: Implement tmr_notify callback
scsi: target: tcmu: Fix and simplify timeout handling
scsi: target: tcmu: Factor out new helper ring_insert_padding
scsi: target: tcmu: Do not queue aborted commands
scsi: target: tcmu: Use priv pointer in se_cmd
scsi: target: Add tmr_notify backend function
scsi: target: Modify core_tmr_abort_task()
scsi: target: iscsi: Fix inconsistent debug message
scsi: target: iscsi: Fix login error when receiving
...


b35fd742 05-Aug-2020 Coly Li <colyli@suse.de>

block: check queue's limits.discard_granularity in __blkdev_issue_discard()

If a loop device is created with a backing NVMe SSD, the current loop
device driver doesn't correctly set its queue's limits.discard_granularity
and leaves it as 0. If a discard request is issued at LBA 0 on this loop
device, the req_sects calculated in __blkdev_issue_discard() will be 0, and
a zero-length discard request will trigger a BUG() panic in generic block
layer code at block/blk-mq.c:563.

[ 955.565006][ C39] ------------[ cut here ]------------
[ 955.559660][ C39] invalid opcode: 0000 [#1] SMP NOPTI
[ 955.622171][ C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G E 5.8.0-default+ #40
[ 955.622171][ C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
[ 955.622175][ C39] RIP: 0010:blk_mq_end_request+0x107/0x110
[ 955.622177][ C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
[ 955.622179][ C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
[ 955.749277][ C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
[ 955.749278][ C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 955.749279][ C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 955.749279][ C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
[ 955.749280][ C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
[ 955.749281][ C39] FS: 0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
[ 955.749282][ C39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 955.749282][ C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
[ 955.749283][ C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 955.749284][ C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 955.749284][ C39] PKRU: 55555554
[ 955.749285][ C39] Call Trace:
[ 955.749290][ C39] blk_done_softirq+0x99/0xc0
[ 957.550669][ C39] __do_softirq+0xd3/0x45f
[ 957.550677][ C39] ? smpboot_thread_fn+0x2f/0x1e0
[ 957.550679][ C39] ? smpboot_thread_fn+0x74/0x1e0
[ 957.550680][ C39] ? smpboot_thread_fn+0x14e/0x1e0
[ 957.550684][ C39] run_ksoftirqd+0x30/0x60
[ 957.550687][ C39] smpboot_thread_fn+0x149/0x1e0
[ 957.886225][ C39] ? sort_range+0x20/0x20
[ 957.886226][ C39] kthread+0x137/0x160
[ 957.886228][ C39] ? kthread_park+0x90/0x90
[ 957.886231][ C39] ret_from_fork+0x22/0x30
[ 959.117120][ C39] ---[ end trace 3dacdac97e2ed164 ]---

This is the procedure to reproduce the panic,
# modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
# losetup -f /dev/nvme0n1 --direct-io=on
# blkdiscard /dev/loop0 -o 0 -l 0x200

This patch fixes the issue by checking q->limits.discard_granularity in
__blkdev_issue_discard() before composing the discard bio. If the value
is 0, a warning is printed and -EOPNOTSUPP is returned to the caller to
indicate that this buggy device driver doesn't support discard requests.
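
As a rough sketch (not the exact upstream diff), the added guard could
look like this, using only names already mentioned above; the warning
text is illustrative:

    struct request_queue *q = bdev_get_queue(bdev);

    if (!q)
        return -ENXIO;

    /* A zero discard_granularity means the driver never set up its
     * discard limits; warn and refuse instead of building a
     * zero-length discard bio. */
    if (!q->limits.discard_granularity) {
        pr_warn_once("device does not report a discard granularity\n");
        return -EOPNOTSUPP;
    }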

Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
Fixes: c52abf563049 ("loop: Better discard support for block devices")
Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

060a72a2 05-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block

Pull block stacking updates from Jens Axboe:
"The stacking related fixes depended on both the core block and drivers
branches, so here's a topic branch with that change.

Outside of that, a late fix from Johannes for zone revalidation"

* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
block: don't do revalidate zones on invalid devices
block: remove blk_queue_stack_limits
block: remove bdev_stack_limits
block: inherit the zoned characteristics in blk_stack_limits


e0fc99e2 05-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

- NVMe:
- ZNS support (Aravind, Keith, Matias, Niklas)
- Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
Dongli, Max, Sagi)

- null_blk zone capacity support (Aravind)

- MD:
- raid5/6 fixes (ChangSyun)
- Warning fixes (Damien)
- raid5 stripe fixes (Guoqing, Song, Yufen)
- sysfs deadlock fix (Junxiao)
- raid10 deadlock fix (Vitaly)

- struct_size conversions (Gustavo)

- Set of bcache updates/fixes (Coly)

* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
md/raid5: Allow degraded raid6 to do rmw
md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
raid5: don't duplicate code for different paths in handle_stripe
raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
md: print errno in super_written
md/raid5: remove the redundant setting of STRIPE_HANDLE
md: register new md sysfs file 'uuid' read-only
md: fix max sectors calculation for super 1.0
nvme-loop: remove extra variable in create ctrl
nvme-loop: set ctrl state connecting after init
nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
nvme-multipath: fix logic for non-optimized paths
nvme-rdma: fix controller reset hang during traffic
nvme-tcp: fix controller reset hang during traffic
nvmet: introduce the passthru Kconfig option
nvmet: introduce the passthru configfs interface
nvmet: Add passthru enable/disable helpers
nvmet: add passthru code to process commands
nvme: export nvme_find_get_ns() and nvme_put_ns()
nvme: introduce nvme_ctrl_get_by_path()
...


99ea1521 04-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
"This is long overdue, and has hidden too many bugs over the years. The
series has several "by hand" fixes, and then a trivial treewide
replacement.

- Clean up non-trivial uses of uninitialized_var()

- Update documentation and checkpatch for uninitialized_var() removal

- Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
compiler: Remove uninitialized_var() macro
treewide: Remove uninitialized_var() usage
checkpatch: Remove awareness of uninitialized_var() macro
mm/debug_vm_pgtable: Remove uninitialized_var() usage
f2fs: Eliminate usage of uninitialized_var() macro
media: sur40: Remove uninitialized_var() usage
KVM: PPC: Book3S PR: Remove uninitialized_var() usage
clk: spear: Remove uninitialized_var() usage
clk: st: Remove uninitialized_var() usage
spi: davinci: Remove uninitialized_var() usage
ide: Remove uninitialized_var() usage
rtlwifi: rtl8192cu: Remove uninitialized_var() usage
b43: Remove uninitialized_var() usage
drbd: Remove uninitialized_var() usage
x86/mm/numa: Remove uninitialized_var() usage
docs: deprecated.rst: Add uninitialized_var()


cdc8fcb4 03-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:

- Cleanup how memory accounting is done on ring setup/free (Bijan)

- sq array offset calculation fixup (Dmitry)

- Consistently handle blocking off O_DIRECT submission path (me)

- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)

- IO completion optimizations (me)

- Fix race with accounting and ring fd install (me)

- Support EPOLLEXCLUSIVE (Jiufei)

- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)

- Completion side cleanups (Pavel)

- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

- Request environment grabbing cleanups (Pavel)

- File and socket read/write cleanups (Pavel)

- Improve kiocb_set_rw_flags() (Pavel)

- Tons of fixes and cleanups (Pavel)

- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...


382625d0 03-Aug-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.

- Softirq completion cleanups (Christoph)

- Stop using ->queuedata (Christoph)

- Cleanup bd claiming (Christoph)

- Use check_events, moving away from the legacy media change
(Christoph)

- Use inode i_blkbits consistently (Christoph)

- Remove old unused writeback congestion bits (Christoph)

- Cleanup/unify submission path (Christoph)

- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)

- sbitmap cleared bits handling (John)

- Request merging blktrace event addition (Jan)

- sysfs add/remove race fixes (Luis)

- blk-mq tag fixes/optimizations (Ming)

- Duplicate words in comments (Randy)

- Flush deferral cleanup (Yufen)

- IO context locking/retry fixes (John)

- struct_size() usage (Gustavo)

- blk-iocost fixes (Chengming)

- blk-cgroup IO stats fixes (Boris)

- Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...


1a1206dc 30-Jul-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

block: don't do revalidate zones on invalid devices

When we lose a device for whatever reason while (re)scanning zones, we
trip over a NULL pointer in blk_revalidate_zone_cb, as in the following
log:

sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0xb [current]
sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
sda: failed to revalidate zones
sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
sda: detected capacity change from 14000519643136 to 0
==================================================================
BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58

CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound async_run_entry_fn
Call Trace:
dump_stack+0x7d/0xb0
? blk_revalidate_zone_cb+0x1b7/0x550
kasan_report.cold+0x5/0x37
? blk_revalidate_zone_cb+0x1b7/0x550
check_memory_region+0x145/0x1a0
blk_revalidate_zone_cb+0x1b7/0x550
sd_zbc_parse_report+0x1f1/0x370
? blk_req_zone_write_trylock+0x200/0x200
? sectors_to_logical+0x60/0x60
? blk_req_zone_write_trylock+0x200/0x200
? blk_req_zone_write_trylock+0x200/0x200
sd_zbc_report_zones+0x3c4/0x5e0
? sd_dif_config_host+0x500/0x500
blk_revalidate_disk_zones+0x231/0x44d
? _raw_write_lock_irqsave+0xb0/0xb0
? blk_queue_free_zone_bitmaps+0xd0/0xd0
sd_zbc_read_zones+0x8cf/0x11a0
sd_revalidate_disk+0x305c/0x64e0
? __device_add_disk+0x776/0xf20
? read_capacity_16.part.0+0x1080/0x1080
? blk_alloc_devt+0x250/0x250
? create_object.isra.0+0x595/0xa20
? kasan_unpoison_shadow+0x33/0x40
sd_probe+0x8dc/0xcd2
really_probe+0x20e/0xaf0
__driver_attach_async_helper+0x249/0x2d0
async_run_entry_fn+0xbe/0x560
process_one_work+0x764/0x1290
? _raw_read_unlock_irqrestore+0x30/0x30
worker_thread+0x598/0x12f0
? __kthread_parkme+0xc6/0x1b0
? schedule+0xed/0x2c0
? process_one_work+0x1290/0x1290
kthread+0x36b/0x440
? kthread_create_worker_on_cpu+0xa0/0xa0
ret_from_fork+0x22/0x30
==================================================================

When the device is already gone we end up with the following scenario:
The device's capacity is 0 and thus the number of zones will be 0 as well. When
allocating the bitmap for the conventional zones, we then trip over a NULL
pointer.

So if we encounter a zoned block device with a capacity of 0, don't dare
to revalidate the zone sizes.
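
A minimal sketch of such a guard (the exact placement and return value
are assumptions here, not the verbatim patch):

    /* In blk_revalidate_disk_zones(): bail out early if the device is gone. */
    if (!get_capacity(disk))
        return -EIO;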

Fixes: 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d958e343 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: blk-timeout: delete duplicated word

Drop the repeated word "request".
Change to the correct kernel-doc notation for the function name separator.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c4aecaa2 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: blk-mq-sched: delete duplicated word

Drop the repeated word "to".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

70f15a4f 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: blk-mq: delete duplicated word

Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0d20dcc2 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: genhd: delete duplicated words

Drop the repeated word "to" in multiple places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5b8f65e1 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: elevator: delete duplicated word and fix typos

Drop the repeated word "the".
Fix typos of "features" and "specified".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3cf14889 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: bio: delete duplicated words

Drop the repeated words "a" and "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f06678af 30-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

block: bfq-iosched: fix duplicated word

Change "at at" to "at a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d9012a59 30-Jul-2020 Chengming Zhou <zhouchengming@bytedance.com>

iocost: Fix check condition of iocg abs_vdebt

We shouldn't skip iocg when its abs_vdebt is not zero.

Fixes: 0b80f9866e6b ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

67b7b641 20-Jul-2020 Ahmed S. Darwish <a.darwish@linutronix.de>

iocost: Use sequence counter with associated spinlock

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows a spinlock to be
associated with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
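
For illustration only, a hedged usage sketch with a made-up structure
(the seqcount_spinlock_t API itself is the one described above):

    struct foo_stats {                      /* hypothetical example */
        spinlock_t lock;                    /* serializes writers */
        seqcount_spinlock_t seq;            /* associated with 'lock' */
        u64 usage;
    };

    static void foo_stats_init(struct foo_stats *s)
    {
        spin_lock_init(&s->lock);
        seqcount_spinlock_init(&s->seq, &s->lock);
    }

    static void foo_stats_add(struct foo_stats *s, u64 n)
    {
        spin_lock(&s->lock);
        write_seqcount_begin(&s->seq);      /* lockdep asserts s->lock is held */
        s->usage += n;
        write_seqcount_end(&s->seq);
        spin_unlock(&s->lock);
    }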

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20200720155530.1173732-21-a.darwish@linutronix.de

08c875cb 28-Jul-2020 Daniel Wagner <dwagner@suse.de>

block: Use non _rcu version of list functions for tag_set_list

tag_set_list is only accessed under the tag_set_lock lock. There is
no need for using the _rcu list functions.

The _rcu list functions were introduced to allow read access to the
tag_set_list protected under RCU, see 705cda97ee3a ("blk-mq: Make it
safe to use RCU to iterate over blk_mq_tag_set.tag_list") and
05b79413946d ("Revert "blk-mq: don't handle TAG_SHARED in restart"").
Those changes got reverted later but the cleanup commit missed a
couple of places to undo the changes.

Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8f38f8e0 06-Jul-2020 Alan Stern <stern@rowland.harvard.edu>

scsi: block: pm: Simplify resume handling

Commit 05d18ae1cc8a ("scsi: pm: Balance pm_only counter of request queue
during system resume") fixed a problem in the block layer's runtime-PM
code: blk_set_runtime_active() failed to call blk_clear_pm_only().
However, the commit's implementation was awkward; it forced the SCSI
system-resume handler to choose whether to call blk_post_runtime_resume()
or blk_set_runtime_active(), depending on whether or not the SCSI device
had previously been runtime suspended.

This patch simplifies the situation considerably by adding the missing
function call directly into blk_set_runtime_active() (under the condition
that the queue is not already in the RPM_ACTIVE state). This allows the
SCSI routine to revert back to its original form. Furthermore, making this
change reveals that blk_post_runtime_resume() (in its success pathway) does
exactly the same thing as blk_set_runtime_active(). The duplicate code is
easily removed by making one routine call the other.

No functional changes are intended.

Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
CC: Can Guo <cang@codeaurora.org>
CC: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

b9b1a5d7 20-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove blk_queue_stack_limits

This function is just a tiny wrapper around blk_stack_limits. Open code
it in the two callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9efa82ef 20-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove bdev_stack_limits

This function is just a tiny wrapper around blk_stack_limits and has
two callers. Simplify the stack a bit by open coding it in the two
callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3093a479 20-Jul-2020 Christoph Hellwig <hch@lst.de>

block: inherit the zoned characteristics in blk_stack_limits

Lift the code from device mapper into blk_stack_limits to inherit
the stacking limitations. This ensures we do the right thing for
all stacked zoned block devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f43d648 20-Jul-2020 Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.9/drivers' into for-5.9/block-merge

* for-5.9/drivers: (38 commits)
block: add max_active_zones to blk-sysfs
block: add max_open_zones to blk-sysfs
s390/dasd: Use struct_size() helper
s390/dasd: fix inability to use DASD with DIAG driver
md-cluster: fix wild pointer of unlock_all_bitmaps()
md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
md: fix deadlock causing by sysfs_notify
md: improve io stats accounting
md: raid0/linear: fix dereference before null check on pointer mddev
rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()'
nvme: remove ns->disk checks
nvme-pci: use standard block status symbolic names
nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
nvme-pci: add a blank line after declarations
nvme-pci: fix some comments issues
nvme-pci: remove redundant segment validation
nvme: document quirked Intel models
nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
nvme: support for zoned namespaces
nvme: support for multiple Command Sets Supported and Effects log pages
...


9caaa66c 20-Jul-2020 Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.9/block' into for-5.9/block-merge

* for-5.9/block: (124 commits)
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
block: use bd_prepare_to_claim directly in the loop driver
block: refactor bd_start_claiming
block: simplify the restart case in __blkdev_get
Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
block: always remove partitions from blk_drop_partitions()
block: relax jiffies rounding for timeouts
blk-mq: remove redundant validation in __blk_mq_end_request()
blk-mq: Remove unnecessary local variable
writeback: remove bdi->congested_fn
writeback: remove struct bdi_writeback_congested
writeback: remove {set,clear}_wb_congested
...


ef45fe47 01-Jun-2020 Boris Burkov <boris@bur.io>

blk-cgroup: show global disk stats in root cgroup io.stat

In order to improve consistency and usability in cgroup stat accounting,
we would like to support the root cgroup's io.stat.

Since the root cgroup has processes doing io even if the system has no
explicitly created cgroups, we need to be careful to avoid overhead in
that case. For that reason, the rstat algorithms don't handle the root
cgroup, so just turning the file on wouldn't give correct statistics.

To get around this, we simulate flushing the iostat struct by filling it
out directly from global disk stats. The result is a root cgroup io.stat
file consistent with both /proc/diskstats and io.stat.

Note that in order to collect the disk stats, we needed to iterate over
devices. To facilitate that, we had to change the linkage of a disk_type
to external so that it can be used from blk-cgroup.c to iterate over
disks.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd1fc4b9 01-Jun-2020 Boris Burkov <boris@bur.io>

blk-cgroup: make iostat functions visible to stat printing

Previously, the code which printed io.stat only needed access to the
generic rstat flushing code, but since we plan to write some more
specific code for preparing root cgroup stats, we need to manipulate
iostat structs directly. Since declaring static functions ahead does not
seem like common practice in this file, simply move the iostat functions
up. We only plan to use blkg_iostat_set, but it seems better to keep them
all together.

Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9b15d109 16-Jul-2020 Coly Li <colyli@suse.de>

block: improve discard bio alignment in __blkdev_issue_discard()

This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

The current discard bio split algorithm in __blkdev_issue_discard() may
leave non-discarded fragments on the device even when the discard bio LBA
and size are both aligned to the device's discard granularity.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
/dev/sdb, fill data into the first 50GB by,
# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
# sg_get_lba_status /dev/sdb -m 1048 --lba=0
descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

Although the discard bio starts at LBA 0 and has a size of 50<<30 bytes,
which is perfectly aligned to the discard granularity, the above list shows
that many 1MB (2048-sector) internal fragments unexpectedly remain.

The problem is in __blkdev_issue_discard(): an improper algorithm produces
an improperly sized bio which is not aligned.

25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
26 sector_t nr_sects, gfp_t gfp_mask, int flags,
27 struct bio **biop)
28 {
29 struct request_queue *q = bdev_get_queue(bdev);
[snipped]
56
57 while (nr_sects) {
58 sector_t req_sects = min_t(sector_t, nr_sects,
59 bio_allowed_max_sectors(q));
60
61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63 bio = blk_next_bio(bio, 0, gfp_mask);
64 bio->bi_iter.bi_sector = sector;
65 bio_set_dev(bio, bdev);
66 bio_set_op_attrs(bio, op, 0);
67
68 bio->bi_iter.bi_size = req_sects << 9;
69 sector += req_sects;
70 nr_sects -= req_sects;
[snipped]
79 }
80
81 *biop = bio;
82 return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);

At lines 58-59, to discard a 50GB range, req_sects is set to the return
value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case the discard granularity is 2048 sectors; although the start LBA
and discard length are aligned to the discard granularity, req_sects never
has a chance to be aligned to it. This is why some still-mapped
2048-sector fragments remain in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UINT_MAX, then all subsequent split bios inside the device
driver are (almost) aligned to the discard_granularity of the device
queue, and the 2048-sector still-mapped fragments will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.
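
As a rough sketch of such a helper (an assumed implementation, which may
differ from the actual patch):

    /* Largest byte count not exceeding UINT_MAX that is a multiple of the
     * discard granularity, converted to a sector count. */
    static inline unsigned int
    bio_aligned_discard_max_sectors(struct request_queue *q)
    {
        return round_down(UINT_MAX, q->limits.discard_granularity) >>
                SECTOR_SHIFT;
    }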

But we still need to handle the situation where the discard start LBA is
not aligned to q->limits.discard_granularity; otherwise, even if the length
is aligned, the current code may still leave 2048-sector fragments around
every 4GB range. Therefore, to calculate req_sects, the start LBA of the
discard range is checked first (including the partition offset); if it is
not aligned to the discard granularity, the first split location makes sure
the following bio has bi_sector aligned to the discard granularity. Then
there won't be still-mapped fragments in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after a discard with the
same command line mentioned previously, sg_get_lba_status returns:
descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

We can see there is no 2048-sector segment anymore; everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5718d6c 16-Jul-2020 Yufen Yu <yuyufen@huawei.com>

block: defer flush request no matter whether we have elevator

Commit 7520872c0cf4 ("block: don't defer flushes on blk-mq + scheduling")
tried to fix a deadlock caused by a cyclic wait between flush requests and
data requests on flush_data_in_flight. The former held all driver tags
and waited for data request completion, but the latter could not complete
while waiting for free driver tags.

After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
for flush rq"), flush requests will not get driver tag before queuing
into flush queue.

* With an elevator, flush requests only get sched_tags before being
inserted into the flush queue. They will not get a driver tag until they
are issued to the driver, so data requests on the fq->flush_data_in_flight
list will complete in the end.

* Without an elevator, each flush request gets a driver tag when the
request is allocated, so data requests on fq->flush_data_in_flight don't
have to worry about lacking a driver tag.

In both of these cases the cyclic wait cannot occur, so we may allow
flush requests to be deferred.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

943c4d90 17-Jul-2020 Wei Yongjun <weiyongjun1@huawei.com>

block: make blk_timeout_init() static

The sparse tool complains as follows:

block/blk-timeout.c:93:12: warning:
symbol 'blk_timeout_init' was not declared. Should it be static?

Function blk_timeout_init() is not used outside of blk-timeout.c, so
mark it static.

Fixes: 9054650fac24 ("block: relax jiffies rounding for timeouts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3f649ab7 03-Jun-2020 Kees Cook <keescook@chromium.org>

treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>

ab96bbab 19-Jun-2020 John Ogness <john.ogness@linutronix.de>

block: remove retry loop in ioc_release_fn()

The reverse-order double lock dance in ioc_release_fn() is using a
retry loop. This is a problem on PREEMPT_RT because it could preempt
the task that would release q->queue_lock and thus live lock in the
retry loop.

RCU is already managing the freeing of the request queue and icq. If
the trylock fails, use RCU to guarantee that the request queue and
icq are not freed and re-acquire the locks in the correct order,
allowing forward progress.
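
Roughly, the pattern becomes the following (a hedged sketch; the re-check
of an already-destroyed icq is omitted here):

    if (spin_trylock(&q->queue_lock)) {
        ioc_destroy_icq(icq);
        spin_unlock(&q->queue_lock);
    } else {
        /* RCU keeps q and icq from being freed while ioc->lock is
         * dropped and both locks are re-taken in the correct order. */
        rcu_read_lock();
        spin_unlock(&ioc->lock);
        spin_lock(&q->queue_lock);
        spin_lock(&ioc->lock);
        ioc_destroy_icq(icq);
        spin_unlock(&q->queue_lock);
        rcu_read_unlock();
    }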

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a43f085f 19-Jun-2020 John Ogness <john.ogness@linutronix.de>

block: remove unnecessary ioc nested locking

The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
elevator callback. This led to a lockdep warning, which was fixed in
commit d8c66c5d5924 ("block: fix lockdep warning on io_context release
put_io_context()") by using a nested subclass for the ioc spinlock.
However, with commit f382fb0bcef4 ("block: remove legacy IO schedulers")
the CFQ IO scheduler no longer exists.

The BFQ IO scheduler also implements the exit_icq() elevator callback but
does not call put_io_context().

The nested subclass for the ioc spinlock is no longer needed. Since it
existed as an exception and no longer applies, remove the nested subclass
usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

659bf827 14-Jul-2020 Niklas Cassel <niklas.cassel@wdc.com>

block: add max_active_zones to blk-sysfs

Add a new max_active_zones definition in the sysfs documentation.
This definition will be common for all devices utilizing the zoned block
device support in the kernel.

Export max_active_zones according to this new definition for NVMe Zoned
Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
the kernel), and ZBC SCSI devices.

Add the new max_active_zones member to struct request_queue, rather
than as a queue limit, since this property cannot be split across stacking
drivers.

For SCSI devices, even though max active zones is not part of the ZBC/ZAC
spec, export max_active_zones as 0, signifying "no limit".

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e15864f8 14-Jul-2020 Niklas Cassel <niklas.cassel@wdc.com>

block: add max_open_zones to blk-sysfs

Add a new max_open_zones definition in the sysfs documentation.
This definition will be common for all devices utilizing the zoned block
device support in the kernel.

Export max open zones according to this new definition for NVMe Zoned
Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
the kernel), and ZBC SCSI devices.

Add the new max_open_zones member to struct request_queue, rather
than as a queue limit, since this property cannot be split across stacking
drivers.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e791ee68 15-Jul-2020 Jens Axboe <axboe@kernel.dk>

Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."

This reverts commit 826f2f48da8c331ac51e1381998d318012d66550.

Qian Cai reports that this commit causes stalls with swap. Revert until
the reason can be figured out.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d0f0f1b4 15-Jul-2020 Ming Lei <ming.lei@redhat.com>

block: always remove partitions from blk_drop_partitions()

In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be created
on a disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't check
GENHD_FL_NO_PART_SCAN, so partitions can still be added even though
GENHD_FL_NO_PART_SCAN is set.

So far blk_drop_partitions() only removes partitions when disk_part_scan_enabled()
returns true. This can leave ghost partitions on a loop device after
changing/clearing the FD while PARTSCAN is disabled; for example, partitions
can still be added via 'parted' on the loop disk even though
GENHD_FL_NO_PART_SCAN is set.

Fix this issue by always removing partitions in blk_drop_partitions(). This
is correct because the current code assumes that no partitions can be added
when GENHD_FL_NO_PART_SCAN is set.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9054650f 13-Jul-2020 Jens Axboe <axboe@kernel.dk>

block: relax jiffies rounding for timeouts

In doing high IOPS testing, blk-mq is generally pretty well optimized.
There are a few things that stuck out as using more CPU than what is
really warranted, and one thing is the round_jiffies_up() that we do
twice for each request. That accounts for about 0.8% of the CPU in
my testing.

We can make this cheaper by avoiding an integer division, by just adding
a rough HZ mask that we can AND with instead. The timeouts are only on a
second granularity already, we don't have to be that accurate here and
this patch barely changes that. All we care about is nice grouping.
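
For illustration, the rough idea (the mask handling here is an
approximation, not the exact patch):

    /* blk_timeout_mask is set once at init to roundup_pow_of_two(HZ) - 1,
     * i.e. a power-of-two mask spanning roughly one second. */
    static inline unsigned long blk_round_jiffies(unsigned long j)
    {
        /* add + mask instead of the division in round_jiffies_up() */
        return (j + blk_timeout_mask) & ~blk_timeout_mask;
    }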

Signed-off-by: Jens Axboe <axboe@kernel.dk>

d33db702 10-Jul-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Fix for inflight accounting, which affects only dm (Ming)

- Fix documentation error for bfq (Yufen)

- Fix memory leak for nbd (Zheng)

* tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block:
nbd: Fix memory leak in nbd_add_socket
blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight()
docs: block: update and fix tiny error for bfq


87890092 04-Jul-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-mq: remove redundant validation in __blk_mq_end_request()

We've already validated 'q->elevator' before calling
->ops.completed_request() in blk_mq_sched_completed_request(), so there is
no need to validate rq->internal_tag again. Remove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

106e71c5 04-Jul-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-mq: Remove unnecessary local variable

Remove unnecessary local variable 'ret' in blk_mq_dispatch_hctx_list().

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8c911f3d 01-Jul-2020 Christoph Hellwig <hch@lst.de>

writeback: remove struct bdi_writeback_congested

We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a564e23f 08-Jul-2020 Christoph Hellwig <hch@lst.de>

md: switch to ->check_events for media change notifications

md is the last driver using the legacy media_changed method. Switch
it over to the (not so) new ->check_events approach, which also removes the
need for the ->revalidate_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

568f2700 06-Jul-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: centralise related handling into blk_mq_get_driver_tag

Move the .nr_active update and request assignment into blk_mq_get_driver_tag();
both are good to do while getting the driver tag.

Meanwhile, blk-flush related code is simplified and flush requests no longer
need to update the request table manually.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7bf13729 01-Jul-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: streamline handling of q->mq_ops->queue_rq result

The current handling of the q->mq_ops->queue_rq result is a bit ugly:

- the two branches which need to 'continue' have to check whether the
dispatch local list is empty, otherwise one bad request may
be retrieved via 'rq = list_first_entry(list, struct request, queuelist);'

- the branch of 'if (unlikely(ret != BLK_STS_OK))' isn't easy
to follow, since it is actually an error branch.

Streamline this handling so the code becomes more readable; meanwhile a
potential kernel oops can be avoided in case the last request in the
local dispatch list fails.

Fixes: fc17b6534eb8 ("blk-mq: switch ->queue_rq return value to blk_status_t")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

240e6ee2 29-Jun-2020 Keith Busch <keith.busch@wdc.com>

nvme: support for zoned namespaces

Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
in NVM Express TP4053. Zoned namespaces are discovered based on their
Command Set Identifier reported in the namespaces Namespace
Identification Descriptor list. A successfully discovered Zoned
Namespace will be registered with the block layer as a host managed
zoned block device with Zone Append command support. A namespace that
does not support append is not supported by the driver.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Keith Busch <keith.busch@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

82394db7 29-Jun-2020 Matias Bjørling <matias.bjorling@wdc.com>

block: add capacity field to zone descriptors

In the zoned storage model, the sectors within a zone are typically all
writeable. With the introduction of the Zoned Namespace (ZNS) Command
Set in the NVM Express organization, the model was extended to have a
specific writeable capacity.

Extend the zone descriptor data structure with a zone capacity field to
indicate to the user how many sectors in a zone are writeable.

Introduce backward compatibility in the zone report ioctl by extending
the zone report header data structure with a flags field to indicate if
the capacity field is available.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

482c6b61 08-Jul-2020 Jens Axboe <axboe@kernel.dk>

Merge tag 'v5.8-rc4' into for-5.9/drivers

Merge in 5.8-rc4 for-5.9/block to setup for-5.9/drivers, to provide
a clean base and making the life for the NVMe changes easier.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v5.8-rc4': (732 commits)
Linux 5.8-rc4
x86/ldt: use "pr_info_once()" instead of open-coding it badly
MIPS: Do not use smp_processor_id() in preemptible code
MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
.gitignore: Do not track `defconfig` from `make savedefconfig`
io_uring: fix regression with always ignoring signals in io_cqring_wait()
x86/ldt: Disable 16-bit segments on Xen PV
x86/entry/32: Fix #MC and #DB wiring on x86_32
x86/entry/xen: Route #DB correctly on Xen PV
x86/entry, selftests: Further improve user entry sanity checks
x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
i2c: mlxcpld: check correct size of maximum RECV_LEN packet
i2c: add Kconfig help text for slave mode
i2c: slave-eeprom: update documentation
i2c: eg20t: Load module automatically if ID matches
i2c: designware: platdrv: Set class based on DMI
i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
mm/page_alloc: fix documentation error
vmalloc: fix the owner argument for the new __vmalloc_node_range callers
mm/cma.c: use exact_nid true to fix possible per-numa cma leak
...


0e6e255e 07-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove a bogus warning in __submit_bio_noacct_mq

If blk_mq_submit_bio flushes the plug list, bios for other disks can
show up on current->bio_list. As that doesn't involve any stacking of
block device it is entirely harmless and we should not warn about
this case.

Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05a4fed6 07-Jul-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight()

dm-multipath is the only user of blk_mq_queue_inflight(). When
dm-multipath calls blk_mq_queue_inflight() to check if it has
outstanding IO it can get a false negative. The reason for this is
blk_mq_rq_inflight() doesn't consider requests that are no longer
MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't
called or finished yet) as "inflight".

This causes request-based dm-multipath's dm_wait_for_completion() to
return before all outstanding dm-multipath requests have actually
completed. This breaks DM multipath's suspend functionality because
blk-mq requests complete after DM's suspend has finished -- which
shouldn't happen.

Fix this by considering any request not in the MQ_RQ_IDLE state
(so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
blk_mq_rq_inflight().
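
In other words (a hedged sketch; the helper name is made up and the
iterator plumbing is omitted):

    /* A request counts as inflight unless it is back to MQ_RQ_IDLE. */
    static bool rq_counts_as_inflight(struct request *rq)
    {
        return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
    }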

Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29206c63 05-Jul-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe fixes from Christoph:
- Fix crash in multi-path disk add (Christoph)
- Fix ignore of identify error (Sagi)

- Fix a compiler complaint that a function should be static (Wei)

* tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block:
block: make function __bio_integrity_free() static
nvme: fix a crash in nvme_mpath_add_disk
nvme: fix identify error status silent ignore


7cc2a8ea 02-Jul-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Use kvfree_sensitive() for the block keyslot free (Eric)

- Sync blk-mq debugfs flags (Hou)

- Memory leak fix in virtio-blk error path (Hou)

* tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-block:
virtio-blk: free vblk-vqs in error path of virtblk_probe()
block/keyslot-manager: use kvfree_sensitive()
blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags


7c792f33 02-Jul-2020 Christoph Hellwig <hch@lst.de>

block: initialize current->bio_list[1] in __submit_bio_noacct_mq

bio_alloc_bioset references current->bio_list[1], so we need to
initialize it for the blk-mq submission path as well.

Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
Reported-by: Qian Cai <cai@lca.pw>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3197d48a 01-Jul-2020 Wei Yongjun <weiyongjun1@huawei.com>

block: make function __bio_integrity_free() static

Fix sparse build warning:

block/bio-integrity.c:27:6: warning:
symbol '__bio_integrity_free' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4e2f62e5 01-Jul-2020 Jens Axboe <axboe@kernel.dk>

Revert "blk-mq: put driver tag when this request is completed"

This reverts commits the following commits:

37f4a24c2469a10a4c16c641671bd766e276cf9f
723bf178f158abd1ce6069cb049581b3cb003aab
36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

The last one is the culprit, but we have to go a bit deeper to get this
to revert cleanly. There's been a report that this breaks some MMC
setups [1], and also causes an issue with swap [2]. Until this can be
figured out, revert the offending commits.

[1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
[2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6e2fa4dd 01-Jul-2020 Hongnan Li <hongnan.li@linux.alibaba.com>

blk-iolatency: only call ktime_get() if needed

ktime_to_ns(ktime_get()), which is expensive, does not need to be called
if blk_iolatency_enabled() returns false in blkcg_iolatency_done_bio().
Postponing the ktime_to_ns(ktime_get()) execution reduces CPU usage when
blk_iolatency is disabled.
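
A hedged sketch of the deferral (variable names are illustrative):

    u64 now = 0;
    bool enabled = blk_iolatency_enabled(blkiolat);

    /* Only pay for the expensive clock read when iolatency is active. */
    if (enabled)
        now = ktime_to_ns(ktime_get());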

Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a6c35f9 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove direct_make_request

Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ff93ea0c 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: shortcut __submit_bio_noacct for blk-mq drivers

For blk-mq drivers, bios can only be inserted for the same queue. So
bypass the complicated sorting logic in __submit_bio_noacct with
a simpler blk-mq submission helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

566acf2d 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: refactor submit_bio_noacct

Split out a __submit_bio_noacct helper for the actual de-recursion
algorithm, and simplify the loop by using a continue when we can't
enter the queue for a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed00aabd 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: rename generic_make_request to submit_bio_noacct

generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c62b37d9 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: move ->make_request_fn to struct block_device_operations

The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e439ab71 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove the nr_sectors variable in generic_make_request_checks

The variable is only used once, so just open code the bio_sector()
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

833f84e2 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove the NULL queue check in generic_make_request_checks

All registered disks must have a valid queue pointer, so don't bother to
log a warning for that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c8178674 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: tidy up a warning in bio_check_ro

The "generic_make_request: " prefix has no value, and will soon become
stale.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f695ca38 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue argument from blk_queue_split

The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5fc1e8b 27-Apr-2020 Hou Tao <houtao1@huawei.com>

blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()

Just use rq directly; the list_entry_rq() call doesn't make any
sense.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b8cc25d 30-Jun-2020 Colin Ian King <colin.king@canonical.com>

blk-cgroup: clean up indentation

There is a statement that is indented one level too deeply, fix it
by removing a tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

37f4a24c 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: centralise related handling into blk_mq_get_driver_tag

Move the .nr_active update and the request assignment into
blk_mq_get_driver_tag(); both naturally belong to driver tag allocation.

Meanwhile the blk-flush related code is simplified and the flush request
no longer needs to update the request table manually.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

723bf178 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move blk_mq_put_driver_tag() into blk-mq.c

It is used by blk-mq.c only, so move it to the source file.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

570e9b73 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move blk_mq_get_driver_tag into blk-mq.c

blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to
stay there, so move it in preparation for cleaning up the get/put
driver tag code.

Meanwhile hctx_may_queue() is moved to the header file, which is fine
since it is always defined as inline.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6e6fcbc2 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: support batching dispatch in case of io

More and more drivers want batches of requests queued from the block
layer, for example mmc and TCP-based storage drivers; current in-tree
users also include virtio-scsi, virtio-blk and nvme.

For the 'none' scheduler, batching dispatch is already supported.

With an I/O scheduler, however, we take a single request from the
scheduler at a time and pass it to blk_mq_dispatch_rq_list(), which
makes batching dispatch impossible. One reason for this is that we
don't want to hurt sequential I/O performance, because the chance of
I/O merging is reduced when more requests are dequeued from the
scheduler queue.

Try to support batching dispatch for the I/O scheduler case with the
following simple approach:

1) still make sure we can get a budget before dequeueing a request

2) use hctx->dispatch_busy to evaluate whether the queue is busy; if it
is busy, fall back to non-batching dispatch, otherwise dequeue as many
requests as possible from the scheduler and pass them to
blk_mq_dispatch_rq_list().

Regarding 2), a similar policy is already used for 'none', and it turns
out that SCSI SSD performance improves considerably.

In the future we may be able to develop a more intelligent algorithm for
batching dispatch.

Baolin has tested this patch and found that MMC performance is improved[3].

[1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r
[2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/
[3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1fd40b5e 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: pass obtained budget count to blk_mq_dispatch_rq_list

Pass the obtained budget count to blk_mq_dispatch_rq_list() and prepare
for supporting fully batched submission.

With the obtained budget count it is easier to release the extra budgets
when .queue_rq fails.

Meanwhile remove the old 'got_budget' parameter.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bbdb3c5d 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: remove dead check from blk_mq_dispatch_rq_list

When BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE is returned from
.queue_rq, the 'list' variable always holds the rq that wasn't
queued to the LLD successfully.

So blk_mq_dispatch_rq_list() always returns false from the
'!list_empty(list)' branch.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

75383524 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: move getting driver tag and budget into one helper

Move the code for getting the driver tag and the budget into one helper,
so blk_mq_dispatch_rq_list gets a bit simpler and easier to read.

Meanwhile move the updating of 'no_tag' and 'no_budget_available' into
the branch that handles partial dispatch, because that branch is the
only consumer of the two local variables.

Also rename the parameter 'got_budget' to 'ask_budget'.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

445874e8 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: pass hctx to blk_mq_dispatch_rq_list

All requests in the 'list' of blk_mq_dispatch_rq_list belong to the same
hctx, so it is better to pass the hctx instead of the request queue,
because blk-mq's dispatch target is the hctx rather than the request queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

65c76369 30-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: pass request queue into get/put budget callback

The blk-mq budget is abstracted from SCSI's device queue depth, and it is
always per request queue rather than per hctx.

It is quite absurd to get a budget from one hctx and then dequeue a
request from the scheduler queue that may not belong to that hctx, at
least for bfq and deadline.

So fix the mess and always pass the request queue to the get/put budget
callbacks.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3e20aa96 16-Jun-2020 Eric Biggers <ebiggers@google.com>

block/keyslot-manager: use kvfree_sensitive()

Make blk_ksm_destroy() use the kvfree_sensitive() function (which was
introduced in v5.8-rc1) instead of open-coding it.
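
A minimal sketch of the pattern, with a hypothetical buffer standing in
for the keyslot manager's key memory:

#include <linux/mm.h>		/* kvfree_sensitive() */

static void demo_free_secret(void *buf, size_t size)
{
	/* Replaces the open-coded memzero_explicit() + kvfree() pair:
	 * the helper zeroes the buffer before freeing it. */
	kvfree_sensitive(buf, size);
}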

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

42fdc5e4 29-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove the BLK_MQ_REQ_INTERNAL flag

Just check for a non-NULL elevator directly to make the code more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

36a3df5a 29-Jun-2020 Ming Lei <ming.lei@redhat.com>

blk-mq: put driver tag when this request is completed

It is natural to release the driver tag when the request is completed by
the LLD or the device, since the tag exists for LLD use.

One big benefit is that the released tag can be re-used sooner, since
bio_endio() may take a long time.

Meanwhile we don't need to release the driver tag for flush requests.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a2e83ef9 27-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-cgroup: remove a dead check in blk_throtl_bio

bios must have a valid block cgroup association by the time they are submitted.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db18a53e 27-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-cgroup: remove blkcg_bio_issue_check

blkcg_bio_issue_check is a giant inline function that does three entirely
different things. Factor out the blk-cgroup related bio initialization
into a new helper, and open code the sequence in the only caller,
relying on the fact that all the actual functionality is stubbed out for
non-cgroup builds.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

93b80638 27-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-cgroup: move rcu locking from blkcg_bio_issue_check to blk_throtl_bio

The only thing in blkcg_bio_issue_check that needs to be under
rcu_read_lock is blk_throtl_bio, so move the locking there.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13c7863d 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: move the initial blkg lookup into blkg_tryget_closest

By moving the initial blkg lookup into blkg_tryget_closest we get
a nicely self-contained routine that does all the RCU locking.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a5b97526 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: bypass blkg_tryget_closest for the root_blkg

The root_blkg is only torn down at the very end of removing a queue,
so in the I/O submission path it always has a live reference and we
can just grab another one using blkg_get instead of doing a tryget
and parent walk that won't lead anywhere.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8c546287 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: merge blkg_lookup_create and __blkg_lookup_create

No good reason to keep these two functions split.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28fc591f 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: move the bio cgroup association helpers to blk-cgroup.c

Keep the cgroup code together.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a18b9b15 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: move bio_associate_blkg_from_page to mm/page_io.c

bio_associate_blkg_from_page is a special purpose helper for swap bios
that doesn't need access to bio internals. Move it to the swap code
instead of having it in bio.c.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2badf06c 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: merge __bio_associate_blkg into bio_associate_blkg_from_css

Merge __bio_associate_blkg into its only caller, which allows us to slightly
reduce the RCU critical section and better explain the code flow.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d92c370a 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: really clone the block cgroup in bio_clone_blkg_association

bio_clone_blkg_association is supposed to clone the association, but
actually ends up doing a search with a tryget. As we know we have a
reference on the source cgroup, just take an unconditional additional
reference to it and call it a day. That also removes the need for
an RCU critical section.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db9819c7 27-Jun-2020 Christoph Hellwig <hch@lst.de>

block: remove bio_disassociate_blkg

bio_disassociate_blkg has two callers, of which one immediately assigns
a new value to ->bi_blkg. Just open code the function in the two callers.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfe373f6 27-Apr-2020 Hou Tao <houtao1@huawei.com>

blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags

Otherwise there may be magic numbers in /sys/kernel/debug/block/*/state.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

826f2f48 28-Jun-2020 Guo Xuenan <guoxuenan@huawei.com>

blk-rq-qos: remove redundant finish_wait in rq_qos_wait()

There is no need to call finish_wait twice after acquiring 'inflight'.

Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9b8d0207 27-Jun-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.8-2020-06-26' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Christoph:
- multipath deadlock fixes (Anton)
- NUMA fixes (Max)
- RDMA completion vector fix (Max)
- IO deadlock fix (Sagi)
- multipath reference fix (Sagi)
- NS mutation fix (Sagi)

- Use right allocator when freeing bip in error path (Chengguang)

* tag 'block-5.8-2020-06-26' of git://git.kernel.dk/linux-block:
nvme-multipath: fix bogus request queue reference put
nvme-multipath: fix deadlock due to head->lock
nvme: don't protect ns mutation with ns->head->lock
nvme-multipath: fix deadlock between ana_work and scan_work
nvme: fix possible deadlock when I/O is blocked
nvme-rdma: assign completion vector correctly
nvme-loop: initialize tagset numa value to the value of the ctrl
nvme-tcp: initialize tagset numa value to the value of the ctrl
nvme-pci: initialize tagset numa value to the value of the ctrl
nvme-pci: override the value of the controller's numa node
nvme: set initial value for controller's numa node
block: release bip in a right way in error path


f3bdc62f 17-Jun-2020 Jan Kara <jack@suse.cz>

blktrace: Provide event for request merging

Currently blk-mq does not report any event when two requests get merged
in the elevator. This results in a difficult-to-understand sequence
of events like:

...
8,0 34 1579 0.608765271 2718 I WS 215023504 + 40 [dbench]
8,0 34 1584 0.609184613 2719 A WS 215023544 + 56 <- (8,4) 2160568
8,0 34 1585 0.609184850 2719 Q WS 215023544 + 56 [dbench]
8,0 34 1586 0.609188524 2719 G WS 215023544 + 56 [dbench]
8,0 3 602 0.609684162 773 D WS 215023504 + 96 [kworker/3:1H]
8,0 34 1591 0.609843593 0 C WS 215023504 + 96 [0]

and you can only guess (after quite some headscratching, since the above
excerpt is intermixed with a lot of other IO) that request 215023544+56
got merged into request 215023504+40. Provide a proper event for request
merging, as we used to do in the legacy block layer.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f61d6e25 19-Jun-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

blk-iocost: Use struct_size() in kzalloc_node()

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle and, audited and
fixed manually.
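
A hedged sketch of the pattern, with a made-up structure standing in for
the actual iocost allocation:

#include <linux/overflow.h>	/* struct_size() */
#include <linux/slab.h>

struct demo {
	int nr;
	u64 vals[];			/* flexible array member */
};

static struct demo *demo_alloc(int nr, int node)
{
	struct demo *d;

	/* struct_size() computes sizeof(*d) plus nr trailing elements,
	 * with overflow checking, instead of an open-coded sum. */
	d = kzalloc_node(struct_size(d, vals, nr), GFP_KERNEL, node);
	if (d)
		d->nr = nr;
	return d;
}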

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1f4fe21c 19-Jun-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

block: bio: Use struct_size() in kmalloc()

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Jens Axboe <axboe@kernel.dk>

85e0cbbb 19-Jun-2020 Luis Chamberlain <mcgrof@kernel.org>

block: create the request_queue debugfs_dir on registration

We were creating the request_queue debugfs_dir only for
make_request block drivers (multiqueue), but never for
request-based block drivers. We did this because we were only
creating additional non-blktrace debugfs files in that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
for blktrace. Besides being an eyesore, this exposes
request-based block drivers to the same fragile debugfs
race that used to exist with make_request block drivers:
if we start adding files to that directory we can later
run into a double removal of dentries on the directory
if we don't handle this carefully in blktrace.

Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Also rename the mutex to
reflect the fact that it is used outside of the blktrace context.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e8c7d14a 19-Jun-2020 Luis Chamberlain <mcgrof@kernel.org>

block: revert back to synchronous request_queue removal

Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d99523560 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.

The last reference to the request_queue must not be dropped from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device, the new one may not be added as it is still being removed. This was
expected behavior before, but it now fails as the device is still present
and still busy. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().

Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Suggested-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

763b5892 19-Jun-2020 Luis Chamberlain <mcgrof@kernel.org>

block: clarify context for refcount increment helpers

Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called. We make this
explicit in the places where we may sleep with might_sleep().
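
A minimal sketch of how such a context expectation is made explicit (the
helper shown is illustrative, not one of the actual block layer helpers):

#include <linux/kernel.h>	/* might_sleep() */

static void demo_get_ref(void)
{
	/* Documents that callers must be in a context that can sleep;
	 * with CONFIG_DEBUG_ATOMIC_SLEEP this also warns at runtime if
	 * called from atomic context. */
	might_sleep();

	/* ... take the reference ... */
}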

We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5bd357c 19-Jun-2020 Luis Chamberlain <mcgrof@kernel.org>

block: add docs for gendisk / request_queue refcount helpers

This adds documentation for the gendisk / request_queue refcount
helpers.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

40d09b53 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: add a new blk_mq_complete_request_remote API

This is a variant of blk_mq_complete_request that only completes
the request if it needs to be bounced to another CPU or a softirq. If
the request can be completed locally, the function returns false and lets
the driver complete it without requiring an indirect function call.
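
A hedged sketch of the intended driver-side usage (the local completion
helper is hypothetical):

#include <linux/blk-mq.h>

static void demo_complete_locally(struct request *rq)
{
	/* ... driver-specific completion work ... */
	blk_mq_end_request(rq, BLK_STS_OK);
}

static void demo_end_io(struct request *rq)
{
	/* Returns true if the completion had to be bounced to another
	 * CPU or a softirq; false means we can finish it right here
	 * without an indirect ->complete call. */
	if (!blk_mq_complete_request_remote(rq))
		demo_complete_locally(rq);
}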

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

96339526 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a blk_mq_complete_need_ipi helper

Add a helper to decide if we can complete locally or need an IPI.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4c8fc196 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove the get_cpu/put_cpu pair in blk_mq_complete_request

We don't really care if we get migrated during the I/O completion.
In the worst case we either perform an IPI that wasn't required, or
complete the request on a CPU which we just migrated off.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

15f73f5b 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: move failure injection out of blk_mq_complete_request

Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superfluous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.
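
A hedged sketch of the resulting driver-side pattern (a generic
completion path, not any specific driver):

#include <linux/blk-mq.h>

static void demo_complete(struct request *rq)
{
	/* Fault injection now lives in the driver, so obvious error
	 * handlers can skip it entirely. */
	if (blk_should_fake_timeout(rq->q))
		return;		/* pretend the completion was lost */

	blk_mq_complete_request(rq);
}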

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d391a7a3 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: merge the softirq vs non-softirq IPI logic

Both the softirq path for single queue devices and the multi-queue
completion handler share the same logic to figure out if we need an
IPI for the completion and eventually issue it. Merge the two
versions into a single unified code path.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d6cc464c 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: short cut the IPI path in blk_mq_force_complete_rq for !SMP

Let the compiler optimize out the entire IPI path, given that we are
obviously not going to use it.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6aab1da6 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: complete polled requests directly

Even for single queue devices there is no point in offloading a polled
completion to the softirq, given that blk_mq_force_complete_rq is called
from the polling thread in that case and thus there are no starvation
issues.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dea6f399 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: remove raise_blk_irq

By open coding raise_blk_irq in the only caller, and replacing the
ifdef CONFIG_SMP with an IS_ENABLED check, the flow in the caller
can be significantly simplified.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

115243f5 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: factor out a helper to raise the block softirq

Add a helper to deduplicate the logic that raises the block softirq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c3077b5d 11-Jun-2020 Christoph Hellwig <hch@lst.de>

blk-mq: merge blk-softirq.c into blk-mq.c

__blk_complete_request is only called from the blk-mq code, and
duplicates a lot of code from blk-mq.c. Move it there to prepare
for better code sharing and simplifications.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b8eb629 24-Jun-2020 Chengguang Xu <cgxu519@mykernel.net>

block: release bip in a right way in error path

Release the bip using kfree() in the error path when it was allocated
by kmalloc().

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a473e83 04-Jun-2020 Jens Axboe <axboe@kernel.dk>

block: provide plug based way of signaling forced no-wait semantics

Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.
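
A loose sketch of the caller-side idea, assuming the 'nowait' flag this
patch adds to struct blk_plug (later kernels reworked this interface):

#include <linux/blkdev.h>

static void demo_submit_nowait(void)
{
	struct blk_plug plug;

	blk_start_plug(&plug);
	plug.nowait = true;	/* bios issued under this plug get REQ_NOWAIT */

	/* ... submit_bio() calls ... */

	blk_finish_plug(&plug);
}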

Signed-off-by: Jens Axboe <axboe@kernel.dk>

d2b1c81f 19-Jun-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Use import_uuid() where appropriate (Andy)

- bcache fixes (Coly, Mauricio, Zhiqiang)

- blktrace sparse warnings fix (Jan)

- blktrace concurrent setup fix (Luis)

- blkdev_get use-after-free fix (Jason)

- Ensure all blk-mq maps are updated (Weiping)

- Loop invalidate bdev fix (Zheng)

* tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
block: make function 'kill_bdev' static
loop: replace kill_bdev with invalidate_bdev
partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense
block: update hctx map when use multiple maps
blktrace: Avoid sparse warnings when assigning q->blk_trace
blktrace: break out of blktrace setup on concurrent calls
block: Fix use-after-free in blkdev_get()
trace/events/block.h: drop kernel-doc for dropped function parameter
blk-mq: Remove redundant 'return' statement
bcache: pr_info() format clean up in bcache_device_init()
bcache: use delayed kworker fo asynchronous devices registration
bcache: check and adjust logical block size for backing devices
bcache: fix potential deadlock problem in btree_gc_coalesce


bc163c20 22-Apr-2020 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense

There is a specific API to treat raw data as a UUID, i.e. import_uuid().
Use it instead of uuid_copy() with explicit casting.
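
A minimal sketch of the replacement (the raw buffer name is illustrative):

#include <linux/uuid.h>

static void demo_import(uuid_t *dst, const u8 raw[UUID_SIZE])
{
	/* Before: uuid_copy(dst, (const uuid_t *)raw); */
	import_uuid(dst, raw);	/* interprets the raw bytes as a UUID */
}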

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe35ec58 17-Jun-2020 Weiping Zhang <zhangweiping@didiglobal.com>

block: update hctx map when use multiple maps

There is an issue when tuning the number of read and write queues
while the total queue count stays the same: hctx->type cannot
be updated, since __blk_mq_update_nr_hw_queues returns directly
if the total queue count has not changed.

Reproduce:

dmesg | grep "default/read/poll"
[ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues
cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
48 default

tune the write queues to 24:
echo 24 > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller

dmesg | grep "default/read/poll"
[ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues

cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
48 default

The driver's hardware queue mapping is not the same as the block layer's.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8a631c26 28-May-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

block: Replace zero-length array with flexible-array

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
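
A minimal before/after sketch of the conversion (the structure is
illustrative):

#include <linux/types.h>

/* Before: GNU zero-length array */
struct demo_old {
	size_t count;
	u8 data[0];
};

/* After: C99 flexible array member */
struct demo_new {
	size_t count;
	u8 data[];		/* must be the last member */
};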

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

a8a5e383 15-Jun-2020 Baolin Wang <baolin.wang@linux.alibaba.com>

blk-mq: Remove redundant 'return' statement

blk_mq_all_tag_iter() is a void function, so remove the redundant
'return' statement from it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6adc19fd 13-Jun-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

- fix build rules in binderfs sample

- fix build errors when Kbuild recurses to the top Makefile

- convert '---help---' in Kconfig to 'help'

* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
treewide: replace '---help---' in Kconfig files with 'help'
kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
samples: binderfs: really compile this sample and fix build issues