History log of /linux-master/block/bfq-cgroup.c
Revision Date Author Comments
# 08420cf7 15-Jan-2024 Jens Axboe <axboe@kernel.dk>

block: add blk_time_get_ns() and blk_time_get() helpers

Convert any user of ktime_get_ns() to use blk_time_get_ns(), and
ktime_get() to blk_time_get(), so we have a unified API for querying the
current time in nanoseconds or as ktime.

No functional changes intended, this patch just wraps ktime_get_ns()
and ktime_get() with a block helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# faffaab2 12-Apr-2023 Tejun Heo <tj@kernel.org>

blkcg: Restructure blkg_conf_prep() and friends

We want to support lazy init of rq-qos policies so that iolatency is enabled
lazily on configuration instead of gendisk initialization. The way blkg
config helpers are structured now is a bit awkward for that. Let's
restructure:

* blkcg_conf_open_bdev() is renamed to blkg_conf_open_bdev(). The blkcg_
prefix was used because the bdev opening step is blkg-independent.
However, the distinction is too subtle and confuses more than helps. Let's
switch to blkg prefix so that it's consistent with the type and other
helper names.

* struct blkg_conf_ctx now remembers the original input string and is always
initialized by the new blkg_conf_init().

* blkg_conf_open_bdev() is updated to take a pointer to blkg_conf_ctx like
blkg_conf_prep() and can be called multiple times safely. Instead of
modifying the double pointer to input string directly,
blkg_conf_open_bdev() now sets blkg_conf_ctx->body.

* blkg_conf_finish() is renamed to blkg_conf_exit() for symmetry and now
must be called on all blkg_conf_ctx's which were initialized with
blkg_conf_init().

Combined, this allows the users to either open the bdev first or do it
altogether with blkg_conf_prep() which will help implementing lazy init of
rq-qos policies.

blkg_conf_init/exit() will also be used implement synchronization against
device removal. This is necessary because iolat / iocost are configured
through cgroupfs instead of one of the files under /sys/block/DEVICE. As
cgroupfs operations aren't synchronized with block layer, the lazy init and
other configuration operations may race against device removal. This patch
makes blkg_conf_init/exit() used consistently for all cgroup-orginating
configurations making them a good place to implement explicit
synchronization.

Users are updated accordingly. No behavior change is intended by this patch.

v2: bfq wasn't updated in v1 causing a build error. Fixed.

v3: Update the description to include future use of blkg_conf_init/exit() as
synchronization points.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230413000649.115785-3-tj@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 650e2cb5 06-Apr-2023 Chengming Zhou <zhouchengming@bytedance.com>

blk-cgroup: delete cpd_init_fn of blkcg_policy

blkcg_policy cpd_init_fn() is used to just initialize some default
fields of policy data, which is enough to do in cpd_alloc_fn().

This patch delete the only user bfq_cpd_init(), and remove cpd_init_fn
from blkcg_policy.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230406145050.49914-4-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e9f2f3f5 06-Apr-2023 Chengming Zhou <zhouchengming@bytedance.com>

block, bfq: remove BFQ_WEIGHT_LEGACY_DFL

BFQ_WEIGHT_LEGACY_DFL is the same as CGROUP_WEIGHT_DFL, which means
we don't need cpd_bind_fn() callback to update default weight when
attached to a hierarchy.

This patch remove BFQ_WEIGHT_LEGACY_DFL and cpd_bind_fn().

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230406145050.49914-2-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a06377c5 14-Feb-2023 Christoph Hellwig <hch@lst.de>

Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"

This reverts commit 84d7d462b16dd5f0bf7c7ca9254bf81db2c952a2.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230214183308.1658775-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1231039d 14-Feb-2023 Christoph Hellwig <hch@lst.de>

Revert "blk-cgroup: move the cgroup information to struct gendisk"

This reverts commit 3f13ab7c80fdb0ada86a8e3e818960bc1ccbaa59 as a patch
it depends on caused a few problems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230214183308.1658775-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f37bf75c 02-Feb-2023 Yu Kuai <yukuai3@huawei.com>

block, bfq: cleanup 'bfqg->online'

After commit dfd6200a0954 ("blk-cgroup: support to track if policy is
online"), there is no need to do this again in bfq.

However, 'pd->online' is not protected by 'bfqd->lock', in order to make
sure bfq won't see that 'pd->online' is still set after bfq_pd_offline(),
clear it before bfq_pd_offline() is called. This is fine because other
polices doesn't use 'pd->online' and bfq_pd_offline() will move active
bfqq to root cgroup anyway.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230202134913.2364549-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f13ab7c 03-Feb-2023 Christoph Hellwig <hch@lst.de>

blk-cgroup: move the cgroup information to struct gendisk

cgroup information only makes sense on a live gendisk that allows
file system I/O (which includes the raw block device). So move over
the cgroup related members.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0a0b4f79 03-Feb-2023 Christoph Hellwig <hch@lst.de>

blk-cgroup: pass a gendisk to pd_alloc_fn

No need to the request_queue here, pass a gendisk and extract the
node ids from that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 40e4996e 03-Feb-2023 Christoph Hellwig <hch@lst.de>

blk-cgroup: pass a gendisk to blkcg_{de,}activate_policy

Prepare for storing the blkcg information in the gendisk instead of
the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 84d7d462 03-Feb-2023 Christoph Hellwig <hch@lst.de>

blk-cgroup: pin the gendisk in struct blkcg_gq

Currently each blkcg_gq holds a request_queue reference, which is what
is used in the policies. But a lot of these interfaces will move over to
use a gendisk, so store a disk in struct blkcg_gq and hold a reference to
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2d31c684 03-Jan-2023 Davide Zini <davidezini2@gmail.com>

block, bfq: inject I/O to underutilized actuators

The main service scheme of BFQ for sync I/O is serving one sync
bfq_queue at a time, for a while. In particular, BFQ enforces this
scheme when it deems the latter necessary to boost throughput or
to preserve service guarantees. Unfortunately, when BFQ enforces
this policy, only one actuator at a time gets served for a while,
because each bfq_queue contains I/O only for one actuator. The
other actuators may remain underutilized.

Actually, BFQ may serve (inject) extra I/O, taken from other
bfq_queues, in parallel with that of the in-service queue. This
injection mechanism may provide the ground for dealing also with
the above actuator-underutilization problem. Yet BFQ does not take
the actuator load into account when choosing which queue to pick
extra I/O from. In addition, BFQ may happen to inject extra I/O
only when the in-service queue is temporarily empty.

In view of these facts, this commit extends the
injection mechanism in such a way that the latter:
(1) takes into account also the actuator load;
(2) checks such a load on each dispatch, and injects I/O for an
underutilized actuator, if there is one and there is I/O for it.

To perform the check in (2), this commit introduces a load
threshold, currently set to 4. A linear scan of each actuator is
performed, until an actuator is found for which the following two
conditions hold: the load of the actuator is below the threshold,
and there is at least one non-in-service queue that contains I/O
for that actuator. If such a pair (actuator, queue) is found, then
the head request of that queue is returned for dispatch, instead
of the head request of the in-service queue.

We have set the threshold, empirically, to the minimum possible
value for which an actuator is fully utilized, or close to be
fully utilized. By doing so, injected I/O 'steals' as few
drive-queue slots as possibile to the in-service queue. This
reduces as much as possible the probability that the service of
I/O from the in-service bfq_queue gets delayed because of slot
exhaustion, i.e., because all the slots of the drive queue are
filled with I/O injected from other queues (NCQ provides for 32
slots).

This new mechanism also counters actuator underutilization in the
case of asymmetric configurations of bfq_queues. Namely if there
are few bfq_queues containing I/O for some actuators and many
bfq_queues containing I/O for other actuators. Or if the
bfq_queues containing I/O for some actuators have lower weights
than the other bfq_queues.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9778369a 03-Jan-2023 Paolo Valente <paolo.valente@linaro.org>

block, bfq: split sync bfq_queues on a per-actuator basis

Single-LUN multi-actuator SCSI drives, as well as all multi-actuator
SATA drives appear as a single device to the I/O subsystem [1]. Yet
they address commands to different actuators internally, as a function
of Logical Block Addressing (LBAs). A given sector is reachable by
only one of the actuators. For example, Seagate’s Serial Advanced
Technology Attachment (SATA) version contains two actuators and maps
the lower half of the SATA LBA space to the lower actuator and the
upper half to the upper actuator.

Evidently, to fully utilize actuators, no actuator must be left idle
or underutilized while there is pending I/O for it. The block layer
must somehow control the load of each actuator individually. This
commit lays the ground for allowing BFQ to provide such a per-actuator
control.

BFQ associates an I/O-request sync bfq_queue with each process doing
synchronous I/O, or with a group of processes, in case of queue
merging. Then BFQ serves one bfq_queue at a time. While in service, a
bfq_queue is emptied in request-position order. Yet the same process,
or group of processes, may generate I/O for different actuators. In
this case, different streams of I/O (each for a different actuator)
get all inserted into the same sync bfq_queue. So there is basically
no individual control on when each stream is served, i.e., on when the
I/O requests of the stream are picked from the bfq_queue and
dispatched to the drive.

This commit enables BFQ to control the service of each actuator
individually for synchronous I/O, by simply splitting each sync
bfq_queue into N queues, one for each actuator. In other words, a sync
bfq_queue is now associated to a pair (process, actuator). As a
consequence of this split, the per-queue proportional-share policy
implemented by BFQ will guarantee that the sync I/O generated for each
actuator, by each process, receives its fair share of service.

This is just a preparatory patch. If the I/O of the same process
happens to be sent to different queues, then each of these queues may
undergo queue merging. To handle this event, the bfq_io_cq data
structure must be properly extended. In addition, stable merging must
be disabled to avoid loss of control on individual actuators. Finally,
also async queues must be split. These issues are described in detail
and addressed in next commits. As for this commit, although multiple
per-process bfq_queues are provided, the I/O of each process or group
of processes is still sent to only one queue, regardless of the
actuator the I/O is for. The forwarding to distinct bfq_queues will be
enabled after addressing the above issues.

[1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b600de2d 29-Jan-2023 Yu Kuai <yukuai3@huawei.com>

block, bfq: fix uaf for bfqq in bic_set_bfqq()

After commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'"),
bic->bfqq will be accessed in bic_set_bfqq(), however, in some context
bic->bfqq will be freed, and bic_set_bfqq() is called with the freed
bic->bfqq.

Fix the problem by always freeing bfqq after bic_set_bfqq().

Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230130014136.591038-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 216f7647 03-Jan-2023 Yu Kuai <yukuai3@huawei.com>

block, bfq: switch 'bfqg->ref' to use atomic refcount apis

The updating of 'bfqg->ref' should be protected by 'bfqd->lock', however,
during code review, we found that bfq_pd_free() update 'bfqg->ref'
without holding the lock, which is problematic:

1) bfq_pd_free() triggered by removing cgroup is called asynchronously;
2) bfqq will grab bfqg reference, and exit bfqq will drop the reference,
which can concurrent with 1).

Unfortunately, 'bfqd->lock' can't be held here because 'bfqd' might already
be freed in bfq_pd_free(). Fix the problem by using atomic refcount apis.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230103084755.1256479-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 337366e0 13-Dec-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: replace 0/1 with false/true in bic apis

Just to make the code a litter cleaner, there are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221214033155.3455754-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 452af7dc 13-Dec-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: don't return bfqg from __bfq_bic_change_cgroup()

The return value is not used, hence remove it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221214033155.3455754-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aa625117 01-Nov-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: don't declare 'bfqd' as type 'void *' in bfq_group

Prevent unnecessary format conversion for bfqg->bfqd in multiple
places.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 60a6e10c 16-Sep-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: record how many queues have pending requests

Prepare to refactor the counting of 'num_groups_with_pending_reqs'.

Add a counter in bfq_group, update it while tracking if bfqq have pending
requests and when bfq_bfqq_move() is called.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f02be900 08-Nov-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: fix null pointer dereference in bfq_bio_bfqg()

Out test found a following problem in kernel 5.10, and the same problem
should exist in mainline:

BUG: kernel NULL pointer dereference, address: 0000000000000094
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4
Workqueue: kthrotld blk_throtl_dispatch_work_fn
RIP: 0010:bfq_bio_bfqg+0x52/0xc0
Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
bfq_bic_update_cgroup+0x3c/0x350
? ioc_create_icq+0x42/0x270
bfq_init_rq+0xfd/0x1060
bfq_insert_requests+0x20f/0x1cc0
? ioc_create_icq+0x122/0x270
blk_mq_sched_insert_requests+0x86/0x1d0
blk_mq_flush_plug_list+0x193/0x2a0
blk_flush_plug_list+0x127/0x170
blk_finish_plug+0x31/0x50
blk_throtl_dispatch_work_fn+0x151/0x190
process_one_work+0x27c/0x5f0
worker_thread+0x28b/0x6b0
? rescuer_thread+0x590/0x590
kthread+0x153/0x1b0
? kthread_flush_work+0x170/0x170
ret_from_fork+0x1f/0x30
Modules linked in:
CR2: 0000000000000094
---[ end trace e2e59ac014314547 ]---
RIP: 0010:bfq_bio_bfqg+0x52/0xc0
Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Root cause is quite complex:

1) use bfq elevator for the test device.
2) create a cgroup CG
3) config blk throtl in CG

blkg_conf_prep
blkg_create

4) create a thread T1 and issue async io in CG:

bio_init
bio_associate_blkg
...
submit_bio
submit_bio_noacct
blk_throtl_bio -> io is throttled
// io submit is done

5) switch elevator:

bfq_exit_queue
blkcg_deactivate_policy
list_for_each_entry(blkg, &q->blkg_list, q_node)
blkg->pd[] = NULL
// bfq policy is removed

5) thread t1 exist, then remove the cgroup CG:

blkcg_unpin_online
blkcg_destroy_blkgs
blkg_destroy
list_del_init(&blkg->q_node)
// blkg is removed from queue list

6) switch elevator back to bfq

bfq_init_queue
bfq_create_group_hierarchy
blkcg_activate_policy
list_for_each_entry_reverse(blkg, &q->blkg_list)
// blkg is removed from list, hence bfq policy is still NULL

7) throttled io is dispatched to bfq:

bfq_insert_requests
bfq_init_rq
bfq_bic_update_cgroup
bfq_bio_bfqg
bfqg = blkg_to_bfqg(blkg)
// bfqg is NULL because bfq policy is NULL

The problem is only possible in bfq because only bfq can be deactivated and
activated while queue is online, while others can only be deactivated while
the device is removed.

Fix the problem in bfq by checking if blkg is online before calling
blkg_to_bfqg().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221108103434.2853269-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c2090eab 15-Aug-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: remove unused functions

While doing code coverage testing(CONFIG_BFQ_CGROUP_DEBUG is disabled), we
found that some functions doesn't have caller, thus remove them.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dc469ba2 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

block/bfq: Use the new blk_opf_t type

Use the new blk_opf_t type for arguments and variables that represent
request flags or a bitwise combination of a request operation and
request flags. Rename those variables from 'op' into 'opf'.

This patch does not change any functionality.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-8-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1d87be82 17-Jun-2022 Bart Van Assche <bvanassche@acm.org>

block: bfq: Fix kernel-doc headers

Fix the following warnings:

block/bfq-cgroup.c:721: warning: Function parameter or member 'bfqg' not described in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:721: warning: Excess function parameter 'blkcg' description in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:870: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_leaf_entity'
block/bfq-cgroup.c:900: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_active_queues'

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617210859.106623-1-bvanassche@acm.org


# c28c49b0 17-Jun-2022 Bart Van Assche <bvanassche@acm.org>

block: bfq: Remove an unused function definition

This patch is the result of the analysis of a sparse report.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617204433.102022-1-bvanassche@acm.org


# 075a53b7 31-Mar-2022 Jan Kara <jack@suse.cz>

bfq: Make sure bfqg for which we are queueing requests is online

Bios queued into BFQ IO scheduler can be associated with a cgroup that
was already offlined. This may then cause insertion of this bfq_group
into a service tree. But this bfq_group will get freed as soon as last
bio associated with it is completed leading to use after free issues for
service tree users. Fix the problem by making sure we always operate on
online bfq_group. If the bfq_group associated with the bio is not
online, we pick the first online parent.

CC: stable@vger.kernel.org
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e54a249 31-Mar-2022 Jan Kara <jack@suse.cz>

bfq: Get rid of __bio_blkcg() usage

BFQ usage of __bio_blkcg() is a relict from the past. Furthermore if bio
would not be associated with any blkcg, the usage of __bio_blkcg() in
BFQ is prone to races with the task being migrated between cgroups as
__bio_blkcg() calls at different places could return different blkcgs.

Convert BFQ to the new situation where bio->bi_blkg is initialized in
bio_set_dev() and thus practically always valid. This allows us to save
blkcg_gq lookup and noticeably simplify the code.

CC: stable@vger.kernel.org
Fixes: 0fe061b9f03c ("blkcg: fix ref count issue with bio_blkcg() using task_css")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 09f87186 31-Mar-2022 Jan Kara <jack@suse.cz>

bfq: Track whether bfq_group is still online

Track whether bfq_group is still online. We cannot rely on
blkcg_gq->online because that gets cleared only after all policies are
offlined and we need something that gets updated already under
bfqd->lock when we are cleaning up our bfq_group to be able to guarantee
that when we see online bfq_group, it will stay online while we are
holding bfqd->lock lock.

CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3bc5e683 31-Mar-2022 Jan Kara <jack@suse.cz>

bfq: Split shared queues on move between cgroups

When bfqq is shared by multiple processes it can happen that one of the
processes gets moved to a different cgroup (or just starts submitting IO
for different cgroup). In case that happens we need to split the merged
bfqq as otherwise we will have IO for multiple cgroups in one bfqq and
we will just account IO time to wrong entities etc.

Similarly if the bfqq is scheduled to merge with another bfqq but the
merge didn't happen yet, cancel the merge as it need not be valid
anymore.

CC: stable@vger.kernel.org
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8410f709 28-Jan-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: don't move oom_bfqq

Our test report a UAF:

[ 2073.019181] ==================================================================
[ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
[ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
[ 2073.019192]
[ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
[ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2073.019200] Call trace:
[ 2073.019203] dump_backtrace+0x0/0x310
[ 2073.019206] show_stack+0x28/0x38
[ 2073.019210] dump_stack+0xec/0x15c
[ 2073.019216] print_address_description+0x68/0x2d0
[ 2073.019220] kasan_report+0x238/0x2f0
[ 2073.019224] __asan_store8+0x88/0xb0
[ 2073.019229] __bfq_put_async_bfqq+0xa0/0x168
[ 2073.019233] bfq_put_async_queues+0xbc/0x208
[ 2073.019236] bfq_pd_offline+0x178/0x238
[ 2073.019240] blkcg_deactivate_policy+0x1f0/0x420
[ 2073.019244] bfq_exit_queue+0x128/0x178
[ 2073.019249] blk_mq_exit_sched+0x12c/0x160
[ 2073.019252] elevator_exit+0xc8/0xd0
[ 2073.019256] blk_exit_queue+0x50/0x88
[ 2073.019259] blk_cleanup_queue+0x228/0x3d8
[ 2073.019267] null_del_dev+0xfc/0x1e0 [null_blk]
[ 2073.019274] null_exit+0x90/0x114 [null_blk]
[ 2073.019278] __arm64_sys_delete_module+0x358/0x5a0
[ 2073.019282] el0_svc_common+0xc8/0x320
[ 2073.019287] el0_svc_handler+0xf8/0x160
[ 2073.019290] el0_svc+0x10/0x218
[ 2073.019291]
[ 2073.019294] Allocated by task 14163:
[ 2073.019301] kasan_kmalloc+0xe0/0x190
[ 2073.019305] kmem_cache_alloc_node_trace+0x1cc/0x418
[ 2073.019308] bfq_pd_alloc+0x54/0x118
[ 2073.019313] blkcg_activate_policy+0x250/0x460
[ 2073.019317] bfq_create_group_hierarchy+0x38/0x110
[ 2073.019321] bfq_init_queue+0x6d0/0x948
[ 2073.019325] blk_mq_init_sched+0x1d8/0x390
[ 2073.019330] elevator_switch_mq+0x88/0x170
[ 2073.019334] elevator_switch+0x140/0x270
[ 2073.019338] elv_iosched_store+0x1a4/0x2a0
[ 2073.019342] queue_attr_store+0x90/0xe0
[ 2073.019348] sysfs_kf_write+0xa8/0xe8
[ 2073.019351] kernfs_fop_write+0x1f8/0x378
[ 2073.019359] __vfs_write+0xe0/0x360
[ 2073.019363] vfs_write+0xf0/0x270
[ 2073.019367] ksys_write+0xdc/0x1b8
[ 2073.019371] __arm64_sys_write+0x50/0x60
[ 2073.019375] el0_svc_common+0xc8/0x320
[ 2073.019380] el0_svc_handler+0xf8/0x160
[ 2073.019383] el0_svc+0x10/0x218
[ 2073.019385]
[ 2073.019387] Freed by task 72584:
[ 2073.019391] __kasan_slab_free+0x120/0x228
[ 2073.019394] kasan_slab_free+0x10/0x18
[ 2073.019397] kfree+0x94/0x368
[ 2073.019400] bfqg_put+0x64/0xb0
[ 2073.019404] bfqg_and_blkg_put+0x90/0xb0
[ 2073.019408] bfq_put_queue+0x220/0x228
[ 2073.019413] __bfq_put_async_bfqq+0x98/0x168
[ 2073.019416] bfq_put_async_queues+0xbc/0x208
[ 2073.019420] bfq_pd_offline+0x178/0x238
[ 2073.019424] blkcg_deactivate_policy+0x1f0/0x420
[ 2073.019429] bfq_exit_queue+0x128/0x178
[ 2073.019433] blk_mq_exit_sched+0x12c/0x160
[ 2073.019437] elevator_exit+0xc8/0xd0
[ 2073.019440] blk_exit_queue+0x50/0x88
[ 2073.019443] blk_cleanup_queue+0x228/0x3d8
[ 2073.019451] null_del_dev+0xfc/0x1e0 [null_blk]
[ 2073.019459] null_exit+0x90/0x114 [null_blk]
[ 2073.019462] __arm64_sys_delete_module+0x358/0x5a0
[ 2073.019467] el0_svc_common+0xc8/0x320
[ 2073.019471] el0_svc_handler+0xf8/0x160
[ 2073.019474] el0_svc+0x10/0x218
[ 2073.019475]
[ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
which belongs to the cache kmalloc-1024 of size 1024
[ 2073.019484] The buggy address is located 552 bytes inside of
1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
[ 2073.019486] The buggy address belongs to the page:
[ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
[ 2073.020123] flags: 0x7ffff0000008100(slab|head)
[ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
[ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
[ 2073.020411] page dumped because: kasan: bad access detected
[ 2073.020412]
[ 2073.020414] Memory state around the buggy address:
[ 2073.020420] ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020424] ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020430] ^
[ 2073.020434] ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020438] ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020439] ==================================================================

The same problem exist in mainline as well.

This is because oom_bfqq is moved to a non-root group, thus root_group
is freed earlier.

Thus fix the problem by don't move oom_bfqq.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c5e4cb0f 28-Jan-2022 Yu Kuai <yukuai3@huawei.com>

block, bfq: avoid moving bfqq to it's parent bfqg

Moving bfqq to it's parent bfqg is pointless.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2fc428f6 17-Oct-2021 Zheng Liang <zhengliang6@huawei.com>

block, bfq: fix UAF problem in bfqg_stats_init()

In bfq_pd_alloc(), the function bfqg_stats_init() init bfqg. If
blkg_rwstat_init() init bfqg_stats->bytes successful and init
bfqg_stats->ios failed, bfqg_stats_init() return failed, bfqg will
be freed. But blkg_rwstat->cpu_cnt is not deleted from the list of
percpu_counters. If we traverse the list of percpu_counters, It will
have UAF problem.

we should use blkg_rwstat_exit() to cleanup bfqg_stats bytes in the
above scenario.

Fixes: commit fd41e60331b ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211018024225.1493938-1-zhengliang6@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2e9bc346 20-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move elevator.h to block/

Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d29bd414 15-Oct-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: reset last_bfqq_created on group change

Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d616 ("block, bfq: merge bursts
of newly-created queues").

Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].

Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c9b ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e17
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c9b ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503

Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 430a67f9 04-Mar-2021 Paolo Valente <paolo.valente@linaro.org>

block, bfq: merge bursts of newly-created queues

Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.

To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, basing on the following two facts.

First, all the I/O flows of a the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency. In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.

Second, the above fact holds regardless of the number of flows.

Putting these two facts together, this commits merges stably the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these IO/ flows, regardless of how many the involved
processes are.

To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1

Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40%, with
a quadcore. Since BFQ's execution time amounts to ~50% of the total
per-request processing time, the above throughput boost implies that
BFQ's overhead is reduced by more than 50%.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2de791ab 11-Aug-2020 Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>

bfq: fix blkio cgroup leakage v4

Changes from v1:
- update commit description with proper ref-accounting justification

commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
introduce leak forbfq_group and blkcg_gq objects because of get/put
imbalance.
In fact whole idea of original commit is wrong because bfq_group entity
can not dissapear under us because it is referenced by child bfq_queue's
entities from here:
-> bfq_init_entity()
->bfqg_and_blkg_get(bfqg);
->entity->parent = bfqg->my_entity

-> bfq_put_queue(bfqq)
FINAL_PUT
->bfqg_and_blkg_put(bfqq_group(bfqq))
->kmem_cache_free(bfq_pool, bfqq);

So parent entity can not disappear while child entity is in tree,
and child entities already has proper protection.
This patch revert commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")

bfq_group leak trace caused by bad commit:
-> blkg_alloc
-> bfq_pq_alloc
-> bfqg_get (+1)
->bfq_activate_bfqq
->bfq_activate_requeue_entity
-> __bfq_activate_entity
->bfq_get_entity
->bfqg_and_blkg_get (+1) <==== : Note1
->bfq_del_bfqq_busy
->bfq_deactivate_entity+0x53/0xc0 [bfq]
->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
-> bfq_forget_entity(is_in_service = true)
entity->on_st_or_in_serv = false <=== :Note2
if (is_in_service)
return; ==> do not touch reference
-> blkcg_css_offline
-> blkcg_destroy_blkgs
-> blkg_destroy
-> bfq_pd_offline
-> __bfq_deactivate_entity
if (!entity->on_st_or_in_serv) /* true, because (Note2)
return false;
-> bfq_pd_free
-> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq will leak forever, see test-case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
mkdir -p /sys/fs/cgroup/blkio/a
echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
rmdir /sys/fs/cgroup/blkio/a
grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d38a87f 21-Mar-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: invoke flush_idle_tree after reparent_active_queues in pd_offline

In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
flush the rb tree that contains all idle entities belonging to the pd
(cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
invoked before bfq_reparent_active_queues(). Yet the latter may happen
to add some entities to the idle tree. It happens if, in some of the
calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
the queue to move is empty and gets expired.

This commit simply reverses the invocation order between
bfq_flush_idle_tree() and bfq_reparent_active_queues().

Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 576682fa 21-Mar-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: make reparent_leaf_entity actually work only on leaf entities

bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
entity represents just a bfq_queue in an entity tree). Yet, the input
entity is guaranteed to always be a leaf entity only in two-level
entity trees. In this respect, because of the error fixed by
commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer
in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
to actually have only two levels. After the latter commit, this does not
hold any longer.

This commit fixes this problem by modifying
bfq_reparent_leaf_entity(), so that it searches an active leaf entity
down the path that stems from the input entity. Such a leaf entity is
guaranteed to exist when bfq_reparent_leaf_entity() is invoked.

Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c8997736 21-Mar-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroup

A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
goal of this put is to release a process reference to a bfq_queue. But
process-reference releases may trigger also some extra operation, and,
to this goal, are handled through bfq_release_process_ref(). So, turn
the invocation of bfq_put_queue() into an invocation of
bfq_release_process_ref().

Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fd1bb3ae 21-Mar-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: move forward the getting of an extra ref in bfq_bfqq_move

Commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue
from being freed during a group move") gets an extra reference to a
bfq_queue before possibly deactivating it (temporarily), in
bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
being reactivated in its new group.

Yet, the bfq_queue may also be expired (i.e., its service may be
stopped) before the bfq_queue is deactivated. And also an expiration
may lead to a premature freeing. This commit fixes this issue by
simply moving forward the getting of the extra reference already
introduced by commit ecedd3d7e199 ("block, bfq: get extra ref to
prevent a queue from being freed during a group move").

Reported-by: cki-project@redhat.com
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 14afc593 06-Mar-2020 Carlo Nonato <carlo.nonato95@gmail.com>

block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()

The bfq_find_set_group() function takes as input a blkcg (which represents
a cgroup) and retrieves the corresponding bfq_group, then it updates the
bfq internal group hierarchy (see comments inside the function for why
this is needed) and finally it returns the bfq_group.
In the hierarchy update cycle, the pointer holding the correct bfq_group
that has to be returned is mistakenly used to traverse the hierarchy
bottom to top, meaning that in each iteration it gets overwritten with the
parent of the current group. Since the update cycle stops at root's
children (depth = 2), the overwrite becomes a problem only if the blkcg
describes a cgroup at a hierarchy level deeper than that (depth > 2). In
this case the root's child that happens to be also an ancestor of the
correct bfq_group is returned. The main consequence is that processes
contained in a cgroup at depth greater than 2 are wrongly placed in the
group described above by BFQ.

This commits fixes this problem by using a different bfq_group pointer in
the update cycle in order to avoid the overwrite of the variable holding
the original group reference.

Reported-by: Kwon Je Oh <kwonje.oh2@gmail.com>
Signed-off-by: Carlo Nonato <carlo.nonato95@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# db37a34c 03-Feb-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: get a ref to a group when adding it to a service tree

BFQ schedules generic entities, which may represent either bfq_queues
or groups of bfq_queues. When an entity is inserted into a service
tree, a reference must be taken, to make sure that the entity does not
disappear while still referred in the tree. Unfortunately, such a
reference is mistakenly taken only if the entity represents a
bfq_queue. This commit takes a reference also in case the entity
represents a group.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Chris Evich <cevich@redhat.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4d8340d0 03-Feb-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: remove ifdefs from around gets/puts of bfq groups

ifdefs around gets and puts of bfq groups reduce readability, remove them.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 33a16a98 03-Feb-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: extend incomplete name of field on_st

The flag on_st in the bfq_entity data structure is true if the entity
is on a service tree or is in service. Yet the name of the field,
confusingly, does not mention the second, very important case. Extend
the name to mention the second case too.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ecedd3d7 03-Feb-2020 Paolo Valente <paolo.valente@linaro.org>

block, bfq: get extra ref to prevent a queue from being freed during a group move

In bfq_bfqq_move(), the bfq_queue, say Q, to be moved to a new group
may happen to be deactivated in the scheduling data structures of the
source group (and then activated in the destination group). If Q is
referred only by the data structures in the source group when the
deactivation happens, then Q is freed upon the deactivation.

This commit addresses this issue by getting an extra reference before
the possible deactivation, and releasing this extra reference after Q
has been moved.

Tested-by: Chris Evich <cevich@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 08802ed6 05-Dec-2019 Hou Tao <houtao1@huawei.com>

bfq-iosched: Ensure bio->bi_blkg is valid before using it

bio->bi_blkg will be NULL when the issue of the request
has bypassed the block layer as shown in the following oops:

Internal error: Oops: 96000005 [#1] SMP
CPU: 17 PID: 2996 Comm: scsi_id Not tainted 5.4.0 #4
Call trace:
percpu_counter_add_batch+0x38/0x4c8
bfqg_stats_update_legacy_io+0x9c/0x280
bfq_insert_requests+0xbac/0x2190
blk_mq_sched_insert_request+0x288/0x670
blk_execute_rq_nowait+0x140/0x178
blk_execute_rq+0x8c/0x140
sg_io+0x604/0x9c0
scsi_cmd_ioctl+0xe38/0x10a8
scsi_cmd_blk_ioctl+0xac/0xe8
sd_ioctl+0xe4/0x238
blkdev_ioctl+0x590/0x20e0
block_ioctl+0x60/0x98
do_vfs_ioctl+0xe0/0x1b58
ksys_ioctl+0x80/0xd8
__arm64_sys_ioctl+0x40/0x78
el0_svc_handler+0xc4/0x270

so ensure its validity before using it.

Fixes: fd41e60331b1 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fd41e603 07-Nov-2019 Tejun Heo <tj@kernel.org>

bfq-iosched: stop using blkg->stat_bytes and ->stat_ios

When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
from blk-cgroup core to populate six stat knobs. blk-cgroup core is
moving away from blkg_rwstat to improve scalability and won't be able
to support this usage.

It isn't like the sharing gains all that much. Let's break it out to
dedicated rwstat counters which are updated when on cgroup1. This
makes use of bfqg_*rwstat*() helpers outside of
CONFIG_BFQ_CGROUP_DEBUG. Move them out.

v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a557f1c7 07-Nov-2019 Tejun Heo <tj@kernel.org>

bfq-iosched: relocate bfqg_*rwstat*() helpers

Collect them right under #ifdef CONFIG_BFQ_CGROUP_DEBUG. The next
patch will use them from !DEBUG path and this makes it easy to move
them out of the ifdef block.

This is pure code reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 795fe54c 27-Aug-2019 Fam Zheng <zhengfeiran@bytedance.com>

bfq: Add per-device weight

This adds to BFQ the missing per-device weight interfaces:
blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
implementation pretty closely resembles what we had in CFQ and the parsing code
is basically reused.

Tests
=====

Using two cgroups and three block devices, having weights setup as:

Cgroup test1 test2
============================================
default 100 500
sda 500 100
sdb default default
sdc 200 200

cgroup v1 runs
--------------

sda.test1.out: READ: bw=913MiB/s
sda.test2.out: READ: bw=183MiB/s

sdb.test1.out: READ: bw=213MiB/s
sdb.test2.out: READ: bw=1054MiB/s

sdc.test1.out: READ: bw=650MiB/s
sdc.test2.out: READ: bw=650MiB/s

cgroup v2 runs
--------------

sda.test1.out: READ: bw=915MiB/s
sda.test2.out: READ: bw=184MiB/s

sdb.test1.out: READ: bw=216MiB/s
sdb.test2.out: READ: bw=1069MiB/s

sdc.test1.out: READ: bw=621MiB/s
sdc.test2.out: READ: bw=622MiB/s

Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5ff047e3 27-Aug-2019 Fam Zheng <zhengfeiran@bytedance.com>

bfq: Extract bfq_group_set_weight from bfq_io_set_weight_legacy

This function will be useful when we update weight from the soon-coming
per-device interface.

Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cf09a8ee 28-Aug-2019 Tejun Heo <tj@kernel.org>

blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()

Instead of @node, pass in @q and @blkcg so that the alloc function has
more context. This doesn't cause any behavior change and will be used
by io.weight implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8060c47b 05-Jun-2019 Christoph Hellwig <hch@lst.de>

block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUG

This option is entirely bfq specific, give it an appropinquate name.

Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
the functionality already does so anyway.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d6258980 05-Jun-2019 Christoph Hellwig <hch@lst.de>

bfq-iosched: move bfq_stat_recursive_sum into the only caller

This function was moved from core block code and is way to generic.
Fold it into the only caller and simplify it based on the actually
passed arguments.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c0ce79dc 05-Jun-2019 Christoph Hellwig <hch@lst.de>

blk-cgroup: move struct blkg_stat to bfq

This structure and assorted infrastructure is only used by the bfq I/O
scheduler. Move it there instead of bloating the common code.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7af6fd91 05-Jun-2019 Christoph Hellwig <hch@lst.de>

blk-cgroup: introduce a new struct blkg_rwstat_sample

When sampling the blkcg counts we don't need atomics or per-cpu
variables. Introduce a new structure just containing plain u64
counters.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5d0b6e48 05-Jun-2019 Christoph Hellwig <hch@lst.de>

blk-cgroup: pass blkg_rwstat structures by reference

Returning a structure generates rather bad code, so switch to passing
by reference. Also don't require the structure to be zeroed and add
to the 0-initialized counters, but actually set the counters to the
calculated value.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cf892988 10-Jun-2019 Jens Axboe <axboe@kernel.dk>

cgroup/bfq: revert bfq.weight symlink change

There's some discussion on how to do this the best, and Tejun prefers
that BFQ just create the file itself instead of having cgroups support
a symlink feature.

Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this
can be done properly for 5.3.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 19e9da9e 21-May-2019 Angelo Ruocco <angeloruocco90@gmail.com>

block, bfq: add weight symlink to the bfq.weight cgroup parameter

Many userspace tools and services use the proportional-share policy of
the blkio/io cgroups controller. The CFQ I/O scheduler implemented
this policy for the legacy block layer. To modify the weight of a
group in case CFQ was in charge, the 'weight' parameter of the group
must be modified. On the other hand, the BFQ I/O scheduler implements
the same policy in blk-mq, but, with BFQ, the parameter to modify has
a different name: bfq.weight (forced choice until legacy block was
present, because two different policies cannot share a common parameter
in cgroups).

Due to CFQ legacy, most if not all userspace configurations still use
the parameter 'weight', and for the moment do not seem likely to be
changed. But, when CFQ went away with legacy block, such a parameter
ceased to exist.

So, a simple workaround has been proposed [1] to make all
configurations work: add a symlink, named weight, to bfq.weight. This
commit adds such a symlink.

[1] https://lkml.org/lkml/2019/4/8/555

Suggested-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a497ee34 30-Apr-2019 Christoph Hellwig <hch@lst.de>

block: switch all files cleared marked as GPLv2 or later to SPDX tags

All these files have some form of the usual GPLv2 or later boilerplate.
Switch them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 636b8fe8 08-Apr-2019 Angelo Ruocco <angelo.ruocco.90@gmail.com>

block, bfq: fix some typos in comments

Some of the comments in the bfq files had typos. This patch fixes them.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8cacc5ab 12-Mar-2019 Paolo Valente <paolo.valente@linaro.org>

block, bfq: do not merge queues on flash storage with queueing

To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time.

Request processing time becomes more and more critical as the speed of
the underlying storage device grows. Yet, fortunately, queue merging
is basically useless on the very devices that are so fast to make
request processing time critical. To reach a high throughput, these
devices must have many requests queued at the same time. But, in this
configuration, the internal scheduling algorithms of these devices do
also the job of queue merging: they reorder requests so as to obtain
as much as possible a sequential I/O pattern. As a consequence, with
processes doing interleaved I/O, the throughput reached by one such
device is likely to be the same, with and without queue merging.

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing
time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(time measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we measured again the throughput in the nasty scenario
considered in previous commit messages: dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes. With
this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
of the other I/O schedulers. As such, this is also likely to be the
maximum possible throughput reachable with this workload on this
device, because I/O is mostly random, and the other schedulers
basically just pass I/O requests to the drive as fast as possible.

[1] https://github.com/Algodev-github/S

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Alessio Masola <alessio.masola@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0fe061b9 04-Dec-2018 Dennis Zhou <dennis@kernel.org>

blkcg: fix ref count issue with bio_blkcg() using task_css

The bio_blkcg() function turns out to be inconsistent and consequently
dangerous to use. The first part returns a blkcg where a reference is
owned by the bio meaning it does not need to be rcu protected. However,
the third case, the last line, is problematic:

return css_to_blkcg(task_css(current, io_cgrp_id));

This can race against task migration and the cgroup dying. It is also
semantically different as it must be called rcu protected and is
susceptible to failure when trying to get a reference to it.

This patch adds association ahead of calling bio_blkcg() rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg(). In blk-iolatency, association is
moved above the bio_blkcg() call to ensure it will not return %NULL.

BFQ uses the old bio_blkcg() function, but I do not want to address it
in this series due to the complexity. I have created a private version
documenting the inconsistency and noting not to use it.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0d945c1f 15-Nov-2018 Christoph Hellwig <hch@lst.de>

block: remove the queue_lock indirection

With the legacy request path gone there is no good reason to keep
queue_lock as a pointer, we can always use the embedded lock now.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Fixed floppy and blk-cgroup missing conversions and half done edits.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b5f2954d 01-Nov-2018 Dennis Zhou <dennis@kernel.org>

blkcg: revert blkcg cleanups series

This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.

The original series can be found in [2] with a follow up series in [3].

[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

This reverts the following commits:
d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27e6fa99 11-Sep-2018 Dennis Zhou (Facebook) <dennisszhou@gmail.com>

blkcg: fix ref count issue with bio_blkcg using task_css

The accessor function bio_blkcg either returns the blkcg associated with
the bio or finds one in the current context. This can cause an issue
when trying to associate a bio with a blkcg. Particularly, it's the
third case that is problematic:

return css_to_blkcg(task_css(current, io_cgrp_id));

As the above may race against task migration and the cgroup exiting, it
is not always ok to take a reference on the blkcg returned from
bio_blkcg.

This patch adds association ahead of calling bio_blkcg rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg. blk_get_rl is modified as well to get
a reference to the blkcg it may use and blk_put_rl will always put the
reference back. Association is also moved above the bio_blkcg call to
ensure it will not return NULL in blk-iolatency.

BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
to address this in this series. I've created a private version of the
function with notes not to use it describing the flaw. Hopefully soon,
that code can be cleaned up.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d5274b3c 06-Sep-2018 Konstantin Khlebnikov <koct9i@gmail.com>

block: bfq: swap puts in bfqg_and_blkg_put

Fix trivial use-after-free. This could be last reference to bfqg.

Fixes: 8f9bebc33dd7 ("block, bfq: access and cache blkg data only when safe")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fc8ebd01 15-Aug-2018 Maciej S. Szmigiero <mail@maciej.szmigiero.name>

block, bfq: return nbytes and not zero from struct cftype .write() method

The value that struct cftype .write() method returns is then directly
returned to userspace as the value returned by write() syscall, so it
should be the number of bytes actually written (or consumed) and not zero.

Returning zero from write() syscall makes programs like /bin/echo or bash
spin.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 84c7afce 09-May-2018 Omar Sandoval <osandov@fb.com>

block: use ktime_get_ns() instead of sched_clock() for cfq and bfq

cfq and bfq have some internal fields that use sched_clock() which can
trivially use ktime_get_ns() instead. Their timestamp fields in struct
request can also use ktime_get_ns(), which resolves the 8 year old
comment added by commit 28f4197e5d47 ("block: disable preemption before
using sched_clock()").

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 52257ffb 09-Jan-2018 Paolo Valente <paolo.valente@linaro.org>

block, bfq: put async queues for root bfq groups too

For each pair [device for which bfq is selected as I/O scheduler,
group in blkio/io], bfq maintains a corresponding bfq group. Each such
bfq group contains a set of async queues, with each async queue
created on demand, i.e., when some I/O request arrives for it. On
creation, an async queue gets an extra reference, to make sure that
the queue is not freed as long as its bfq group exists. Accordingly,
to allow the queue to be freed after the group exited, this extra
reference must released on group exit.

The above holds also for a bfq root group, i.e., for the bfq group
corresponding to the root blkio/io root for a given device. Yet, by
mistake, the references to the existing async queues of a root group
are not released when the latter exits. This causes a memory leak when
the instance of bfq for a given device exits. In a similar vein,
bfqg_stats_xfer_dead is not executed for a root group.

This commit fixes bfq_pd_offline so that the latter executes the above
missing operations for a root group too.

Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reported-by: Guoqing Jiang <gqjiang@suse.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Davide Ferrari <davideferrari8@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a33801e8 12-Nov-2017 Luca Miccio <lucmiccio@gmail.com>

block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP

BFQ currently creates, and updates, its own instance of the whole
set of blkio statistics that cfq creates. Yet, from the comments
of Tejun Heo in [1], it turned out that most of these statistics
are meant/useful only for debugging. This commit makes BFQ create
the latter, debugging statistics only if the option
CONFIG_DEBUG_BLK_CGROUP is set.

By doing so, this commit also enables BFQ to enjoy a high perfomance
boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
BFQ has to update far fewer statistics, and, in particular, not the
heaviest to update. To give an idea of the benefits, if
CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
with 8 threads doing random I/O in parallel on null_blk (configured
with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
(+30%). We have measured similar or even much higher boosts with other
CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
been obtained and can be reproduced very easily with the script in [1].

[1] https://www.spinics.net/lists/linux-block/msg18943.html

Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dfb79af5 30-Aug-2017 Bart Van Assche <bvanassche@acm.org>

bfq: Declare local functions static

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8f9bebc3 05-Jun-2017 Paolo Valente <paolo.valente@linaro.org>

block, bfq: access and cache blkg data only when safe

In blk-cgroup, operations on blkg objects are protected with the
request_queue lock. This is no more the lock that protects
I/O-scheduler operations in blk-mq. In fact, the latter are now
protected with a finer-grained per-scheduler-instance lock. As a
consequence, although blkg lookups are also rcu-protected, blk-mq I/O
schedulers may see inconsistent data when they access blkg and
blkg-related objects. BFQ does access these objects, and does incur
this problem, in the following case.

The blkg_lookup performed in bfq_get_queue, being protected (only)
through rcu, may happen to return the address of a copy of the
original blkg. If this is the case, then the blkg_get performed in
bfq_get_queue, to pin down the blkg, is useless: it does not prevent
blk-cgroup code from destroying both the original blkg and all objects
directly or indirectly referred by the copy of the blkg. BFQ accesses
these objects, which typically causes a crash for NULL-pointer
dereference of memory-protection violation.

Some additional protection mechanism should be added to blk-cgroup to
address this issue. In the meantime, this commit provides a quick
temporary fix for BFQ: cache (when safe) blkg data that might
disappear right after a blkg_lookup.

In particular, this commit exploits the following facts to achieve its
goal without introducing further locks. Destroy operations on a blkg
invoke, as a first step, hooks of the scheduler associated with the
blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed, and that all the pointers it contains are consistent,
while that instance is holding its bfqd->lock. A blkg_lookup performed
with bfqd->lock held then returns a fully consistent blkg, which
remains consistent until this lock is held. In more detail, this holds
even if the returned blkg is a copy of the original one.

Finally, also the object describing a group inside BFQ needs to be
protected from destruction on the blkg_free of the original blkg
(which invokes bfq_pd_free). This commit adds private refcounting for
this object, to let it disappear only after no bfq_queue refers to it
any longer.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.

Reported-by: Tomas Konir <tomas.konir@gmail.com>
Reported-by: Lee Tibbert <lee.tibbert@gmail.com>
Reported-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Tomas Konir <tomas.konir@gmail.com>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ea25da48 19-Apr-2017 Paolo Valente <paolo.valente@linaro.org>

block, bfq: split bfq-iosched.c into multiple source files

The BFQ I/O scheduler features an optimal fair-queuing
(proportional-share) scheduling algorithm, enriched with several
mechanisms to boost throughput and reduce latency for interactive and
real-time applications. This makes BFQ a large and complex piece of
code. This commit addresses this issue by splitting BFQ into three
main, independent components, and by moving each component into a
separate source file:
1. Main algorithm: handles the interaction with the kernel, and
decides which requests to dispatch; it uses the following two further
components to achieve its goals.
2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
computes the schedule, using weights and budgets provided by the above
component.
3. cgroups support: handles group operations (creation, destruction,
move, ...).

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>