History log of /linux-master/drivers/block/brd.c
Revision Date Author Comments
# b5baaba4 15-Feb-2024 Christoph Hellwig <hch@lst.de>

brd: pass queue_limits to blk_mq_alloc_disk

Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 74fa8f9c 15-Feb-2024 Christoph Hellwig <hch@lst.de>

block: pass a queue_limits argument to blk_alloc_disk

Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
which can't distinguish errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6dd4423f 14-Jun-2023 Pankaj Raghav <p.raghav@samsung.com>

brd: use cond_resched instead of cond_resched_rcu

The body of the loop is run without RCU lock held. Use the regular
cond_resched() instead of cond_resched_rcu().

Fixes: 786bb0245881 ("brd: use XArray instead of radix-tree to index backing pages")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 786bb024 11-May-2023 Pankaj Raghav <p.raghav@samsung.com>

brd: use XArray instead of radix-tree to index backing pages

XArray was introduced to hold large array of pointers with a simple API.
XArray API also provides array semantics which simplifies the way we store
and access the backing pages, and the code becomes significantly easier
to understand.

No performance difference was noticed between the two implementation
using fio with direct=1 [1].

[1] Performance in KIOPS:

| radix-tree | XArray | Diff
| | |
write | 315 | 313 | -0.6%
randwrite | 286 | 290 | +1.3%
read | 330 | 335 | +1.5%
randread | 309 | 312 | +0.9%

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230511121544.111648-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f89ac58 24-Apr-2023 Chaitanya Kulkarni <kch@nvidia.com>

block/drivers: remove dead clear of random flag

QUEUE_FLAG_ADD_RANDOM is not set before we clear it for "null_blk",
"brd", "nbd", "zram", and "bcache" since by default we don't set
"QUEUE_FLAG_ADD_RANDOM" to MQ ops.

Remove dead clear of QUEUE_FLAG_ADD_RANDOM in above listed drivers.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram
Link: https://lore.kernel.org/r/20230424234628.45544-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3222d8c2 25-Jan-2023 Christoph Hellwig <hch@lst.de>

block: remove ->rw_page

The ->rw_page method is a special purpose bypass of the usual bio handling
path that is limited to single-page reads and writes and synchronous which
causes a lot of extra code in the drivers, callers and the block layer.

The only remaining user is the MM swap code. Switch that swap code to
simply submit a single-vec on-stack bio an synchronously wait on it based
on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that
currently implement ->rw_page instead. While this touches one extra cache
line and executes extra code, it simplifies the block layer and drivers
and ensures that all feastures are properly supported by all drivers, e.g.
right now ->rw_page bypassed cgroup writeback entirely.

[akpm@linux-foundation.org: fix comment typo, per Dan]
Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0aa2988e 17-Feb-2023 Pankaj Raghav <p.raghav@samsung.com>

brd: use radix_tree_maybe_preload instead of radix_tree_preload

Unconditionally calling radix_tree_preload_end() results in a OOPS
message as the preload is only conditionally called for
gfpflags_allow_blocking().

[ 20.267323] BUG: using smp_processor_id() in preemptible [00000000] code: fio/416
[ 20.267837] caller is brd_insert_page.part.0+0xbe/0x190 [brd]
[ 20.269436] Call Trace:
[ 20.269598] <TASK>
[ 20.269742] dump_stack_lvl+0x32/0x50
[ 20.269982] check_preemption_disabled+0xd1/0xe0
[ 20.270289] brd_insert_page.part.0+0xbe/0x190 [brd]
[ 20.270664] brd_submit_bio+0x33f/0xf40 [brd]

Use radix_tree_maybe_preload() which does preload only if
gfpflags_allow_blocking() is true but also takes the lock. Therefore,
unconditionally calling radix_tree_preload_end() should not create any
issues and the message disappears.

Fixes: 6ded703c56c2 ("brd: check for REQ_NOWAIT and set correct page allocation mask")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230217121442.33914-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 67205f80 15-Feb-2023 Jens Axboe <axboe@kernel.dk>

brd: mark as nowait compatible

By default, non-mq drivers do not support nowait. This causes io_uring
to use a slower path as the driver cannot be trust not to block. brd
can safely set the nowait flag, as worst case all it does is a NOIO
allocation.

For io_uring, this makes a substantial difference. Before:

submitter=0, tid=453, file=/dev/ram0, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=440.03K, BW=1718MiB/s, IOS/call=32/31
IOPS=428.96K, BW=1675MiB/s, IOS/call=32/32
IOPS=442.59K, BW=1728MiB/s, IOS/call=32/31
IOPS=419.65K, BW=1639MiB/s, IOS/call=32/32
IOPS=426.82K, BW=1667MiB/s, IOS/call=32/31

and after:

submitter=0, tid=354, file=/dev/ram0, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=3.37M, BW=13.15GiB/s, IOS/call=32/31
IOPS=3.45M, BW=13.46GiB/s, IOS/call=32/31
IOPS=3.43M, BW=13.42GiB/s, IOS/call=32/32
IOPS=3.43M, BW=13.39GiB/s, IOS/call=32/31
IOPS=3.43M, BW=13.38GiB/s, IOS/call=32/31

or about an 8x in difference. Now that brd is prepared to deal with
REQ_NOWAIT reads/writes, mark it as supporting that.

Cc: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/linux-block/20230203103005.31290-1-p.raghav@samsung.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6ded703c 16-Feb-2023 Jens Axboe <axboe@kernel.dk>

brd: check for REQ_NOWAIT and set correct page allocation mask

If REQ_NOWAIT is set, then do a non-blocking allocation if the operation
is a write and we need to insert a new page. Currently REQ_NOWAIT cannot
be set as the queue isn't marked as supporting nowait, this change is in
preparation for allowing that.

radix_tree_preload() warns on attempting to call it with an allocation
mask that doesn't allow blocking. While that warning could arguably
be removed, we need to handle radix insertion failures anyway as they
are more likely if we cannot block to get memory.

Remove legacy BUG_ON()'s and turn them into proper errors instead, one
for the allocation failure and one for finding a page that doesn't
match the correct index.

Cc: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# db0ccc44 16-Feb-2023 Jens Axboe <axboe@kernel.dk>

brd: return 0/-error from brd_insert_page()

It currently returns a page, but callers just check for NULL/page to
gauge success. Clean this up and return the appropriate error directly
instead.

Cc: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e55e1b48 18-Aug-2022 Wolfram Sang <wsa+renesas@sang-engineering.com>

block: move from strlcpy with unused retval to strscpy

Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Acked-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/20220818205958.6552-1-wsa+renesas@sang-engineering.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ba91fd01 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

block/brd: Use the enum req_op type

Improve static type checking by using the enum req_op type for a
function argument that represents a request operation.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-13-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 86947df3 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

block: Change the type of the last .rw_page() argument

All .rw_page() callers pass an enum req_op value as last argument. Make
this explicit by changing the type of the last argument into enum req_op.
See also commit 3f289dcb4b26 ("block: make bdev_ops->rw_page() take a
REQ_OP instead of bool").

Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8b9ab626 19-Jun-2022 Christoph Hellwig <hch@lst.de>

block: remove blk_cleanup_disk

blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 00358933 06-Jan-2022 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

brd: remove brd_devices_mutex mutex

If brd_alloc() from brd_probe() is called before brd_alloc() from
brd_init() is called, module loading will fail with -EEXIST error.
To close this race, call __register_blkdev() just before leaving
brd_init().

Then, we can remove brd_devices_mutex mutex, for brd_device list
will no longer be accessed concurrently.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/6b074af7-c165-4fab-b7da-8270a4f6f6cd@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1ebe2e5f 22-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove GENHD_FL_EXT_DEVT

All modern drivers can support extra partitions using the extended
dev_t. In fact except for the ioctl method drivers never even see
partitions in normal operation.

So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e1528830 15-Oct-2021 Luis Chamberlain <mcgrof@kernel.org>

block/brd: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015235219.2191207-2-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3e08773c 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f7bf3586 07-Sep-2021 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

brd: reduce the brd_devices_mutex scope

As with commit 8b52d8be86d72308 ("loop: reorder loop_exit"),
unregister_blkdev() needs to be called first in order to avoid calling
brd_alloc() from brd_probe() after brd_del_one() from brd_exit(). Then,
we can avoid holding global mutex during add_disk()/del_gendisk() as with
commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope").

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e205f13d-18ff-a49c-0988-7de6ea5ff823@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 018eca45 20-Jul-2021 Guoqing Jiang <jiangguoqing@kylinos.cn>

block: move some macros to blkdev.h

Move them (PAGE_SECTORS_SHIFT, PAGE_SECTORS and SECTOR_MASK) to the
generic header file to remove redundancy.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Link: https://lore.kernel.org/r/20210721025315.1729118-1-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7f9b348c 20-May-2021 Christoph Hellwig <hch@lst.de>

brd: convert to blk_alloc_disk/blk_cleanup_disk

Convert the brd driver to use the blk_alloc_disk and blk_cleanup_disk
helpers to simplify gendisk and request_queue allocation. This also
allows to remove the request_queue pointer in struct request_queue,
and to simplify the initialization as blk_cleanup_disk can be called
on any disk returned from blk_alloc_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ee60ec1 06-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

include: remove pagemap.h from blkdev.h

My UEK-derived config has 1030 files depending on pagemap.h before this
change. Afterwards, just 326 files need to be rebuilt when I touch
pagemap.h. I think blkdev.h is probably included too widely, but
untangling that dependency is harder and this solves my problem. x86
allmodconfig builds, but there may be implicit include problems on other
architectures.

Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm]
Acked-by: Jens Axboe <axboe@kernel.dk> [block]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi]
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4be591f 16-Apr-2021 Calvin Owens <calvinowens@fb.com>

brd: expose number of allocated pages in debugfs

While the maximum size of each ramdisk is defined either as a module
parameter, or compile time default, it's impossible to know how many pages
have currently been allocated by each ram%d device, since they're
allocated when used and never freed.

This patch creates a new directory at this location:

/sys/kernel/debug/ramdisk_pages/

which will contain a file named "ram%d" for each instantiated ramdisk on
the system. The file is read-only, and read() will output the number of
pages currently held by that ramdisk.

We lose track how much memory a ramdisk is using as pages once used are
simply recycled but never freed.

In instances where we exhaust the size of the ramdisk with a file that
exceeds it, encounter ENOSPC and delete the file for mitigation; df would
show decrease in used and increase in available blocks but the since we
have touched all pages, the memory footprint of the ramdisk does not
reflect the blocks used/available count

...
[root@localhost ~]# mkfs.ext2 /dev/ram15
mke2fs 1.45.6 (20-Mar-2020)
Creating filesystem with 4096 1k blocks and 1024 inodes
[root@localhost ~]# mount /dev/ram15 /mnt/ram15/

[root@localhost ~]# cat
/sys/kernel/debug/ramdisk_pages/ram15
58
[root@kerneltest008.06.prn3 ~]# df /dev/ram15
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ram15 3963 31 3728 1% /mnt/ram15
[root@kerneltest008.06.prn3 ~]# dd if=/dev/urandom of=/mnt/ram15/test2
bs=1M count=5
dd: error writing '/mnt/ram15/test2': No space left on device
4+0 records in
3+0 records out
4005888 bytes (4.0 MB, 3.8 MiB) copied, 0.0446614 s, 89.7 MB/s
[root@kerneltest008.06.prn3 ~]# df /mnt/ram15/
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ram15 3963 3960 0 100% /mnt/ram15
[root@kerneltest008.06.prn3 ~]# cat
/sys/kernel/debug/ramdisk_pages/ram15
1024
[root@kerneltest008.06.prn3 ~]# rm /mnt/ram15/test2
rm: remove regular file '/mnt/ram15/test2'? y
[root@kerneltest008.06.prn3 /var]# df /dev/ram15
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ram15 3963 31 3728 1% /mnt/ram15

# Acutal memory footprint
[root@kerneltest008.06.prn3 /var]# cat
/sys/kernel/debug/ramdisk_pages/ram15
1024
...

This debugfs counter will always reveal the accurate number of
permanently allocated pages to the ramdisk.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
[cleaned up the !CONFIG_DEBUG_FS case and API changes for HEAD]
Signed-off-by: Kyle McMartin <jkkm@fb.com>
[rebased]
Signed-off-by: Saravanan D <saravanand@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 309dca30 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 74cb8994 24-Jan-2021 Christoph Hellwig <hch@lst.de>

brd: remove the end of device check in brd_do_bvec

The block layer already checks for this conditions in bio_check_eod
before calling the driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7cc178a6 29-Oct-2020 Christoph Hellwig <hch@lst.de>

brd: use __register_blkdev to allocate devices on demand

Use the simpler mechanism attached to major_name to allocate a brd device
when a currently unregistered minor is accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a8b456d0 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: remove BDI_CAP_SYNCHRONOUS_IO

BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decided if ->rw_page can be used on a block device. Just check up for
the method instead. The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those who don't.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c62b37d9 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: move ->make_request_fn to struct block_device_operations

The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3d745ea5 27-Mar-2020 Christoph Hellwig <hch@lst.de>

block: simplify queue allocation

Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper. Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c8ab4225 04-Feb-2020 Zhiqiang Liu <liuzhiqiang26@huawei.com>

brd: check and limit max_part par

In brd_init func, rd_nr num of brd_device are firstly allocated
and add in brd_devices, then brd_devices are traversed to add each
brd_device by calling add_disk func. When allocating brd_device,
the disk->first_minor is set to i * max_part, if rd_nr * max_part
is larger than MINORMASK, two different brd_device may have the same
devt, then only one of them can be successfully added.
when rmmod brd.ko, it will cause oops when calling brd_exit.

Follow those steps:
# modprobe brd rd_nr=3 rd_size=102400 max_part=1048576
# rmmod brd
then, the oops will appear.

Oops log:
[ 726.613722] Call trace:
[ 726.614175] kernfs_find_ns+0x24/0x130
[ 726.614852] kernfs_find_and_get_ns+0x44/0x68
[ 726.615749] sysfs_remove_group+0x38/0xb0
[ 726.616520] blk_trace_remove_sysfs+0x1c/0x28
[ 726.617320] blk_unregister_queue+0x98/0x100
[ 726.618105] del_gendisk+0x144/0x2b8
[ 726.618759] brd_exit+0x68/0x560 [brd]
[ 726.619501] __arm64_sys_delete_module+0x19c/0x2a0
[ 726.620384] el0_svc_common+0x78/0x130
[ 726.621057] el0_svc_handler+0x38/0x78
[ 726.621738] el0_svc+0x8/0xc
[ 726.622259] Code: aa0203f6 aa0103f7 aa1e03e0 d503201f (7940e260)

Here, we add brd_check_and_reset_par func to check and limit max_part par.

--
V5->V6:
- remove useless code

V4->V5:(suggested by Ming Lei)
- make sure max_part is not larger than DISK_MAX_PARTS

V3->V4:(suggested by Ming Lei)
- remove useless change
- add one limit of max_part

V2->V3: (suggested by Ming Lei)
- clear .minors when running out of consecutive minor space in brd_alloc
- remove limit of rd_nr

V1->V2:
- add more checks in brd_check_par_valid as suggested by Ming Lei.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f1acbf21 04-Dec-2019 Ming Lei <ming.lei@redhat.com>

brd: warn on un-aligned buffer

Queue dma alignment limit requires users(fs, target, ...) of block layer
to pass aligned buffer.

So far brd doesn't support un-aligned buffer, even though it is easy
to support it.

However, given brd is often used for debug purpose, and there are other
drivers which can't support un-aligned buffer too.

So add warning so that brd users know what to fix.

Reported-by: Stephen Rust <srust@blockbridge.com>
Cc: Stephen Rust <srust@blockbridge.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 36582a5a 04-Dec-2019 Ming Lei <ming.lei@redhat.com>

brd: remove max_hw_sectors queue limit

Now we depend on blk_queue_split() to respect most of queue limit
(the only one exception could be dma alignment), however
blk_queue_split() isn't used for brd, so this limit isn't respected
since v4.3.

Also max_hw_sectors limit doesn't play a big role for brd, which is
added since brd is added to tree for unknown reason.

So remove it.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 09c434b8 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for more missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 936b33f7 09-May-2019 Mikulas Patocka <mpatocka@redhat.com>

brd: add cond_resched to brd_free_pages

The loop that frees all the pages can take unbounded amount of time, so
add cond_resched() to it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f6b50160 22-Apr-2019 Hou Tao <houtao1@huawei.com>

brd: re-enable __GFP_HIGHMEM in brd_insert_page()

__GFP_HIGHMEM is disabled if dax is enabled on brd, however
dax support for brd has been removed since commit (7a862fbbdec6
"brd: remove dax support"), so restore __GFP_HIGHMEM in
brd_insert_page().

Also remove the no longer applicable comments about DAX and highmem.

Cc: stable@vger.kernel.org
Fixes: 7a862fbbdec6 ("brd: remove dax support")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 153fcd5f 01-Nov-2018 Ming Lei <ming.lei@redhat.com>

block: brd: associate with queue until adding disk

brd_free() may be called in failure path on one brd instance which
disk isn't added yet, so release handler of gendisk may free the
associated request_queue early and causes the following use-after-free[1].

This patch fixes this issue by associating gendisk with request_queue
just before adding disk.

[1] KASAN: use-after-free Read in del_timer_syncNon-volatile memory driver v1.3
Linux agpgart interface v0.103
[drm] Initialized vgem 1.0.0 20120112 for virtual device on minor 0
usbcore: registered new interface driver udl
==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
kernel/locking/lockdep.c:3218
Read of size 8 at addr ffff8801d1b6b540 by task swapper/0/1

CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
del_timer_sync+0xb7/0x270 kernel/time/timer.c:1283
blk_cleanup_queue+0x413/0x710 block/blk-core.c:809
brd_free+0x5d/0x71 drivers/block/brd.c:422
brd_init+0x2eb/0x393 drivers/block/brd.c:518
do_one_initcall+0x145/0x957 init/main.c:890
do_initcall_level init/main.c:958 [inline]
do_initcalls init/main.c:966 [inline]
do_basic_setup init/main.c:984 [inline]
kernel_init_freeable+0x5c6/0x6b9 init/main.c:1148
kernel_init+0x11/0x1ae init/main.c:1068
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:350

Reported-by: syzbot+3701447012fe951dabb2@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f289dcb 18-Jul-2018 Tejun Heo <tj@kernel.org>

block: make bdev_ops->rw_page() take a REQ_OP instead of bool

c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information. This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5657a819 24-May-2018 Joe Perches <joe@perches.com>

block drivers/block: Use octal not symbolic permissions

Convert the S_<FOO> symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.

see: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...>

Miscellanea:

o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 316ba573 03-May-2018 SeongJae Park <sj38.park@gmail.com>

brd: Mark as non-rotational

This commit sets QUEUE_FLAG_NONROT and clears up QUEUE_FLAG_ADD_RANDOM
to mark the ramdisks as non-rotational device.

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 233bde21 14-Mar-2018 Bart Van Assche <bvanassche@acm.org>

block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>

It happens often while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available for this driver? Do I have to introduce definitions of these
constants before I can use these constants? To avoid this confusion,
move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that these become available for all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional to avoid that including that header file after
<linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE
redefinition.

Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.

Cc: David S. Miller <davem@davemloft.net>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3079c22e 26-Feb-2018 Jan Kara <jack@suse.cz>

genhd: Rename get_disk() to get_disk_and_module()

Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 23c47d2a 15-Nov-2017 Minchan Kim <minchan@kernel.org>

bdi: introduce BDI_CAP_SYNCHRONOUS_IO

As discussed at

https://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>

someday we will remove rw_page(). If so, we need something to detect
such super-fast storage on which synchronous IO operations like the
current rw_page are always a win.

Introduces BDI_CAP_SYNCHRONOUS_IO to indicate such devices. With it, we
could use various optimization techniques.

Link: http://lkml.kernel.org/r/1505886205-9671-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7a862fbb 23-Oct-2017 Dan Williams <dan.j.williams@intel.com>

brd: remove dax support

DAX support in brd is awkward because its backing page frames are
distinct from the ones provided by pmem, dcssblk, or axonram. We need
pfn_t_devmap() entries to fully support DAX, and the limited DAX support
for pfn_t_special() page frames is not interesting for brd when pmem is
already a superset of brd. Lastly, brd is the only dax capable driver
that may sleep in its ->direct_access() implementation. So it causes a
global burden with no net gain of kernel functionality.

For all these reasons, remove DAX support.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 15f7b41f 09-Nov-2017 Mikulas Patocka <mpatocka@redhat.com>

brd: remove unused brd_mutex

Remove unused mutex brd_mutex. It is unused since the commit ff26956875c2
("brd: remove support for BLKFLSBUF").

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 02a48436 13-Sep-2017 Mikulas Patocka <mpatocka@redhat.com>

brd: fix overflow in __brd_direct_access

The code in __brd_direct_access multiplies the pgoff variable by page size
and divides it by 512. It can cause overflow on 32-bit architectures. The
overflow happens if we create ramdisk larger than 4G and use it as a
sparse device.

This patch replaces multiplication and division with multiplication by the
number of sectors per page.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 1647b9b959c7 ("brd: add dax_operations support")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 98cc093c 06-Sep-2017 Huang Ying <ying.huang@intel.com>

block, THP: make block_device_operations.rw_page support THP

The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device. To support the THP (Transparent Huge
Page) swap optimization, the .rw_page is enhanced to support to
read/write THP if possible.

Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74d46992 23-Aug-2017 Christoph Hellwig <hch@lst.de>

block: replace bi_bdev with a gendisk pointer and partitions index

This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 287f3ca5 10-Jul-2017 Bart Van Assche <bvanassche@acm.org>

ARM: fix rd_size declaration

The global variable 'rd_size' is declared as 'int' in source file
arch/arm/kernel/atags_parse.c and as 'unsigned long' in
drivers/block/brd.c. Fix this inconsistency.

Additionally, remove the declarations of rd_image_start, rd_prompt and
rd_doload from parse_tag_ramdisk() since these duplicate existing
declarations in <linux/initrd.h>.

Link: http://lkml.kernel.org/r/20170627065024.12347-1-bart.vanassche@wdc.com
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Zhaohongjiang <zhaohongjiang@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d61e43b 27-Jun-2017 Dan Williams <dan.j.williams@intel.com>

dax: remove default copy_from_iter fallback

Require all dax-drivers to register a ->copy_from_iter() operation so
that it is clear which dax_operations are optional and which must be
implemented for filesystem-dax to operate.

Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 0b0bcacc 19-Jun-2017 Christoph Hellwig <hch@lst.de>

block: don't bother with bounce limits for make_request drivers

We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1ef97fe4 03-May-2017 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

brd: fix uninitialized use of brd->dax_dev

commit 1647b9b9 "brd: add dax_operations support" introduced the allocation
and freeing of a dax_device, but the allocated dax_device is not stored
into the brd_device, so brd_del_one() will eventually operate on an
uninitialized brd->dax_dev.

Fix this by storing the allocated dax_device to brd->dax_dev.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# d4b29fd7 27-Jan-2017 Dan Williams <dan.j.williams@intel.com>

block: remove block_device_operations ->direct_access()

Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 1647b9b9 25-Jan-2017 Dan Williams <dan.j.williams@intel.com>

brd: add dax_operations support

Setup a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# f09a06a1 05-Apr-2017 Christoph Hellwig <hch@lst.de>

brd: remove discard support

It's just a in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff269568 25-Oct-2016 Mike Snitzer <snitzer@redhat.com>

brd: remove support for BLKFLSBUF

Discontinue having the brd driver destructively free all pages in the
ramdisk in response to the BLKFLSBUF ioctl. Doing so allows a BLKFLSBUF
ioctl issued to a logical partition to destroy pages of the parent brd
device (and all other partitions of that brd device).

This change breaks compatibility - but in this case the compatibility
breaks more than it helps.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 366f4aea 25-Oct-2016 Jan Kara <jack@suse.cz>

brd: Switch rd_size to unsigned long

Currently rd_size was int which lead to overflow and bogus device size
once the requested ramdisk size was 1 TB or more. Although these days
ramdisks with 1 TB size are mostly a mistake, the days when they are
useful are not far.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c11f0c0b 05-Aug-2016 Jens Axboe <axboe@fb.com>

block/mm: make bdev_ops->rw_page() take a bool for read/write

Commit abf545484d31 changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# abf54548 04-Aug-2016 Mike Christie <mchristi@redhat.com>

mm/block: convert rw_page users to bio op use

The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>


# 163d4baa 23-Jun-2016 Toshi Kani <toshi.kani@hpe.com>

block: add QUEUE_FLAG_DAX for devices to advertise their DAX support

Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device. Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.

In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support. This will allow to set the DAX capability based on how
mapped device is composed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 70246286 19-Jul-2016 Christoph Hellwig <hch@lst.de>

block: get rid of bio_rw and READA

These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7a9eb206 03-Jun-2016 Dan Williams <dan.j.williams@intel.com>

pmem: kill __pmem address space

The __pmem address space was meant to annotate codepaths that touch
persistent memory and need to coordinate a call to wmb_pmem(). Now that
wmb_pmem() is gone, there is little need to keep this annotation.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 95fe6c1a 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, fs, mm, drivers: use bio set/get op accessors

This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op

These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0a70bd43 24-Feb-2016 Dan Williams <dan.j.williams@intel.com>

dax: enable dax in the presence of known media errors (badblocks)

1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e4298be 15-Dec-2015 Bart Van Assche <bvanassche@acm.org>

brd: Fix discard request processing

Avoid that discard requests with size => PAGE_SIZE fail with
-EIO. Refuse discard requests if the discard size is not a
multiple of the page size.

Fixes: 2dbe54957636 ("brd: Refuse improperly aligned discard requests")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliot <elliott@hp.com>
Cc: stable <stable@vger.kernel.org> # v4.4+
Signed-off-by: Jens Axboe <axboe@fb.com>


# 34c0fd54 15-Jan-2016 Dan Williams <dan.j.williams@intel.com>

mm, dax, pmem: introduce pfn_t

For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags. These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory". Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target). However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2dbe5495 04-Nov-2015 Jan Kara <jack@suse.com>

brd: Refuse improperly aligned discard requests

Currently when improperly aligned discard request is submitted, we just
silently discard more / less data which results in filesystem corruption
in some cases. Refuse such misaligned requests.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# dece1635 05-Nov-2015 Jens Axboe <axboe@fb.com>

block: change ->make_request_fn() and users to return a queue cookie

No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>


# cb389b9c 07-Aug-2015 Dan Williams <dan.j.williams@intel.com>

dax: drop size parameter to ->direct_access()

None of the implementations currently use it. The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# e2e05394 18-Aug-2015 Ross Zwisler <zwisler@kernel.org>

pmem, dax: have direct_access use __pmem annotation

Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer. This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 4246a0b6 20-Jul-2015 Christoph Hellwig <hch@lst.de>

block: add a bi_error field to struct bio

Currently we have two different ways to signal an I/O error on a BIO:

(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2bb4cd5c 14-Jul-2015 Jens Axboe <axboe@fb.com>

block: have drivers use blk_queue_max_discard_sectors()

Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a7a97fc9 16-Feb-2015 Matthew Wilcox <willy@infradead.org>

brd: rename XIP to DAX

Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c8fa3173 07-Jan-2015 Boaz Harrosh <boaz@plexistor.com>

brd: Request from fdisk 4k alignment

Because of the direct_access() API which returns a PFN. partitions
better start on 4K boundary, else offset ZERO of a partition will
not be aligned and blk_direct_access() will fail the call.

By setting blk_queue_physical_block_size(PAGE_SIZE) we can communicate
this to fdisk and friends.

The call to blk_queue_physical_block_size() is harmless and will
not affect the Kernel behavior in any way. It is only for
communication to user-mode.

before this patch running fdisk on a default size brd of 4M
the first sector offered is 34 (BAD), but after this patch it
will be 40, ie 8 sectors aligned. Also when entering some random
partition sizes the next partition-start sector is offered 8 sectors
aligned after this patch. (Please note that with fdisk the user
can still enter bad values, only the offered default values will
be correct)

Note that with bdev-size > 4M fdisk will try to align on a 1M
boundary (above first-sector will be 2048), in any case.

CC: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 937af5ec 07-Jan-2015 Boaz Harrosh <boaz@plexistor.com>

brd: Fix all partitions BUGs

This patch fixes up brd's partitions scheme, now enjoying all worlds.

The MAIN fix here is that currently, if one fdisks some partitions,
a BAD bug will make all partitions point to the same start-end sector
ie: 0 - brd_size And an mkfs of any partition would trash the partition
table and the other partition.

Another fix is that "mount -U uuid" will not work if show_part was not
specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag.
We now always load without it and remove the show_part parameter.

[We remove Dmitry's new module-param part_show it is now always
show]

So NOW the logic goes like this:
* max_part - Just says how many minors to reserve between ramX
devices. In any way, there can be as many partition as requested.
If minors between devices ends, then dynamic 259-major ids will
be allocated on the fly.
The default is now max_part=1, which means all partitions devt(s)
will be from the dynamic (259) major-range.
(If persistent partition minors is needed use max_part=X)
For example with /dev/sdX max_part is hard coded 16.

* Creation of new devices on the fly still/always work:
mknod /path/devnod b 1 X
fdisk -l /path/devnod
Will create a new device if [X / max_part] was not already
created before. (Just as before)

partitions on the dynamically created device will work as well
Same logic applies with minors as with the pre-created ones.

TODO: dynamic grow of device size. So each device can have it's
own size.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# dd22f551 07-Jan-2015 Matthew Wilcox <willy@infradead.org>

block: Change direct_access calling convention

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# aeac3181 17-Aug-2014 Dmitry Monakhov <dmonakhov@openvz.org>

brd: add ram disk visibility option

Currenly ram disk is not visiable inside /proc/partitions. This was
done for compatibility reasons here: 53978d0a7a27. But some utilities
expect disk presents in /proc/partitions.
Let's add module's option and let's administrator chose visibility behaviour.
By default, old behaviour preserved.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 96f8d8e0 04-Jun-2014 Matthew Wilcox <willy@infradead.org>

brd: return -ENOSPC rather than -ENOMEM on page allocation failure

brd is effectively a thinly provisioned device. Thinly provisioned
devices return -ENOSPC when they can't write a new block. -ENOMEM is an
implementation detail that callers shouldn't know.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a72132c3 04-Jun-2014 Matthew Wilcox <willy@infradead.org>

brd: add support for rw_page()

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7988613b 23-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: Convert bio_for_each_segment() to bvec_iter

More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>


# 4f024f37 11-Oct-2013 Kent Overstreet <kmo@daterainc.com>

block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6


# a207f593 13-Oct-2013 Mikulas Patocka <mpatocka@redhat.com>

block: fix a probe argument to blk_register_region

The probe function is supposed to return NULL on failure (as we can see in
kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;

However, in loop and brd, it returns negative error from ERR_PTR.

This causes a crash if we simulate disk allocation failure and run
less -f /dev/loop0 because the negative number is interpreted as a pointer:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
PGD 23c677067 PUD 23d6d1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
CPU: 1 PID: 6831 Comm: less Tainted: P W O 3.10.15-devel #18
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
RIP: 0010:[<ffffffff8118b188>] [<ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP: 0018:ffff88023e47dbd8 EFLAGS: 00010286
RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
FS: 00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
Call Trace:
[<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
[<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
[<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
[<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
[<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
[<ffffffff8118c365>] blkdev_open+0x65/0x80
[<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
[<ffffffff8114d214>] finish_open+0x34/0x50
[<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
[<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
[<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
[<ffffffff8115f4f0>] do_filp_open+0x40/0x90
[<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
[<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
[<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
[<ffffffff8114e559>] SyS_open+0x19/0x20
[<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5
a4 07 00 44 89
RIP [<ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP <ffff88023e47dbd8>
CR2: 00000000000002b4
---[ end trace bb7f32dbf02398dc ]---

The brd change should be backported to stable kernels starting with 2.6.25.
The loop change should be backported to stable kernels starting with 2.6.22.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org # 2.6.22+
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dfd20b2b 24-May-2013 Brian Behlendorf <behlendorf1@llnl.gov>

drivers/block/brd.c: fix brd_lookup_page() race

The index on the page must be set before it is inserted in the radix
tree. Otherwise there is a small race which can occur during lookup
where the page can be found with the incorrect index. This will trigger
the BUG_ON() in brd_lookup_page().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Chris Wedgwood <cw@f00f.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f73a1c7d 25-Sep-2012 Kent Overstreet <koverstreet@google.com>

block: Add bio_end_sector()

Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>


# cfd8005c 25-Nov-2011 Cong Wang <amwang@redhat.com>

block: remove the second argument of k[un]map_atomic()

Signed-off-by: Cong Wang <amwang@redhat.com>


# ff01bb48 16-Sep-2011 Al Viro <viro@zeniv.linux.org.uk>

fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5a7bbad2 11-Sep-2011 Christoph Hellwig <hch@infradead.org>

block: remove support for bio remapping from ->make_request

There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 8892cbaf 26-May-2011 Namhyung Kim <namhyung@gmail.com>

brd: export module parameters

Export 'rd_nr', 'rd_size' and 'max_part' parameters to sysfs so user can
know that how many devices are allowed, how big each device is and how
many partitions are supported. If 'max_part' is 0, it means simply the
device doesn't support partitioning.

Also note that 'max_part' can be adjusted to power of 2 minus 1 form if
needed. User should check this value after the module loading if he/she
want to use that number correctly (i.e. fdisk, mknod, etc.).

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 13868b76 26-May-2011 Namhyung Kim <namhyung@gmail.com>

brd: fix comment on initial device creation

If 'rd_nr' param was not specified, 16 (can be adjusted via
CONFIG_BLK_DEV_RAM_COUNT) devices would be created by default
but comment said 1. Fix it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# af465668 26-May-2011 Namhyung Kim <namhyung@gmail.com>

brd: handle on-demand devices correctly

When finding or allocating a ram disk device, brd_probe() did not take
partition numbers into account so that it can result to a different
device. Consider following example (I set CONFIG_BLK_DEV_RAM_COUNT=4
for simplicity) :

$ sudo modprobe brd max_part=15
$ ls -l /dev/ram*
brw-rw---- 1 root disk 1, 0 2011-05-25 15:41 /dev/ram0
brw-rw---- 1 root disk 1, 16 2011-05-25 15:41 /dev/ram1
brw-rw---- 1 root disk 1, 32 2011-05-25 15:41 /dev/ram2
brw-rw---- 1 root disk 1, 48 2011-05-25 15:41 /dev/ram3
$ sudo mknod /dev/ram4 b 1 64
$ sudo dd if=/dev/zero of=/dev/ram4 bs=4k count=256
256+0 records in
256+0 records out
1048576 bytes (1.0 MB) copied, 0.00215578 s, 486 MB/s
namhyung@leonhard:linux$ ls -l /dev/ram*
brw-rw---- 1 root disk 1, 0 2011-05-25 15:41 /dev/ram0
brw-rw---- 1 root disk 1, 16 2011-05-25 15:41 /dev/ram1
brw-rw---- 1 root disk 1, 32 2011-05-25 15:41 /dev/ram2
brw-rw---- 1 root disk 1, 48 2011-05-25 15:41 /dev/ram3
brw-r--r-- 1 root root 1, 64 2011-05-25 15:45 /dev/ram4
brw-rw---- 1 root disk 1, 1024 2011-05-25 15:44 /dev/ram64

After this patch, /dev/ram4 - instead of /dev/ram64 - was
accessed correctly.

In addition, 'range' passed to blk_register_region() should
include all range of dev_t that RAMDISK_MAJOR can address.
It does not need to be limited by partition numbers unless
'rd_nr' param was specified.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 315980c8 26-May-2011 Namhyung Kim <namhyung@gmail.com>

brd: limit 'max_part' module param to DISK_MAX_PARTS

The 'max_part' parameter controls the number of maximum partition
a brd device can have. However if a user specifies very large
value it would exceed the limitation of device minor number and
can cause a kernel panic (or, at least, produce invalid device
nodes in some cases).

On my desktop system, following command kills the kernel. On qemu,
it triggers similar oops but the kernel was alive:

$ sudo modprobe brd max_part=100000
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffff81110a9a>] sysfs_create_dir+0x2d/0xae
PGD 7af1067 PUD 7b19067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file:
CPU 0
Modules linked in: brd(+)

Pid: 44, comm: insmod Tainted: G W 2.6.39-qemu+ #158 Bochs Bochs
RIP: 0010:[<ffffffff81110a9a>] [<ffffffff81110a9a>] sysfs_create_dir+0x2d/0xae
RSP: 0018:ffff880007b15d78 EFLAGS: 00000286
RAX: ffff880007b05478 RBX: ffff880007a52760 RCX: ffff880007b15dc8
RDX: ffff880007a4f900 RSI: ffff880007b15e48 RDI: ffff880007a52760
RBP: ffff880007b15da8 R08: 0000000000000002 R09: 0000000000000000
R10: ffff880007b15e48 R11: ffff880007b05478 R12: 0000000000000000
R13: ffff880007b05478 R14: 0000000000400920 R15: 0000000000000063
FS: 0000000002160880(0063) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000058 CR3: 0000000007b1c000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
Process insmod (pid: 44, threadinfo ffff880007b14000, task ffff880007acb980)
Stack:
ffff880007b15dc8 ffff880007b05478 ffff880007b15da8 00000000fffffffe
ffff880007a52760 ffff880007b05478 ffff880007b15de8 ffffffff81143c0a
0000000000400920 ffff880007a52760 ffff880007b05478 0000000000000000
Call Trace:
[<ffffffff81143c0a>] kobject_add_internal+0xdf/0x1a0
[<ffffffff81143da1>] kobject_add_varg+0x41/0x50
[<ffffffff81143e6b>] kobject_add+0x64/0x66
[<ffffffff8113bbe7>] blk_register_queue+0x5f/0xb8
[<ffffffff81140f72>] add_disk+0xdf/0x289
[<ffffffffa00040df>] brd_init+0xdf/0x1aa [brd]
[<ffffffffa0004000>] ? 0xffffffffa0003fff
[<ffffffffa0004000>] ? 0xffffffffa0003fff
[<ffffffff8100020a>] do_one_initcall+0x7a/0x12e
[<ffffffff8108516c>] sys_init_module+0x9c/0x1dc
[<ffffffff812ff4bb>] system_call_fastpath+0x16/0x1b
Code: 89 e5 41 55 41 54 53 48 89 fb 48 83 ec 18 48 85 ff 75 04 0f 0b eb fe 48 8b 47 18 49 c7 c4 70 1e 4d 81 48 85 c0 74 04 4c 8b 60 30
8b 44 24 58 45 31 ed 0f b6 c4 85 c0 74 0d 48 8b 43 28 48 89
RIP [<ffffffff81110a9a>] sysfs_create_dir+0x2d/0xae
RSP <ffff880007b15d78>
CR2: 0000000000000058
---[ end trace aebb1175ce1f6739 ]---

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# a2cba291 26-May-2011 Namhyung Kim <namhyung@gmail.com>

brd: get rid of unused members from struct brd_device

brd_refcnt, brd_offset, brd_sizelimit and brd_blocksize in struct
brd_device seem to be copied from struct loop_device but they're
not used anywhere. Let get rid of them.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 2a48fc0a 02-Jun-2010 Arnd Bergmann <arnd@arndb.de>

block: autoconvert trivial BKL users to private mutex

The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.

This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.

file=$1
name=$2
if grep -q lock_kernel ${file} ; then
if grep -q 'include.*linux.mutex.h' ${file} ; then
sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
else
sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
fi
sed -i ${file} \
-e "/^#include.*linux.mutex.h/,$ {
1,/^\(static\|int\|long\)/ {
/^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);

} }" \
-e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
-e '/[ ]*cycle_kernel_lock();/d'
else
sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \
-e '/cycle_kernel_lock()/d'
fi

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 4913efe4 03-Sep-2010 Tejun Heo <tj@kernel.org>

block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().

blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
device has write cache and can flush it, it should set REQ_FLUSH. If
the device can handle FUA writes, it should also set REQ_FUA.

All blk_queue_ordered() users are converted.

* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 6958f145 03-Sep-2010 Tejun Heo <tj@kernel.org>

block: kill QUEUE_ORDERED_BY_TAG

Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon which will render
the advantage of tag ordering moot. Kill ORDERED_BY_TAG. The
following users are affected.

* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated. Removed.
* xen-blkfront: ORDERED_TAG case dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 8a6cfeb6 08-Jul-2010 Arnd Bergmann <arnd@arndb.de>

block: push down BKL into .locked_ioctl

As a preparation for the removal of the big kernel
lock in the block layer, this removes the BKL
from the common ioctl handling code, moving it
into every single driver still using it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 00fff265 03-Jul-2010 FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>

block: remove q->prepare_flush_fn completely

This removes q->prepare_flush_fn completely (changes the
blk_queue_ordered API).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7b6d91da 07-Aug-2010 Christoph Hellwig <hch@lst.de>

block: unify flags for struct bio and struct request

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# b7c33571 26-May-2010 Nick Piggin <npiggin@suse.de>

brd: support discard

Support discard requests in brd by zeroing or deleting the underlying backing
pages. This is simply to help with testing and documentation nature of
brd code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 086fa5ff 25-Feb-2010 Martin K. Petersen <martin.petersen@oracle.com>

block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors

The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.

Also introduce a temporary wrapper for backwards compability. This can
be removed after the merge window is closed.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 83d5cde4 21-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: make block_device_operations const

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1adbee50 10-Jun-2009 Robert P. J. Day <rpjday@crashcourse.ca>

ramdisk: remove long-deprecated "ramdisk=" boot-time parameter

The "ramdisk" parameter was removed from the defunct rd.c file quite some
time ago, in favour of the more specific "ramdisk_size" parameter so, for
consistency, the same should be done here.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# c2572f2b 15-Apr-2009 Nick Piggin <npiggin@suse.de>

brd: fix cacheflushing

brd is missing a flush_dcache_page. On 2nd thoughts, perhaps it is the
pagecache's responsibility to flush user virtual aliases (the driver of
course should flush kernel virtual mappings)... but anyway, there
already exists cache flushing for one direction of transfer, so we
should add the other.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# dfbc4752 15-Apr-2009 Nick Piggin <npiggin@suse.de>

brd: support barriers

brd is always ordered (not that it matters, as it is defined not to
survive when the system goes down). So tell the block layer it is
ordered, which might be of help with testing filesystems.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 2b9ecd03 02-Mar-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] switch brd

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d4430d62 02-Mar-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] beginning of methods conversion

To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.

Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.

New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c82f2966 20-Aug-2008 Akinobu Mita <akinobu.mita@gmail.com>

brd: fix name argument of unregister_blkdev()

The name of brd block device is "ramdisk", it's not "brd".
(The block device is registered by register_blkdev(RAMDISK_MAJOR, "ramdisk")
So it should be unregistered by unregister_blkdev(RAMDISK_MAJOR, "ramdisk")

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# efedf51c 04-Jun-2008 Nick Piggin <npiggin@suse.de>

Add 'rd' alias to new brd ramdisk driver

Alias brd to rd in the hope of helping legacy users. Suggested by Jan.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 53978d0a 23-May-2008 Marcin Krol <hawk@pld-linux.org>

brd: don't show ramdisks in /proc/partitions

In 2.6.25, ramdisk devices show up in /proc/partitions, which is a
behaviour change from the old rd.c. Add GENHD_FL_SUPPRESS_PARTITION_INFO,
which was present in rd.c.

All kernels prior to 2.6.25 weren't displaying ramdisks in
/proc/partitions. Since there are many userspace tools using information
from /proc/partitions some of them may now behave incorrectly (I didn't
tested any though). For example before 2.6.25 /proc/partitions was empty
if no block devices like hard disks and such were detected by kernel. Now
all 16 ramdisks are always visible there. Some software may rely on such
information (I mean, on empty /proc/partitions).

There was quite similar situation back in 2004, and ramdisks were excluded
back from displaying. Thats why I called this a regression (maybe a bit
unfortunate). See this patch for info:
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3-rc2/2.6.3-rc2-mm1/broken-out/nbd-proc-partitions-fix.patch

I also think that someone somewhere (long time ago) excluded ramdisks from
/proc/partitions for good reasons. It is possible that now such new
"feature" is harmless, but I think there are more chances that someone
will say "hey, /proc/partitions has changed, now my software doesn't work"
then "hey where did my new 2.6.25 feature go". nbd devices are also
excluded, maybe for very same (unknown to me) reasons.

Signed-off-by: Marcin Krol <hawk@pld-linux.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7853d1f 30-Apr-2008 Laurent Vivier <Laurent.Vivier@bull.net>

brd: modify ramdisk device to be able to manage partitions

This patch adds partition management for Block RAM Device (BRD).

This patch is done to keep in sync BRD and loop device drivers.

This patch adds a parameter to the module, max_part, to specify
the maximum number of partitions per RAM device.

Example:

# modprobe brd max_part=63
# ls -l /dev/ram*
brw-rw---- 1 root disk 1, 0 2008-04-03 13:39 /dev/ram0
brw-rw---- 1 root disk 1, 64 2008-04-03 13:39 /dev/ram1
brw-rw---- 1 root disk 1, 640 2008-04-03 13:39 /dev/ram10
brw-rw---- 1 root disk 1, 704 2008-04-03 13:39 /dev/ram11
brw-rw---- 1 root disk 1, 768 2008-04-03 13:39 /dev/ram12
brw-rw---- 1 root disk 1, 832 2008-04-03 13:39 /dev/ram13
brw-rw---- 1 root disk 1, 896 2008-04-03 13:39 /dev/ram14
brw-rw---- 1 root disk 1, 960 2008-04-03 13:39 /dev/ram15
brw-rw---- 1 root disk 1, 128 2008-04-03 13:39 /dev/ram2
brw-rw---- 1 root disk 1, 192 2008-04-03 13:39 /dev/ram3
brw-rw---- 1 root disk 1, 256 2008-04-03 13:39 /dev/ram4
brw-rw---- 1 root disk 1, 320 2008-04-03 13:39 /dev/ram5
brw-rw---- 1 root disk 1, 384 2008-04-03 13:39 /dev/ram6
brw-rw---- 1 root disk 1, 448 2008-04-03 13:39 /dev/ram7
brw-rw---- 1 root disk 1, 512 2008-04-03 13:39 /dev/ram8
brw-rw---- 1 root disk 1, 576 2008-04-03 13:39 /dev/ram9
# fdisk /dev/ram0
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): o
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-2, default 1): 1
Last cylinder or +size or +sizeM or +sizeK (1-2, default 2): 2

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
# ls -l /dev/ram0*
brw-rw---- 1 root disk 1, 0 2008-04-03 13:40 /dev/ram0
brw-rw---- 1 root disk 1, 1 2008-04-03 13:40 /dev/ram0p1
# mkfs /dev/ram0p1
mke2fs 1.40-WIP (14-Nov-2006)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
4016 inodes, 16032 blocks
801 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=16515072
2 block groups
8192 blocks per group, 8192 fragments per group
2008 inodes per group
Superblock backups stored on blocks:
8193

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mount /dev/ram0p1 /mnt
df /mnt
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ram0p1 15521 138 14582 1% /mnt
# ls -l /mnt
total 12
drwx------ 2 root root 12288 2008-04-03 13:41 lost+found
# umount /mnt
# rmmod brd

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30afcb4b 28-Apr-2008 Jared Hulbert <jaredeh@gmail.com>

return pfn from direct_access, for XIP

Alter the block device ->direct_access() API to work with the new
get_xip_mem() API (that requires both kaddr and pfn are returned).

Some architectures will not do the right thing in their virt_to_page() for use
by XIP (to translate from the kernel virtual address returned by
direct_access(), to a user mappable pfn in XIP's page fault handler.

However, we can't switch it to just return the pfn and not the kaddr, because
we have no good way to get a kva from a pfn, and XIP requires the kva for its
read(2) and write(2) handlers. So we have to return both.

Signed-off-by: Jared Hulbert <jaredeh@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-mm@kvack.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 26defe34 21-Apr-2008 Petr Tesarik <ptesarik@suse.cz>

fix brd allocation flags

While looking at the implementation of the Ram backed block device
driver, I stumbled across a write-only local variable, which makes
little sense, so I assume it should actually work like this:

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75acb9cd 08-Feb-2008 Nick Piggin <npiggin@suse.de>

rd: support XIP

Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9db5579b 08-Feb-2008 Nick Piggin <npiggin@suse.de>

rewrite rd

This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. It relies on the dirty
bit being set, to pin its backing store in cache, however there are non
trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
which had recently lead to data corruption. And in general it is completely
wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea about
vm/vfs stuff. It's backing store is similar to the buffer cache (a simple
radix-tree of pages), but it doesn't know anything about page cache (the pages
in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it), so
under memory intensive situations, footprint should effectively be the same --
maybe even a slight advantage to the new driver because it can also reclaim
buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

text data bss dec hex filename
2837 849 384 4070 fe6 drivers/block/rd.o
3528 371 12 3911 f47 drivers/block/brd.o

Text is larger, but data and bss are smaller, making total size smaller.

A few other nice things about it:
- Similar structure and layout to the new loop device handlinag.
- Dynamic ramdisk creation.
- Runtime flexible buffer head size (because it is no longer part of the
ramdisk code).
- Boot / load time flexible ramdisk size, which could easily be extended
to a per-ramdisk runtime changeable size (eg. with an ioctl).
- Can use highmem for the backing store.

[akpm@linux-foundation.org: fix build]
[byron.bbradley@gmail.com: make rd_size non-static]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Byron Bradley <byron.bbradley@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>