Cross Reference: /linux-master/drivers/block/loop.c

History log of /linux-master/drivers/block/loop.c
Revision	Date	Author	Comments
# 473516b3	13-Feb-2024	Christoph Hellwig <hch@lst.de>	loop: use the atomic queue limits update API Pass the default limits to blk_mq_alloc_disk and then use the queue_limits_{start,commit}_update API to change the limits in an atomic way on existing loop gendisks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 02aed4a1	13-Feb-2024	Christoph Hellwig <hch@lst.de>	loop: pass queue_limits to blk_mq_alloc_disk Pass the max_hw_sector limit loop sets at initialization time directly to blk_mq_alloc_disk instead of updating it right after the allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 65bdd16f	13-Feb-2024	Christoph Hellwig <hch@lst.de>	loop: cleanup loop_config_discard Initialize the local variables for the discard max sectors and granularity to zero as a sensible default, and then merge the calls assigning them to the queue limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 27e32cd2	13-Feb-2024	Christoph Hellwig <hch@lst.de>	block: pass a queue_limits argument to blk_mq_alloc_disk Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# baa7d536	17-Jan-2024	Christoph Hellwig <hch@lst.de>	loop: fix the the direct I/O support check when used on top of block devices __loop_update_dio only checks the alignment requirement for block backed file systems, but misses them for the case where the loop device is created directly on top of another block device. Due to this creating a loop device with default option plus the direct I/O flag on a > 512 byte sector size file system will lead to incorrect I/O being submitted to the lower block device and a lot of error from the lock layer. This can be seen with xfstests generic/563. Fix the code in __loop_update_dio by factoring the alignment check into a helper, and calling that also for the struct block_device of a block device inode. Also remove the TODO comment talking about dynamically switching between buffered and direct I/O, which is a would be a recipe for horrible performance and occasional data loss. Fixes: 2e5ab5f379f9 ("block: loop: prepare for supporing direct IO") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240117175901.871796-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3d77976c	27-Dec-2023	Christoph Hellwig <hch@lst.de>	loop: don't abuse BLK_DEF_MAX_SECTORS BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for the max_sectors limits. Don't use it to initialize max_hw_setors, which is a hardware / driver capacility. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227092305.279567-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 34c7db44b	27-Dec-2023	Christoph Hellwig <hch@lst.de>	loop: don't update discard limits from loop_set_status loop_set_status doesn't change anything relevant to the discard and write_zeroes setting, so don't bother calling loop_config_discard. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227082020.249427-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 269aed70	22-Nov-2023	Amir Goldstein <amir73il@gmail.com>	fs: move file_start_write() into vfs_iter_write() All the callers of vfs_iter_write() call file_start_write() just before calling vfs_iter_write() except for target_core_file's fd_do_rw(). Move file_start_write() from the callers into vfs_iter_write(). fd_do_rw() calls vfs_iter_write() with a non-regular file, so file_start_write() is a no-op. This is needed for fanotify "pre content" events. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-11-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
# ab6860f6	10-Aug-2023	Christoph Hellwig <hch@lst.de>	block: simplify the disk_force_media_change interface Hard code the events to DISK_EVENT_MEDIA_CHANGE as that is the only useful use case, and drop the superfluous return value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-9-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
# bb5faa99	20-Jul-2023	Mauricio Faria de Oliveira <mfo@canonical.com>	loop: do not enforce max_loop hard limit by (new) default Problem: The max_loop parameter is used for 2 different purposes: 1) initial number of loop devices to pre-create on init 2) maximum number of loop devices to add on access/open() Historically, its default value (zero) caused 1) to create non-zero number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on 2) to add devices with autoloading. However, the default value changed in commit 85c50197716c ("loop: Fix the max_loop commandline argument treatment when it is set to 0") to CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices. That does improve 1), but unfortunately it breaks 2), as the default behavior changed from no-limit to hard-limit. Example: For example, this userspace code broke for N >= CONFIG, if the user relied on the default value 0 for max_loop: mknod("/dev/loopN"); open("/dev/loopN"); // now fails with ENXIO Though affected users may "fix" it with (loop.)max_loop=0, this means to require a kernel parameter change on stable kernel update (that commit Fixes: an old commit in stable). Solution: The original semantics for the default value in 2) can be applied if the parameter is not set (ie, default behavior). This still keeps the intended function in 1) and 2) if set, and that commit's intended improvement in 1) if max_loop=0. Before 85c50197716c: - default: 1) CONFIG devices 2) no limit - max_loop=0: 1) CONFIG devices 2) no limit - max_loop=X: 1) X devices 2) X limit After 85c50197716c: - default: 1) CONFIG devices 2) CONFIG limit () - max_loop=0: 1) 0 devices () 2) no limit - max_loop=X: 1) X devices 2) X limit This commit: - default: 1) CONFIG devices 2) no limit () - max_loop=0: 1) 0 devices 2) no limit - max_loop=X: 1) X devices 2) X limit Future: The issue/regression from that commit only affects code under the CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is contained under it. Once that deprecated functionality/code is removed, the purpose 2) of max_loop (hard limit) is no longer in use, so the module parameter description can be changed then. Tests: Linux 6.4-rc7 CONFIG_BLK_DEV_LOOP_MIN_COUNT=8 CONFIG_BLOCK_LEGACY_AUTOLOAD=y - default (original) # ls -1 /dev/loop /dev/loop-control /dev/loop0 ... /dev/loop7 # ./test-loop open: /dev/loop8: No such device or address - default (patched) # ls -1 /dev/loop* /dev/loop-control /dev/loop0 ... /dev/loop7 # ./test-loop # - max_loop=0 (original & patched): # ls -1 /dev/loop* /dev/loop-control # ./test-loop # - max_loop=8 (original & patched): # ls -1 /dev/loop* /dev/loop-control /dev/loop0 ... /dev/loop7 # ./test-loop open: /dev/loop8: No such device or address - max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set) # ls -1 /dev/loop* /dev/loop-control # ./test-loop open: /dev/loop8: No such device or address Fixes: 85c50197716c ("loop: Fix the max_loop commandline argument treatment when it is set to 0") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 23881aec	20-Jul-2023	Mauricio Faria de Oliveira <mfo@canonical.com>	loop: deprecate autoloading callback loop_probe() The 'probe' callback in __register_blkdev() is only used under the CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard. The loop_probe() function is only used for that callback, so guard it too, accordingly. See commit fbdee71bb5d8 ("block: deprecate autoloading based on dev_t"). Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 05bdb996	08-Jun-2023	Christoph Hellwig <hch@lst.de>	block: replace fmode_t with a block-specific type for block open flags The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ae220766	08-Jun-2023	Christoph Hellwig <hch@lst.de>	block: remove the unused mode argument to ->release The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0718afd4	01-Jun-2023	Christoph Hellwig <hch@lst.de>	block: introduce holder ops Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# bb430b69	20-Mar-2023	Alyssa Ross <hi@alyssa.is>	loop: LOOP_CONFIGURE: send uevents for partitions LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall. When using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for each partition found on the loop device after the second ioctl(), but when using LOOP_CONFIGURE, no such uevent was being sent. In the old setup, uevents are disabled for LOOP_SET_FD, but not for LOOP_SET_STATUS64. This makes sense, as it prevents uevents being sent for a partially configured device during LOOP_SET_FD - they're only sent at the end of LOOP_SET_STATUS64. But for LOOP_CONFIGURE, uevents were disabled for the entire operation, so that final notification was never issued. To fix this, reduce the critical section to exclude the loop_reread_partitions() call, which causes the uevents to be issued, to after uevents are re-enabled, matching the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination. I noticed this because Busybox's losetup program recently changed from using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke my setup, for which I want a notification from the kernel any time a new partition becomes available. Signed-off-by: Alyssa Ross <hi@alyssa.is> [hch: reduced the critical section] Signed-off-by: Christoph Hellwig <hch@lst.de> Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl") Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 9b0cb770	14-Mar-2023	Bart Van Assche <bvanassche@acm.org>	loop: Fix use-after-free issues do_req_filebacked() calls blk_mq_complete_request() synchronously or asynchronously when using asynchronous I/O unless memory allocation fails. Hence, modify loop_handle_cmd() such that it does not dereference 'cmd' nor 'rq' after do_req_filebacked() finished unless we are sure that the request has not yet been completed. This patch fixes the following kernel crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000054 Call trace: css_put.42938+0x1c/0x1ac loop_process_work+0xc8c/0xfd4 loop_rootcg_workfn+0x24/0x34 process_one_work+0x244/0x558 worker_thread+0x400/0x8fc kthread+0x16c/0x1e0 ret_from_fork+0x10/0x20 Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Fixes: c74d40e8b5e2 ("loop: charge i/o to mem and blk cg") Fixes: bc07c10a3603 ("block: loop: support DIO & AIO") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230314182155.80625-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 9f6ad5d5	21-Feb-2023	Zhong Jinghua <zhongjinghua@huawei.com>	loop: loop_set_status_from_info() check before assignment In loop_set_status_from_info(), lo->lo_offset and lo->lo_sizelimit should be checked before reassignment, because if an overflow error occurs, the original correct value will be changed to the wrong value, and it will not be changed back. More, the original patch did not solve the problem, the value was set and ioctl returned an error, but the subsequent io used the value in the loop driver, which still caused an alarm: loop_handle_cmd do_req_filebacked loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset; lo_rw_aio cmd->iocb.ki_pos = pos Fixes: c490a0b5a4f3 ("loop: Check for overflow while configuring loop") Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230221095027.3656193-1-zhongjinghua@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# e152a05f	30-Jan-2023	Bart Van Assche <bvanassche@acm.org>	loop: Improve the hw_queue_depth kernel module parameter implementation Make the following minor changes which were reported by colleagues while reviewing this code: - Remove the parentheses from around the LOOP_DEFAULT_HW_Q_DEPTH definition since these are superfluous. - Accept other number formats than decimal, e.g. hexadecimal. - Do not set hw_queue_depth to an out-of-range value, even if that value won't be used. - Use the LOOP_DEFAULT_HW_Q_DEPTH macro in the kernel module parameter description to prevent that the description gets out of sync. This patch has been tested as follows: # modprobe -r loop # modprobe loop hw_queue_depth=-1 modprobe: ERROR: could not insert 'loop': Invalid argument # modprobe loop hw_queue_depth=0 modprobe: ERROR: could not insert 'loop': Invalid argument # modprobe loop hw_queue_depth=1; cat /sys/module/loop/parameters/hw_queue_depth 1 # modprobe -r loop; modprobe loop; cat /sys/module/loop/parameters/hw_queue_depth hw_queue_depth=0x10 16 # modprobe -r loop; modprobe loop; cat /sys/module/loop/parameters/hw_queue_depth hw_queue_depth=128 128 # modprobe -r loop; modprobe loop hw_queue_depth=129; cat /sys/module/loop/parameters/hw_queue_depth 129 # modprobe -r loop; modprobe loop hw_queue_depth=$((1<<32)) modprobe: ERROR: could not insert 'loop': Numerical result out of range See also commit ef44c50837ab ("loop: allow user to set the queue depth"). Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230130211347.832110-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 292a089d	20-Dec-2022	Steven Rostedt (Google) <rostedt@goodmis.org>	treewide: Convert del_timer() to timer_shutdown() Due to several bugs caused by timers being re-armed after they are shutdown and just before they are freed, a new state of timers was added called "shutdown". After a timer is set to this state, then it can no longer be re-armed. The following script was run to find all the trivial locations where del_timer() or del_timer_sync() is called in the same function that the object holding the timer is freed. It also ignores any locations where the timer->function is modified between the del_timer*() and the free(), as that is not considered a "trivial" case. This was created by using a coccinelle script and the following commands: $ cat timer.cocci @@ expression ptr, slab; identifier timer, rfield; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); \| - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... when strict when != ptr->timer ( kfree_rcu(ptr, rfield); \| kmem_cache_free(slab, ptr); \| kfree(ptr); ) $ spatch timer.cocci . > /tmp/t.patch $ patch -p1 < /tmp/t.patch Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ] Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ] Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 85c50197	08-Dec-2022	Isaac J. Manjarres <isaacmanjarres@google.com>	loop: Fix the max_loop commandline argument treatment when it is set to 0 Currently, the max_loop commandline argument can be used to specify how many loop block devices are created at init time. If it is not specified on the commandline, CONFIG_BLK_DEV_LOOP_MIN_COUNT loop block devices will be created. The max_loop commandline argument can be used to override the value of CONFIG_BLK_DEV_LOOP_MIN_COUNT. However, when max_loop is set to 0 through the commandline, the current logic treats it as if it had not been set, and creates CONFIG_BLK_DEV_LOOP_MIN_COUNT devices anyway. Fix this by starting max_loop off as set to CONFIG_BLK_DEV_LOOP_MIN_COUNT. This preserves the intended behavior of creating CONFIG_BLK_DEV_LOOP_MIN_COUNT loop block devices if the max_loop commandline parameter is not specified, and allowing max_loop to be respected for all values, including 0. This allows environments that can create all of their required loop block devices on demand to not have to unnecessarily preallocate loop block devices. Fixes: 732850827450 ("remove artificial software max_loop limit") Cc: stable@vger.kernel.org Cc: Ken Chen <kenchen@google.com> Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Link: https://lore.kernel.org/r/20221208212902.765781-1-isaacmanjarres@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# de4eda9d	15-Sep-2022	Al Viro <viro@zeniv.linux.org.uk>	use less confusing names for iov_iter direction initializers READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# c490a0b5	23-Aug-2022	Siddh Raman Pant <code@siddh.me>	loop: Check for overflow while configuring loop The userspace can configure a loop using an ioctl call, wherein a configuration of type loop_config is passed (see lo_ioctl()'s case on line 1550 of drivers/block/loop.c). This proceeds to call loop_configure() which in turn calls loop_set_status_from_info() (see line 1050 of loop.c), passing &config->info which is of type loop_info64*. This function then sets the appropriate values, like the offset. loop_device has lo_offset of type loff_t (see line 52 of loop.c), which is typdef-chained to long long, whereas loop_info64 has lo_offset of type __u64 (see line 56 of include/uapi/linux/loop.h). The function directly copies offset from info to the device as follows (See line 980 of loop.c): lo->lo_offset = info->lo_offset; This results in an overflow, which triggers a warning in iomap_iter() due to a call to iomap_iter_done() which has: WARN_ON_ONCE(iter->iomap.offset > iter->pos); Thus, check for negative value during loop_set_status_from_info(). Bug report: https://syzkaller.appspot.com/bug?id=c620fe14aac810396d3c3edc9ad73848bf69a29e Reported-and-tested-by: syzbot+a8e049cd3abd342936b6@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Siddh Raman Pant <code@siddh.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220823160810.181275-1-code@siddh.me Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8b9ab626	19-Jun-2022	Christoph Hellwig <hch@lst.de>	block: remove blk_cleanup_disk blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6f8191fd	19-Jun-2022	Christoph Hellwig <hch@lst.de>	block: simplify disk shutdown Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for all disks that do not have separately allocated queues, and thus remove the need to call blk_cleanup_queue for them. Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that this function is intended only for separately allocated blk-mq queues. This saves an extra queue freeze for devices without a separately allocated queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b9684a71	26-May-2022	Christoph Hellwig <hch@lst.de>	block, loop: support partitions without scanning Historically we did distinguish between a flag that surpressed partition scanning, and a combinations of the minors variable and another flag if any partitions were supported. This was generally confusing and doesn't make much sense, but some corner case uses of the loop driver actually do want to support manually added partitions on a device that does not actively scan for partitions. To make things worsee the loop driver also wants to dynamically toggle the scanning for partitions on a live gendisk, which makes the disk->flags updates non-atomic. Introduce a new GD_SUPPRESS_PART_SCAN bit in disk->state that disables just scanning for partitions, and toggle that instead of GENHD_FL_NO_PART in the loop driver. Fixes: 1ebe2e5f9d68 ("block: remove GENHD_FL_EXT_DEVT") Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220527055806.1972352-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a2ad63da	09-May-2022	NeilBrown <neilb@suse.de>	VFS: add FMODE_CAN_ODIRECT file flag Currently various places test if direct IO is possible on a file by checking for the existence of the direct_IO address space operation. This is a poor choice, as the direct_IO operation may not be used - it is only used if the generic_file__iter functions are called for direct IO and some filesystems - particularly NFS - don't do this. Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the various places to check this (avoiding pointer dereferences). do_dentry_open() will set this flag if ->direct_IO is present, so filesystems do not need to be changed. NFS is* changed, to set the flag explicitly and discard the direct_IO entry in the address_space_operations for files. Other filesystems which currently use noop_direct_IO could usefully be changed to set this flag instead. Link: https://lkml.kernel.org/r/164859778128.29473.15189737957277399416.stgit@noble.brown Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# eb04bb15	19-Apr-2022	Christoph Hellwig <hch@lst.de>	loop: remove most the top-of-file boilerplate comment Remove the irrelevant changelogs and todo notes and just leave the SPDX marker and the copyright notice. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220419063303.583106-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f21e6e18	19-Apr-2022	Christoph Hellwig <hch@lst.de>	loop: add a SPDX header The copyright statement says: "Redistribution of this file is permitted under the GNU General Public License." and was added by Ted in 1993, at which point GPLv2 only was the default Linux license. Replace it with the usual GPLv2 only SPDX header. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220419063303.583106-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 754d9679	19-Apr-2022	Christoph Hellwig <hch@lst.de>	loop: remove loop.h Merge loop.h into loop.c as all the content is only used there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220419063303.583106-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4418bfd8	17-Apr-2022	Christoph Hellwig <hch@lst.de>	loop: remove a spurious clear of discard_alignment The loop driver never sets a discard_alignment, so it also doens't need to clear it to zero. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# bbb1ebe7	19-Apr-2022	Christoph Hellwig <hch@lst.de>	blk-cgroup: replace bio_blkcg with bio_blkcg_css All callers of bio_blkcg actually want the CSS, so replace it with an interface that does return the CSS. This now allows to move struct blkcg_gq to block/blk-cgroup.h instead of exposing it in a public header. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220420042723.1010598-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d292dc80	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: don't destroy lo->workqueue in __loop_clr_fd There is no need to destroy the workqueue when clearing unbinding a loop device from a backing file. Not doing so on the other hand avoid creating a complex lock dependency chain involving the global system_transition_mutex. Based on a patch from Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>. Reported-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20220330052917.2566582-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a0e286b6	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: remove lo_refcount and avoid lo_mutex in ->open / ->release lo_refcount counts how many openers a loop device has, but that count is already provided by the block layer in the bd_openers field of the whole-disk block_device. Remove lo_refcount and allow opens to succeed even on devices beeing deleted - now that ->free_disk is implemented we can handle that race gracefull and all I/O on it will just fail. Similarly there is a small race window now where loop_control_remove does not synchronize the delete vs the remove due do bd_openers not being under lo_mutex protection, but we can handle that just as gracefully. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220330052917.2566582-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 158eaeba	29-Mar-2022	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	loop: avoid loop_validate_mutex/lo_mutex in ->release Since ->release is called with disk->open_mutex held, and __loop_clr_fd() from lo_release() is called via ->release when disk_openers() == 0, we are guaranteed that "struct file" which will be passed to loop_validate_file() via fget() cannot be the loop device __loop_clr_fd(lo, true) will clear. Thus, there is no need to hold loop_validate_mutex from __loop_clr_fd() if release == true. When I made commit 3ce6e1f662a91097 ("loop: reintroduce global lock for safe loop_validate_file() traversal"), I wrote "It is acceptable for loop_validate_file() to succeed, for actual clear operation has not started yet.". But now I came to feel why it is acceptable to succeed. It seems that the loop driver was added in Linux 1.3.68, and if (lo->lo_refcnt > 1) return -EBUSY; check in loop_clr_fd() was there from the beginning. The intent of this check was unclear. But now I think that current disk_openers(lo->lo_disk) > 1 form is there for three reasons. (1) Avoid I/O errors when some process which opens and reads from this loop device in response to uevent notification (e.g. systemd-udevd), as described in commit a1ecac3b0656a682 ("loop: Make explicit loop device destruction lazy"). This opener is short-lived because it is likely that the file descriptor used by that process is closed soon. (2) Avoid I/O errors caused by underlying layer of stacked loop devices (i.e. ioctl(some_loop_fd, LOOP_SET_FD, other_loop_fd)) being suddenly disappeared. This opener is long-lived because this reference is associated with not a file descriptor but lo->lo_backing_file. (3) Avoid I/O errors caused by underlying layer of mounted loop device (i.e. mount(some_loop_device, some_mount_point)) being suddenly disappeared. This opener is long-lived because this reference is associated with not a file descriptor but mount. While race in (1) might be acceptable, (2) and (3) should be checked racelessly. That is, make sure that __loop_clr_fd() will not run if loop_validate_file() succeeds, by doing refcount check with global lock held when explicit loop device destruction is requested. As a result of no longer waiting for lo->lo_mutex after setting Lo_rundown, we can remove pointless BUG_ON(lo->lo_state != Lo_rundown) check. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220330052917.2566582-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 498ef5c7	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: suppress uevents while reconfiguring the device Currently, udev change event is generated for a loop device before the device is ready for IO. Due to serialization on lo->lo_mutex in lo_open() this does not matter because anybody is able to open the device and do IO only after the configuration is finished. However this synchronization in lo_open() is going away so make sure userspace reacting to the change event will see the new device state by generating the event only when the device is setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220330052917.2566582-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d2c7f56f	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: implement ->free_disk Ensure that the lo_device which is stored in the gendisk private data is valid until the gendisk is freed. Currently the loop driver uses a lot of effort to make sure a device is not freed when it is still in use, but to to fix a potential deadlock this will be relaxed a bit soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220330052917.2566582-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1fe0b1ac	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: only freeze the queue in __loop_clr_fd when needed ->release is only called after all outstanding I/O has completed, so only freeze the queue when clearing the backing file of a live loop device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220330052917.2566582-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 46dc9674	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: don't freeze the queue in lo_release By the time the final ->release is called there can't be outstanding I/O. For non-final ->release there is no need for driver action at all. Thus remove the useless queue freeze. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220330052917.2566582-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 98ded54a	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: remove the racy bd_inode->i_mapping->nrpages asserts Nothing prevents a file system or userspace opener of the block device from redirtying the page right afte sync_blockdev returned. Fortunately data in the page cache during a block device change is mostly harmless anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220330052917.2566582-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b15ed546	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: initialize the worker tracking fields once There is no need to reinitialize idle_worker_list, worker_tree and timer every time a loop device is configured. Just initialize them once at allocation time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220330052917.2566582-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 2cf429b5	29-Mar-2022	Christoph Hellwig <hch@lst.de>	loop: de-duplicate the idle worker freeing code Use a common helper for both timer based and uncoditional freeing of idle workers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220330052917.2566582-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7b47ef52	14-Apr-2022	Christoph Hellwig <hch@lst.de>	block: add a bdev_discard_granularity helper Abstract away implementation details from file systems by providing a block_device based helper to retrieve the discard granularity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 70200574	14-Apr-2022	Christoph Hellwig <hch@lst.de>	block: remove QUEUE_FLAG_DISCARD Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 10f0d2a5	14-Apr-2022	Christoph Hellwig <hch@lst.de>	block: add a bdev_nonrot helper Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f941c51e	29-Mar-2022	Carlos Llamas <cmllamas@google.com>	loop: fix ioctl calls using compat_loop_info Support for cryptoloop was deleted in commit 47e9624616c8 ("block: remove support for cryptoloop and the xor transfer"), making the usage of loop_info->lo_encrypt_type obsolete. However, this member was also removed from the compat_loop_info definition and this breaks userspace ioctl calls for 32-bit binaries and CONFIG_COMPAT=y. This patch restores the compat_loop_info->lo_encrypt_type member and marks it obsolete as well as in the uapi header definitions. Fixes: 47e9624616c8 ("block: remove support for cryptoloop and the xor transfer") Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220329201815.1347500-1-cmllamas@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ce8d7861	15-Mar-2022	Christoph Hellwig <hch@lst.de>	nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH Start warning about exposing a namespace as multiple block devices, and set a fixed deprecation release. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
# ef44c508	15-Feb-2022	Chaitanya Kulkarni <kch@nvidia.com>	loop: allow user to set the queue depth Instead of hardcoding queue depth allow user to set the hw queue depth using module parameter. Set default value to 128 to retain the existing behavior. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-5-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 9c64e38c	15-Feb-2022	Chaitanya Kulkarni <kch@nvidia.com>	loop: remove extra variable in lo_req_flush The local variable file is used to pass it to the vfs_fsync(). We can get away with using lo->lo_backing_file instead of storing in a local variable which is not used anywhere else. No functional change in this patch. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-4-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0aab29b8	15-Feb-2022	Chaitanya Kulkarni <kch@nvidia.com>	loop: remove extra variable in lo_fallocate() The local variable q is used to pass it to the blk_queue_discard(). We can get away with using lo->lo_queue instead of storing in a local variable which is not used anywhere else. No functional change in this patch. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-3-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b27824d3	15-Feb-2022	Chaitanya Kulkarni <kch@nvidia.com>	loop: use sysfs_emit() in the sysfs xxx show() sprintf does not know the PAGE_SIZE maximum of the temporary buffer used for outputting sysfs content and it's possible to overrun the PAGE_SIZE buffer length. Use a generic sysfs_emit function that knows the size of the temporary buffer and ensures that no overrun is done for offset attribute in loop_attr_[offset\|sizelimit\|autoclear\|partscan\|dio]_show() callbacks. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d9a74051	08-Feb-2022	Colin Ian King <colin.king@intel.com>	loop: clean up grammar in warning message The phrase "has still" should be "still has" to clean up the grammar. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20220208114656.61629-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 06582bc8	25-Jan-2022	Ming Lei <ming.lei@redhat.com>	block: loop:use kstatfs.f_bsize of backing file to set discard granularity If backing file's filesystem has implemented ->fallocate(), we think the loop device can support discard, then pass sb->s_blocksize as discard_granularity. However, some underlying FS, such as overlayfs, doesn't set sb->s_blocksize, and causes discard_granularity to be set as zero, then the warning in __blkdev_issue_discard() is triggered. Christoph suggested to pass kstatfs.f_bsize as discard granularity, and this way is fine because kstatfs.f_bsize means 'Optimal transfer block size', which still matches with definition of discard granularity. So fix the issue by setting discard_granularity as kstatfs.f_bsize if it is available, otherwise claims discard isn't supported. Cc: Christoph Hellwig <hch@lst.de> Cc: Vivek Goyal <vgoyal@redhat.com> Reported-by: Pei Zhang <pezhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220126035830.296465-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# bf23747e	11-Feb-2022	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	loop: revert "make autoclear operation asynchronous" The kernel test robot is reporting that xfstest which does umount ext2 on xfs umount xfs sequence started failing, for commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous") removed a guarantee that fput() of backing file is processed before lo_release() from close() returns to user mode. And syzbot is reporting that deferring destroy_workqueue() from __loop_clr_fd() to a WQ context did not help [1]. Revert that commit. Link: https://syzkaller.appspot.com/bug?extid=831661966588c802aae9 [1] Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: syzbot <syzbot+831661966588c802aae9@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/20220211071554.3424-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 413ec805	12-Jan-2022	Colin Ian King <colin.king@intel.com>	loop: remove redundant initialization of pointer node The pointer node is being initialized with a value that is never read, it is being re-assigned the same value a little futher on. Remove the redundant initialization. Cleans up clang scan warning: drivers/block/loop.c:823:19: warning: Value stored to 'node' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20220113001432.1331871-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 322c4293	13-Dec-2021	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>	loop: make autoclear operation asynchronous syzbot is reporting circular locking problem at __loop_clr_fd() [1], for commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker") is calling destroy_workqueue() with disk->open_mutex held. This circular dependency cannot be broken unless we call __loop_clr_fd() without holding disk->open_mutex. Therefore, defer __loop_clr_fd() from lo_release() to a WQ context. Link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372 [1] Reported-by: syzbot <syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: Jan Kara <jack@suse.cz> Tested-by: syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/1ed7df28-ebd6-71fb-70e5-1c2972e05ddb@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6050fa4c	24-Nov-2021	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>	loop: don't hold lo_mutex during __loop_clr_fd() syzbot is reporting circular locking problem at __loop_clr_fd() [1], for commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker") is calling destroy_workqueue() with lo->lo_mutex held. Since all functions where lo->lo_state matters are already checking lo->lo_state with lo->lo_mutex held (in order to avoid racing with e.g. ioctl(LOOP_CTL_REMOVE)), and __loop_clr_fd() can be called from either ioctl(LOOP_CLR_FD) xor close(), lo->lo_state == Lo_rundown is considered as an exclusive lock for __loop_clr_fd(). Therefore, hold lo->lo_mutex inside __loop_clr_fd() only when asserting/updating lo->lo_state. Since ioctl(LOOP_CLR_FD) depends on lo->lo_state == Lo_bound, a valid lo->lo_backing_file must have been assigned by ioctl(LOOP_SET_FD) or ioctl(LOOP_CONFIGURE). Thus, we can remove lo->lo_backing_file test, and convert __loop_clr_fd() into a void function. Link: https://syzkaller.appspot.com/bug?extid=63614029dfb79abd4383 [1] Reported-by: syzbot <syzbot+63614029dfb79abd4383@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/8ebe3b2e-8975-7f26-0620-7144a3b8b8cd@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1ebe2e5f	22-Nov-2021	Christoph Hellwig <hch@lst.de>	block: remove GENHD_FL_EXT_DEVT All modern drivers can support extra partitions using the extended dev_t. In fact except for the ioctl method drivers never even see partitions in normal operation. So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all block devices that do support partitions, and require those that do not support partitions to explicit disallow them using GENHD_FL_NO_PART. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 46e7eac6	22-Nov-2021	Christoph Hellwig <hch@lst.de>	block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART The GENHD_FL_NO_PART_SCAN controls more than just partitions canning, so rename it to GENHD_FL_NO_PART. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# e3f9387a	29-Nov-2021	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>	loop: Use pr_warn_once() for loop_control_remove() warning kernel test robot reported that RCU stall via printk() flooding is possible [1] when stress testing. Link: https://lkml.kernel.org/r/20211129073709.GA18483@xsang-OptiPlex-9020 [1] Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 18c6c968	04-Nov-2021	luo penghao <luo.penghao@zte.com.cn>	loop: Remove duplicate assignments The assignment and operation there will be overwritten later, so it should be deleted. The clang_analyzer complains as follows: drivers/block/loop.c:2330:2 warning: Value stored to 'err' is never read change in v2: Repair the sending email box Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Link: https://lore.kernel.org/r/20211104064546.3074-1-luo.penghao@zte.com.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
# af3c570f	26-Oct-2021	Xie Yongji <xieyongji@bytedance.com>	loop: Use blk_validate_block_size() to validate block size Remove loop_validate_block_size() and use the block layer helper to validate block size. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20211026144015.188-4-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6b19b766	21-Oct-2021	Jens Axboe <axboe@kernel.dk>	fs: get rid of the res2 iocb->ki_complete argument The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 47e96246	19-Oct-2021	Christoph Hellwig <hch@lst.de>	block: remove support for cryptoloop and the xor transfer Support for cyrptoloop has been officially marked broken and deprecated in favor of dm-crypt (which supports the same broken algorithms if needed) in Linux 2.6.4 (released in March 2004), and support for it has been entirely removed from losetup in util-linux 2.23 (released in April 2013). The XOR transfer has never been more than a toy to demonstrate the transfer in the bad old times of crypto export restrictions. Remove them as they have some nasty interactions with loop device life times due to the iteration over all loop devices in loop_unregister_transfer. Suggested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211019075639.2333969-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 19f553db	22-Sep-2021	Xie Yongji <xieyongji@bytedance.com>	loop: Remove the unnecessary bdev checks and unused bdev variable The lo->lo_device can't be null if the lo->lo_backing_file is set. So let's remove the unnecessary bdev checks and the entire bdev variable in __loop_clr_fd() since the lo->lo_backing_file is already checked before. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-4-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# e515be8f	22-Sep-2021	Xie Yongji <xieyongji@bytedance.com>	loop: Use invalidate_disk() helper to invalidate gendisk Use invalidate_disk() helper to simplify the code for gendisk invalidation. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-3-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 905705f0	27-Sep-2021	Luis Chamberlain <mcgrof@kernel.org>	loop: add error handling support for add_disk() We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1c500ad7	01-Sep-2021	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>	loop: reduce the loop_ctl_mutex scope syzbot is reporting circular locking problem at __loop_clr_fd() [1], for commit a160c6159d4a0cf8 ("block: add an optional probe callback to major_names") is calling the module's probe function with major_names_lock held. Fortunately, since commit 990e78116d38059c ("block: loop: fix deadlock between open and remove") stopped holding loop_ctl_mutex in lo_open(), current role of loop_ctl_mutex is to serialize access to loop_index_idr and loop_add()/loop_remove(); in other words, management of id for IDR. To avoid holding loop_ctl_mutex during whole add/remove operation, use a bool flag to indicate whether the loop device is ready for use. loop_unregister_transfer() which is called from cleanup_cryptoloop() currently has possibility of use-after-free problem due to lack of serialization between kfree() from loop_remove() from loop_control_remove() and mutex_lock() from unregister_transfer_cb(). But since lo->lo_encryption should be already NULL when this function is called due to module unload, and commit 222013f9ac30b9ce ("cryptoloop: add a deprecation warning") indicates that we will remove this function shortly, this patch updates this function to emit warning instead of checking lo->lo_encryption. Holding loop_ctl_mutex in loop_exit() is pointless, for all users must close /dev/loop-control and /dev/loop$num (in order to drop module's refcount to 0) before loop_exit() starts, and nobody can open /dev/loop-control or /dev/loop$num afterwards. Link: https://syzkaller.appspot.com/bug?id=7bb10e8b62f83e4d445cdf4c13d69e407e629558 [1] Reported-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/adb1e792-fc0e-ee81-7ea0-0906fc36419d@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 2112f5c1	05-Aug-2021	Bart Van Assche <bvanassche@acm.org>	loop: Select I/O scheduler 'none' from inside add_disk() We noticed that the user interface of Android devices becomes very slow under memory pressure. This is because Android uses the zram driver on top of the loop driver for swapping, because under memory pressure the swap code alternates reads and writes quickly, because mq-deadline is the default scheduler for loop devices and because mq-deadline delays writes by five seconds for such a workload with default settings. Fix this by making the kernel select I/O scheduler 'none' from inside add_disk() for loop devices. This default can be overridden at any time from user space, e.g. via a udev rule. This approach has an advantage compared to changing the I/O scheduler from userspace from 'mq-deadline' into 'none', namely that synchronize_rcu() does not get called. This patch changes the default I/O scheduler for loop devices from 'mq-deadline' into 'none'. Additionally, this patch reduces the Android boot time on my test setup with 0.5 seconds compared to configuring the loop I/O scheduler from user space. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Martijn Coenen <maco@android.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210805174200.3250718-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 9f65c489	12-Jul-2021	Matteo Croce <mcroce@microsoft.com>	loop: raise media_change event Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach. # udevadm monitor -up \|grep -e DISK_MEDIA_CHANGE -e DEVNAME & # losetup -f zero [ 7.454235] loop0: detected capacity change from 0 to 16384 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop0 DEVNAME=/dev/loop0 DEVNAME=/dev/loop0 # losetup -f zero [ 10.205245] loop1: detected capacity change from 0 to 16384 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop1 DEVNAME=/dev/loop1 DEVNAME=/dev/loop1 # losetup -f zero2 [ 13.532368] loop2: detected capacity change from 0 to 40960 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop2 DEVNAME=/dev/loop2 # losetup -D DEVNAME=/dev/loop1 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop1 DEVNAME=/dev/loop2 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop2 DEVNAME=/dev/loop0 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop0 Signed-off-by: Matteo Croce <mcroce@microsoft.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20210712230530.29323-7-mcroce@linux.microsoft.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4b273122	22-Jul-2021	Christoph Hellwig <hch@lst.de>	loop: don't grab a reference to the block device The whole device block device won't be removed while the disk is still alive, so don't bother to grab a reference to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Ming Lei <ming.lei@rehat.com> Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Link: https://lore.kernel.org/r/20210722075402.983367-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3ce6e1f6	06-Jul-2021	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>	loop: reintroduce global lock for safe loop_validate_file() traversal Commit 6cc8e7430801fa23 ("loop: scale loop device by introducing per device lock") re-opened a race window for NULL pointer dereference at loop_validate_file() where commit 310ca162d779efee ("block/loop: Use global lock for ioctl() operation.") has closed. Although we need to guarantee that other loop devices will not change during traversal, we can't take remote "struct loop_device"->lo_mutex inside loop_validate_file() in order to avoid AB-BA deadlock. Therefore, introduce a global lock dedicated for loop_validate_file() which is conditionally taken before local "struct loop_device"->lo_mutex is taken. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 6cc8e7430801fa23 ("loop: scale loop device by introducing per device lock") Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 585af8ed	02-Jul-2021	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	loop: remove unused variable in loop_set_status() Commit 0384264ea8a39bd9 ("block: pass a gendisk to bdev_disk_changed") changed to pass lo->lo_disk instead of lo->lo_device. Fixes: 0384264ea8a3 ("block: pass a gendisk to bdev_disk_changed") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/20210702152714.7978-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8e60947d	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: rewrite loop_exit using idr_for_each_entry Use idr_for_each_entry to simplify removing all devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210623145908.92973-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b9848081	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: split loop_lookup loop_lookup has two callers - one wants to do the a find by index and the other wants any unbound loop device. Open code the respective functionality in each caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210623145908.92973-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# e5d66a10	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: don't allow deleting an unspecified loop device Passing a negative index to loop_lookup while return any unbound device. Doing that for a delete does not make much sense, so add check to explicitly reject that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210623145908.92973-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 18d1f200	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: move loop_ctl_mutex locking into loop_add Move acquiring and releasing loop_ctl_mutex from the callers into loop_add. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210623145908.92973-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f9d10764	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: split loop_control_ioctl Split loop_control_ioctl into a helper for each command. This keeps the code nicely separated for the upcoming locking changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210623145908.92973-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4157fe0b	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: don't call loop_lookup before adding a loop device loop_add returns the right error if the slot wasn't available. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210623145908.92973-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d6da83d0	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: remove the l argument to loop_add None of the callers cares about the allocated struct loop_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210623145908.92973-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# bd5c39ed	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: reduce loop_ctl_mutex coverage in loop_exit loop_ctl_mutex is only needed to iterate the IDR for removing the loop devices, so reduce the coverage. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210623145908.92973-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8b52d8be	23-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: reorder loop_exit Unregister the misc and blockdevice first to prevent further access, and only then iterate to remove the devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210623145908.92973-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0384264e	24-Jun-2021	Christoph Hellwig <hch@lst.de>	block: pass a gendisk to bdev_disk_changed bdev_disk_changed can only operate on whole devices. Make that clear by passing a gendisk instead of the struct block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210624123240.441814-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 2b9ac22b	18-Jun-2021	Kristian Klausen <kristian@klausen.dk>	loop: Fix missing discard support when using LOOP_CONFIGURE Without calling loop_config_discard() the discard flag and parameters aren't set/updated for the loop device and worst-case they could indicate discard support when it isn't the case (ex: if the LOOP_SET_STATUS ioctl was used with a different file prior to LOOP_CONFIGURE). Cc: <stable@vger.kernel.org> # 5.8.x- Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl") Signed-off-by: Kristian Klausen <kristian@klausen.dk> Link: https://lore.kernel.org/r/20210618115157.31452-1-kristian@klausen.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6a03cd98	16-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: fix order of cleaning up the queue and freeing the tagset We must release the queue before freeing the tagset. Fixes: 1c99502fae35 ("loop: use blk_mq_alloc_disk and blk_cleanup_disk") Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1c99502f	02-Jun-2021	Christoph Hellwig <hch@lst.de>	loop: use blk_mq_alloc_disk and blk_cleanup_disk Use blk_mq_alloc_disk and blk_cleanup_disk to simplify the gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210602065345.355274-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a8698707	25-May-2021	Christoph Hellwig <hch@lst.de>	block: move bd_mutex to struct gendisk Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c74d40e8	28-Jun-2021	Dan Schatzberg <schatzberg.dan@gmail.com>	loop: charge i/o to mem and blk cg The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 87579e9b	28-Jun-2021	Dan Schatzberg <schatzberg.dan@gmail.com>	loop: use worker per cgroup instead of kworker Patch series "Charge loop device i/o to issuing cgroup", v14. The loop device runs all i/o to the backing file on a separate kworker thread which results in all i/o being charged to the root cgroup. This allows a loop device to be used to trivially bypass resource limits and other policy. This patch series fixes this gap in accounting. A simple script to demonstrate this behavior on cgroupv2 machine: ''' #!/bin/bash set -e CGROUP=/sys/fs/cgroup/test.slice LOOP_DEV=/dev/loop0 if [[ ! -d $CGROUP ]] then sudo mkdir $CGROUP fi grep oom_kill $CGROUP/memory.events # Set a memory limit, write more than that limit to tmpfs -> OOM kill sudo unshare -m bash -c " echo \$\$ > $CGROUP/cgroup.procs; echo 0 > $CGROUP/memory.swap.max; echo 64M > $CGROUP/memory.max; mount -t tmpfs -o size=512m tmpfs /tmp; dd if=/dev/zero of=/tmp/file bs=1M count=256" \|\| true grep oom_kill $CGROUP/memory.events # Set a memory limit, write more than that limit through loopback # device -> no OOM kill sudo unshare -m bash -c " echo \$\$ > $CGROUP/cgroup.procs; echo 0 > $CGROUP/memory.swap.max; echo 64M > $CGROUP/memory.max; mount -t tmpfs -o size=512m tmpfs /tmp; truncate -s 512m /tmp/backing_file losetup $LOOP_DEV /tmp/backing_file dd if=/dev/zero of=$LOOP_DEV bs=1M count=256; losetup -D $LOOP_DEV" \|\| true grep oom_kill $CGROUP/memory.events ''' Naively charging cgroups could result in priority inversions through the single kworker thread in the case where multiple cgroups are reading/writing to the same loop device. This patch series does some minor modification to the loop driver so that each cgroup can make forward progress independently to avoid this inversion. With this patch series applied, the above script triggers OOM kills when writing through the loop device as expected. This patch (of 3): Existing uses of loop device may have multiple cgroups reading/writing to the same device. Simply charging resources for I/O to the backing file could result in priority inversion where one cgroup gets synchronously blocked, holding up all other I/O to the loop device. In order to avoid this priority inversion, we use a single workqueue where each work item is a "struct loop_worker" which contains a queue of struct loop_cmds to issue. The loop device maintains a tree mapping blk css_id -> loop_worker. This allows each cgroup to independently make forward progress issuing I/O to the backing file. There is also a single queue for I/O associated with the rootcg which can be used in cases of extreme memory shortage where we cannot allocate a loop_worker. The locking for the tree and queues is fairly heavy handed - we acquire a per-loop-device spinlock any time either is accessed. The existing implementation serializes all I/O through a single thread anyways, so I don't believe this is any worse. [colin.king@canonical.com: fixes] Link: https://lkml.kernel.org/r/20210610173944.1203706-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20210610173944.1203706-2-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Chris Down <chris@chrisdown.name> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 990e7811	05-Jun-2021	Christoph Hellwig <hch@lst.de>	block: loop: fix deadlock between open and remove Commit c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk") adds disk->part0->bd_mutex in del_gendisk(), this way causes the following AB/BA deadlock between removing loop and opening loop: 1) loop_control_ioctl(LOOP_CTL_REMOVE) -> mutex_lock(&loop_ctl_mutex) -> del_gendisk -> mutex_lock(&disk->part0->bd_mutex) 2) blkdev_get_by_dev -> mutex_lock(&disk->part0->bd_mutex) -> lo_open -> mutex_lock(&loop_ctl_mutex) Add a new Lo_deleting state to remove the need for clearing ->private_data and thus holding loop_ctl_mutex in the ioctl LOOP_CTL_REMOVE path. Based on an analysis and earlier patch from Ming Lei <ming.lei@redhat.com>. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210605140950.5800-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4ee60ec1	06-May-2021	Matthew Wilcox (Oracle) <willy@infradead.org>	include: remove pagemap.h from blkdev.h My UEK-derived config has 1030 files depending on pagemap.h before this change. Afterwards, just 326 files need to be rebuilt when I touch pagemap.h. I think blkdev.h is probably included too widely, but untangling that dependency is harder and this solves my problem. x86 allmodconfig builds, but there may be implicit include problems on other architectures. Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm] Acked-by: Jens Axboe <axboe@kernel.dk> [block] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi] Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 4ceddce5	21-Feb-2021	Mauricio Faria de Oliveira <mfo@canonical.com>	loop: fix I/O error on fsync() in detached loop devices There's an I/O error on fsync() in a detached loop device if it has been previously attached. The issue is write cache is enabled in the attach path in loop_configure() but it isn't disabled in the detach path; thus it remains enabled in the block device regardless of whether it is attached or not. Now fsync() can get an I/O request that will just be failed later in loop_queue_rq() as device's state is not 'Lo_bound'. So, disable write cache in the detach path. Do so based on the queue flag, not the loop device flag for read-only (used to enable) as the queue flag can be changed via sysfs even on read-only loop devices (e.g., losetup -r.) Test-case: # DEV=/dev/loop7 # IMG=/tmp/image # truncate --size 1M $IMG # losetup $DEV $IMG # losetup -d $DEV Before: # strace -e fsync parted -s $DEV print 2>&1 \| grep fsync fsync(3) = -1 EIO (Input/output error) Warning: Error fsyncing/closing /dev/loop7: Input/output error [ 982.529929] blk_update_request: I/O error, dev loop7, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0 After: # strace -e fsync parted -s $DEV print 2>&1 \| grep fsync fsync(3) = 0 Co-developed-by: Eric Desrochers <eric.desrochers@canonical.com> Signed-off-by: Eric Desrochers <eric.desrochers@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Tested-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6f24784f	31-Jan-2021	Al Viro <viro@zeniv.linux.org.uk>	whack-a-mole: don't open-code iminor/imajor several instances creeped back into the tree... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 6cc8e743	26-Jan-2021	Pavel Tatashin <pasha.tatashin@soleen.com>	loop: scale loop device by introducing per device lock Currently, loop device has only one global lock: loop_ctl_mutex. This becomes hot in scenarios where many loop devices are used. Scale it by introducing per-device lock: lo_mutex that protects modifications of all fields in struct loop_device. Keep loop_ctl_mutex to protect global data: loop_index_idr, loop_lookup, loop_add. The new lock ordering requirement is that loop_ctl_mutex must be taken before lo_mutex. Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# aeb2b0b1	11-Dec-2020	Lukas Bulwahn <lukas.bulwahn@gmail.com>	block: drop dead assignments in loop_init() Commit 8410d38c2552 ("loop: use __register_blkdev to allocate devices on demand") simplified loop_init(); so computing the range of the block region is not required anymore and can be dropped. Drop dead assignments in loop_init(). As compilers will detect these unneeded assignments and optimize this, the resulting object code is identical before and after this change. No functional change. No change in object code. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a782483c	26-Nov-2020	Christoph Hellwig <hch@lst.de>	block: remove the nr_sects field in struct hd_struct Now that the hd_struct always has a block device attached to it, there is no need for having two size field that just get out of sync. Additionally the field in hd_struct did not use proper serialization, possibly allowing for torn writes. By only using the block_device field this problem also gets fixed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 37c3fc9a	25-Nov-2020	Christoph Hellwig <hch@lst.de>	block: simplify the block device claiming interface Stop passing the whole device as a separate argument given that it can be trivially deducted and cleanup the !holder debug check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a954ea81	23-Nov-2020	Christoph Hellwig <hch@lst.de>	block: remove ->bd_contains Now that each hd_struct has a reference to the corresponding block_device, there is no need for the bd_contains pointer. Add a bdev_whole() helper to look up the whole device block_device struture instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4e7b5671	23-Nov-2020	Christoph Hellwig <hch@lst.de>	block: remove i_bdev Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f46f2a31	16-Nov-2020	Christoph Hellwig <hch@lst.de>	loop: do not call set_blocksize set_blocksize is used by file systems to use their preferred buffer cache block size. Block drivers should not set it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 449f4ec9	16-Nov-2020	Christoph Hellwig <hch@lst.de>	block: remove the update_bdev parameter to set_capacity_revalidate_and_notify The update_bdev argument is always set to true, so remove it. Also rename the function to the slighly less verbose set_capacity_and_notify, as propagating the disk size to the block device isn't really revalidation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3b4f85d0	16-Nov-2020	Christoph Hellwig <hch@lst.de>	loop: let set_capacity_revalidate_and_notify update the bdev size There is no good reason to call revalidate_disk_size separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8410d38c	29-Oct-2020	Christoph Hellwig <hch@lst.de>	loop: use __register_blkdev to allocate devices on demand Use the simpler mechanism attached to major_name to allocate a brd device when a currently unregistered minor is accessed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7a2f0ce1	03-Nov-2020	Christoph Hellwig <hch@lst.de>	loop: use set_disk_ro Use set_disk_ro instead of set_device_ro to match all other block drivers and to ensure all partitions mirror the read-only flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c01a21b7	12-Nov-2020	Petr Vorel <pvorel@suse.cz>	loop: Fix occasional uevent drop Commit 716ad0986cbd ("loop: Switch to set_capacity_revalidate_and_notify") causes an occasional drop of loop device uevent, which are no longer triggered in loop_set_size() but in a different part of code. Bug is reproducible with LTP test uevent01 [1]: i=0; while true; do i=$((i+1)); echo "== $i ==" lsmod \|grep -q loop && rmmod -f loop ./uevent01 \|\| break done Put back triggering through code called in loop_set_size(). Fix required to add yet another parameter to set_capacity_revalidate_and_notify(). [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/uevents/uevent01.c [hch: rebased on a different change to the prototype of set_capacity_revalidate_and_notify] Cc: stable@vger.kernel.org # v5.9 Fixes: 716ad0986cbd ("loop: Switch to set_capacity_revalidate_and_notify") Reported-by: <ltp@lists.linux.it> Signed-off-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 611bee52	23-Aug-2020	Christoph Hellwig <hch@lst.de>	block: replace bd_set_size with bd_set_nr_sectors Replace bd_set_size with a version that takes the number of sectors instead, as that fits most of the current and future callers much better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 79e5dc59	25-Aug-2020	Martijn Coenen <maco@android.com>	loop: Set correct device size when using LOOP_CONFIGURE The device size calculation was done before processing the loop configuration, which meant that the we set the size on the underlying block device incorrectly in case lo_offset/lo_sizelimit were set in the configuration. Delay computing the size until we've setup the device parameters correctly. Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl") Reported-by: Lennart Poettering <mzxreary@0pointer.de> Tested-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# df561f66	23-Aug-2020	Gustavo A. R. Silva <gustavoars@kernel.org>	treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
# bcb21c8c	17-Aug-2020	Ming Lei <ming.lei@redhat.com>	block: loop: set discard granularity and alignment for block device backed loop In case of block device backend, if the backend supports write zeros, the loop device will set queue flag of QUEUE_FLAG_DISCARD. However, limits.discard_granularity isn't setup, and this way is wrong, see the following description in Documentation/ABI/testing/sysfs-block: A discard_granularity of 0 means that the device does not support discard functionality. Especially 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()") starts to take q->limits.discard_granularity for computing max discard sectors. And zero discard granularity may cause kernel oops, or fail discard request even though the loop queue claims discard support via QUEUE_FLAG_DISCARD. Fix the issue by setup discard granularity and alignment. Fixes: c52abf563049 ("loop: Better discard support for block devices") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Xiao Ni <xni@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Evan Green <evgreen@chromium.org> Cc: Gwendal Grignou <gwendal@chromium.org> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# fe6a8fc5	10-Aug-2020	Lennart Poettering <mzxreary@0pointer.de>	loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE When LOOP_CONFIGURE is used with LO_FLAGS_PARTSCAN we need to propagate this into the GENHD_FL_NO_PART_SCAN. LOOP_SETSTATUS does this, LOOP_CONFIGURE doesn't so far. Effect is that setting up a loopback device with partition scanning doesn't actually work when LOOP_CONFIGURE is issued, though it works fine with LOOP_SETSTATUS. Let's correct that and propagate the flag in LOOP_CONFIGURE too. Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl") Signed-off-by: Lennart Poettering <lennart@poettering.net> Acked-by: Martijn Coenen <maco@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ecbe6bc0	16-Jul-2020	Christoph Hellwig <hch@lst.de>	block: use bd_prepare_to_claim directly in the loop driver The arcane magic in bd_start_claiming is only needed to be able to claim a block_device that hasn't been fully set up. Switch the loop driver that claims from the ioctl path with a fully set up struct block_device to just use the much simpler bd_prepare_to_claim directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 200f9337	19-Jun-2020	Luis Chamberlain <mcgrof@kernel.org>	loop: be paranoid on exit and prevent new additions / removals Be pedantic on removal as well and hold the mutex. This should prevent uses of addition while we exit. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 15f73f5b	11-Jun-2020	Christoph Hellwig <hch@lst.de>	blk-mq: move failure injection out of blk_mq_complete_request Move the call to blk_should_fake_timeout out of blk_mq_complete_request and into the drivers, skipping call sites that are obvious error handlers, and remove the now superflous blk_mq_force_complete_rq helper. This ensures we don't keep injecting errors into completions that just terminate the Linux request after the hardware has been reset or the command has been aborted. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f4bd34b1	17-Jun-2020	Zheng Bin <zhengbin13@huawei.com>	loop: replace kill_bdev with invalidate_bdev When a filesystem is mounted on a loop device and on a loop ioctl LOOP_SET_STATUS64, because of kill_bdev, buffer_head mappings are getting destroyed. kill_bdev truncate_inode_pages truncate_inode_pages_range do_invalidatepage block_invalidatepage discard_buffer -->clear BH_Mapped flag sb_bread __bread_gfp bh = __getblk_gfp -->discard_buffer clear BH_Mapped flag __bread_slow submit_bh submit_bh_wbc BUG_ON(!buffer_mapped(bh)) --> hit this BUG_ON Fixes: 5db470e229e2 ("loop: drop caches if offset or block_size are changed") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6ac92fb5	04-Jun-2020	Martijn Coenen <maco@android.com>	loop: Fix wrong masking of status flags In faf1d25440d6, loop_set_status() now assigns lo_status directly from the passed in lo_flags, but then fixes it up by masking out flags that can't be set by LOOP_SET_STATUS; unfortunately the mask was negated. Re-ran all ltp ioctl_loop tests, and they all passed. Pass run of the previously failing one: tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s tst_device.c:88: INFO: Found free device 0 '/dev/loop0' ioctl_loop01.c:49: PASS: /sys/block/loop0/loop/partscan = 0 ioctl_loop01.c:50: PASS: /sys/block/loop0/loop/autoclear = 0 ioctl_loop01.c:51: PASS: /sys/block/loop0/loop/backing_file = '/tmp/ZRJ6H4/test.img' ioctl_loop01.c:65: PASS: get expected lo_flag 12 ioctl_loop01.c:67: PASS: /sys/block/loop0/loop/partscan = 1 ioctl_loop01.c:68: PASS: /sys/block/loop0/loop/autoclear = 1 ioctl_loop01.c:77: PASS: access /dev/loop0p1 succeeds ioctl_loop01.c:83: PASS: access /sys/block/loop0/loop0p1 succeeds Summary: passed 8 failed 0 skipped 0 warnings 0 Fixes: faf1d25440d6 ("loop: Clean up LOOP_SET_STATUS lo_flags handling") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Martijn Coenen <maco@android.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a37b0715	01-Jun-2020	NeilBrown <neilb@suse.de>	mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a larger dirty page allowance, but that it needs a private dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the daemon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does not change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. [akpm@linux-foundation.org: coding style fixes] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd] Cc: Christoph Hellwig <hch@lst.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# bf0beec0	29-May-2020	Ming Lei <ming.lei@redhat.com>	blk-mq: drain I/O when all CPUs in a hctx are offline Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup up queue mapping. Thomas mentioned the following point[1]: "That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again." However, current blk-mq implementation doesn't quiesce hw queue before the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so there isn't any chance to quiesce the hctx before shutting down the CPU. Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs where the last CPU goes away, and wait for completion of in-flight requests. This guarantees that there is no inflight I/O before shutting down the managed IRQ. Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need to wait for completion of in-flight requests from these drivers to avoid a potential dead-lock. It is safe to do this for stacking drivers as those do not use interrupts at all and their I/O completions are triggered by underlying devices I/O completion. [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [hch: different retry mechanism, merged two patches, minor cleanups] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d29b92f5	24-May-2020	Colin Ian King <colin.king@canonical.com>	loop: remove redundant assignment to variable error The variable error is being assigned a value that is never read so the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3448914e	13-May-2020	Martijn Coenen <maco@android.com>	loop: Add LOOP_CONFIGURE ioctl This allows userspace to completely setup a loop device with a single ioctl, removing the in-between state where the device can be partially configured - eg the loop device has a backing file associated with it, but is reading from the wrong offset. Besides removing the intermediate state, another big benefit of this ioctl is that LOOP_SET_STATUS can be slow; the main reason for this slowness is that LOOP_SET_STATUS(64) calls blk_mq_freeze_queue() to freeze the associated queue; this requires waiting for RCU synchronization, which I've measured can take about 15-20ms on this device on average. In addition to doing what LOOP_SET_STATUS can do, LOOP_CONFIGURE can also be used to: - Set the correct block size immediately by setting loop_config.block_size (avoids LOOP_SET_BLOCK_SIZE) - Explicitly request direct I/O mode by setting LO_FLAGS_DIRECT_IO in loop_config.info.lo_flags (avoids LOOP_SET_DIRECT_IO) - Explicitly request read-only mode by setting LO_FLAGS_READ_ONLY in loop_config.info.lo_flags Here's setting up ~70 regular loop devices with an offset on an x86 Android device, using LOOP_SET_FD and LOOP_SET_STATUS: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m03.40s real 0m00.02s user 0m00.03s system Here's configuring ~70 devices in the same way, but using a modified losetup that uses the new LOOP_CONFIGURE ioctl: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m01.94s real 0m00.01s user 0m00.01s system Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# faf1d254	13-May-2020	Martijn Coenen <maco@android.com>	loop: Clean up LOOP_SET_STATUS lo_flags handling LOOP_SET_STATUS(64) will actually allow some lo_flags to be modified; in particular, LO_FLAGS_AUTOCLEAR can be set and cleared, whereas LO_FLAGS_PARTSCAN can be set to request a partition scan. Make this explicit by updating the UAPI to include the flags that can be set/cleared using this ioctl. The implementation can then blindly take over the passed in flags, and use the previous flags for those flags that can't be set / cleared using LOOP_SET_STATUS. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 571fae6e	13-May-2020	Martijn Coenen <maco@android.com>	loop: Rework lo_ioctl() __user argument casting In preparation for a new ioctl that needs to copy_from_user(); makes the code easier to read as well. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 62ab466c	13-May-2020	Martijn Coenen <maco@android.com>	loop: Move loop_set_status_from_info() and friends up So we can use it without forward declaration. This is a separate commit to make it easier to verify that this is just a move, without functional modifications. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0c3796c2	13-May-2020	Martijn Coenen <maco@android.com>	loop: Factor out configuring loop from status Factor out this code into a separate function, so it can be reused by other code more easily. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0a6ed1b5	13-May-2020	Martijn Coenen <maco@android.com>	loop: Remove figure_loop_size() This function was now only used by loop_set_capacity(). Just open code the remaining code in the caller instead. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b0bd158d	13-May-2020	Martijn Coenen <maco@android.com>	loop: Refactor loop_set_status() size calculation figure_loop_size() calculates the loop size based on the passed in parameters, but at the same time it updates the offset and sizelimit parameters in the loop device configuration. That is a somewhat unexpected side effect of a function with this name, and it is only only needed by one of the two callers of this function - loop_set_status(). Move the lo_offset and lo_sizelimit assignment back into loop_set_status(), and use the newly factored out functions to validate and apply the newly calculated size. This allows us to get rid of figure_loop_size() in a follow-up commit. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 716ad098	13-May-2020	Martijn Coenen <maco@android.com>	loop: Switch to set_capacity_revalidate_and_notify() This was recently added to block/genhd.c, and takes care of both updating the capacity and notifying userspace of the new size. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 5795b6f5	13-May-2020	Martijn Coenen <maco@android.com>	loop: Factor out setting loop device size This code is used repeatedly. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 083a6a50	13-May-2020	Martijn Coenen <maco@android.com>	loop: Remove sector_t truncation checks sector_t is now always u64, so we don't need to check for truncation. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7c5014b0	13-May-2020	Martijn Coenen <maco@android.com>	loop: Call loop_config_discard() only after new config is applied loop_set_status() calls loop_config_discard() to configure discard for the loop device; however, the discard configuration depends on whether the loop device uses encryption, and when we call it the encryption configuration has not been updated yet. Move the call down so we apply the correct discard configuration based on the new configuration. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# efbe3c24	30-Apr-2020	Ira Weiny <ira.weiny@intel.com>	fs: Remove unneeded IS_DAX() check in io_is_direct() Remove the check because DAX now has it's own read/write methods and file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on their own. Therefore, it does not matter if the file state is DAX when the iocb flags are created. Also remove io_is_direct() as it is just a simple flag check. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
# c52abf56	03-Apr-2020	Evan Green <evgreen@chromium.org>	loop: Better discard support for block devices If the backing device for a loop device is itself a block device, then mirror the "write zeroes" capabilities of the underlying block device into the loop device. Copy this capability into both max_write_zeroes_sectors and max_discard_sectors of the loop device. The reason for this is that REQ_OP_DISCARD on a loop device translates into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This presents a consistent interface for loop devices (that discarded data is zeroed), regardless of the backing device type of the loop device. There should be no behavior change for loop devices backed by regular files. This change fixes blktest block/003, and removes an extraneous error print in block/013 when testing on a loop device backed by a block device that does not support discard. Signed-off-by: Evan Green <evgreen@chromium.org> Reviewed-by: Gwendal Grignou <gwendal@chromium.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [used updated version of Evan's comment in loop_config_discard()] [moved backingq to local scope, removed redundant braces] Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@collabora.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8cd55087	03-Apr-2020	Evan Green <evgreen@chromium.org>	loop: Report EOPNOTSUPP properly Properly plumb out EOPNOTSUPP from loop driver operations, which may get returned when for instance a discard operation is attempted but not supported by the underlying block device. Before this change, everything was reported in the log as an I/O error, which is scary and not helpful in debugging. Signed-off-by: Evan Green <evgreen@chromium.org> Reviewed-by: Gwendal Grignou <gwendal@chromium.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@collabora.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0fbcf579	10-Mar-2020	Martijn Coenen <maco@android.com>	loop: Only freeze block queue when needed. __loop_update_dio() can be called as a part of loop_set_fd(), when the block queue is not yet up and running; avoid freezing the block queue in that case, since that is an expensive operation. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7e81f99a	10-Mar-2020	Martijn Coenen <maco@android.com>	loop: Only change blocksize when needed. Return early in loop_set_block_size() if the requested block size is identical to the one we already have; this avoids expensive calls to freeze the block queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f0b870df	14-Nov-2019	Christoph Hellwig <hch@lst.de>	block: remove (__)blkdev_reread_part as an exported API In general drivers should never mess with partition tables directly. Unfortunately s390 and loop do for somewhat historic reasons, but they can use bdev_disk_changed directly instead when we export it as they satisfy the sanity checks we have in __blkdev_reread_part. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd] Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# efcfec57	30-Oct-2019	Darrick J. Wong <darrick.wong@oracle.com>	loop: fix no-unmap write-zeroes request behavior Currently, if the loop device receives a WRITE_ZEROES request, it asks the underlying filesystem to punch out the range. This behavior is correct if unmapping is allowed. However, a NOUNMAP request means that the caller doesn't want us to free the storage backing the range, so punching out the range is incorrect behavior. To satisfy a NOUNMAP \| WRITE_ZEROES request, loop should ask the underlying filesystem to FALLOC_FL_ZERO_RANGE, which is (according to the fallocate documentation) required to ensure that the entire range is backed by real storage, which suffices for our purposes. Fixes: 19372e2769179dd ("loop: implement REQ_OP_WRITE_ZEROES") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 85560117	04-Sep-2019	Martijn Coenen <maco@android.com>	loop: change queue block size to match when using DIO The loop driver assumes that if the passed in fd is opened with O_DIRECT, the caller wants to use direct I/O on the loop device. However, if the underlying block device has a different block size than the loop block queue, direct I/O can't be enabled. Instead of requiring userspace to manually change the blocksize and re-enable direct I/O, just change the queue block sizes to match, as well as the io_min size. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# fdbe4eee	06-Aug-2019	Alessio Balsini <balsini@android.com>	loop: Add LOOP_SET_DIRECT_IO to compat ioctl Enabling Direct I/O with loop devices helps reducing memory usage by avoiding double caching. 32 bit applications running on 64 bits systems are currently not able to request direct I/O because is missing from the lo_compat_ioctl. This patch fixes the compatibility issue mentioned above by exporting LOOP_SET_DIRECT_IO as additional lo_compat_ioctl() entry. The input argument for this ioctl is a single long converted to a 1-bit boolean, so compatibility is preserved. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Alessio Balsini <balsini@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d0a255e7	08-Aug-2019	Mikulas Patocka <mpatocka@redhat.com>	loop: set PF_MEMALLOC_NOIO for the worker thread A deadlock with this stacktrace was observed. The loop thread does a GFP_KERNEL allocation, it calls into dm-bufio shrinker and the shrinker depends on I/O completion in the dm-bufio subsystem. In order to fix the deadlock (and other similar ones), we set the flag PF_MEMALLOC_NOIO at loop thread entry. PID: 474 TASK: ffff8813e11f4600 CPU: 10 COMMAND: "kswapd0" #0 [ffff8813dedfb938] __schedule at ffffffff8173f405 #1 [ffff8813dedfb990] schedule at ffffffff8173fa27 #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186 #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8 #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81 #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio] #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio] #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio] #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778 #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f #13 [ffff8813dedfbec0] kthread at ffffffff810a8428 #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242 PID: 14127 TASK: ffff881455749c00 CPU: 11 COMMAND: "loop1" #0 [ffff88272f5af228] __schedule at ffffffff8173f405 #1 [ffff88272f5af280] schedule at ffffffff8173fa27 #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5 #4 [ffff88272f5af330] mutex_lock at ffffffff81742133 #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio] #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778 #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34 #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8 #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3 #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71 #12 [ffff88272f5af760] new_slab at ffffffff811f4523 #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5 #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3 #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3 #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs] #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994 #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs] #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop] #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop] #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c #23 [ffff88272f5afec0] kthread at ffffffff810a8428 #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 89e524c0	30-Jul-2019	Jan Kara <jack@suse.cz>	loop: Fix mount(2) failure due to race with LOOP_SET_FD Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") made LOOP_SET_FD ioctl acquire exclusive block device reference while it updates loop device binding. However this can make perfectly valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding temporarily the exclusive bdev reference in cases like this: for i in {a..z}{a..z}; do dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024 mkfs.ext2 $i.image mkdir mnt$i done echo "Run" for i in {a..z}{a..z}; do mount -o loop -t ext2 $i.image mnt$i & done Fix the problem by not getting full exclusive bdev reference in LOOP_SET_FD but instead just mark the bdev as being claimed while we update the binding information. This just blocks new exclusive openers instead of failing them with EBUSY thus fixing the problem. Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") Cc: stable@vger.kernel.org Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b6207430	26-Jun-2019	Christoph Hellwig <hch@lst.de>	block: never take page references for ITER_BVEC If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 33ec3e53	16-May-2019	Jan Kara <jack@suse.cz>	loop: Don't change loop device under exclusive opener Loop module allows calling LOOP_SET_FD while there are other openers of the loop device. Even exclusive ones. This can lead to weird consequences such as kernel deadlocks like: mount_bdev() lo_ioctl() udf_fill_super() udf_load_vrs() sb_set_blocksize() - sets desired block size B udf_tread() sb_bread() __bread_gfp(bdev, block, B) loop_set_fd() set_blocksize() - now __getblk_slow() indefinitely loops because B != bdev block size Fix the problem by disallowing LOOP_SET_FD ioctl when there are exclusive openers of a loop device. [Deliberately chosen not to CC stable as a user with priviledges to trigger this race has other means of taking the system down and this has a potential of breaking some weird userspace setup] Reported-and-tested-by: syzbot+10007d66ca02b08f0e60@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 81ba6abd	27-Mar-2019	Ming Lei <ming.lei@redhat.com>	block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF loop is one block device, for any bio submitted to this device, the upper layer does guarantee that pages added to loop's bio won't go away when the bio is in-flight. So mark loop's bvec as ITER_BVEC_FLAG_NO_REF then get_page/put_page can be saved for serving loop's IO. Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 56a85fd8	12-Feb-2019	Holger Hoffstätte <holger.hoffstaette@googlemail.com>	loop: properly observe rotational flag of underlying device The loop driver always declares the rotational flag of its device as rotational, even when the device of the mapped file is nonrotational, as is the case with SSDs or on tmpfs. This can confuse filesystem tools which are SSD-aware; in my case I frequently forget to tell mkfs.btrfs that my loop device on tmpfs is nonrotational, and that I really don't need any automatic metadata redundancy. The attached patch fixes this by introspecting the rotational flag of the mapped file's underlying block device, if it exists. If the mapped file's filesystem has no associated block device - as is the case on e.g. tmpfs - we assume nonrotational storage. If there is a better way to identify such non-devices I'd love to hear them. Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: holger@applied-asynchrony.com Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Gwendal Grignou <gwendal@chromium.org> Signed-off-by: Benjamin Gordon <bmgordon@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f7c8a412	18-Mar-2019	Dongli Zhang <dongli.zhang@oracle.com>	loop: access lo_backing_file only when the loop device is Lo_bound Commit 758a58d0bc67 ("loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()") separates "lo->lo_backing_file = NULL" and "lo->lo_state = Lo_unbound" into different critical regions protected by loop_ctl_mutex. However, there is below race that the NULL lo->lo_backing_file would be accessed when the backend of a loop is another loop device, e.g., loop0's backend is a file, while loop1's backend is loop0. loop0's backend is file loop1's backend is loop0 __loop_clr_fd() mutex_lock(&loop_ctl_mutex); lo->lo_backing_file = NULL; --> set to NULL mutex_unlock(&loop_ctl_mutex); loop_set_fd() mutex_lock_killable(&loop_ctl_mutex); loop_validate_file() f = l->lo_backing_file; --> NULL access if loop0 is not Lo_unbound mutex_lock(&loop_ctl_mutex); lo->lo_state = Lo_unbound; mutex_unlock(&loop_ctl_mutex); lo->lo_backing_file should be accessed only when the loop device is Lo_bound. In fact, the problem has been introduced already in commit 7ccd0791d985 ("loop: Push loop_ctl_mutex down into loop_clr_fd()") after which loop_validate_file() could see devices in Lo_rundown state with which it did not count. It was harmless at that point but still. Fixes: 7ccd0791d985 ("loop: Push loop_ctl_mutex down into loop_clr_fd()") Reported-by: syzbot+9bdc1adc1c55e7fe765b@syzkaller.appspotmail.com Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 758a58d0	22-Feb-2019	Dongli Zhang <dongli.zhang@oracle.com>	loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() Commit 0da03cab87e6 ("loop: Fix deadlock when calling blkdev_reread_part()") moves blkdev_reread_part() out of the loop_ctl_mutex. However, GENHD_FL_NO_PART_SCAN is set before __blkdev_reread_part(). As a result, __blkdev_reread_part() will fail the check of GENHD_FL_NO_PART_SCAN and will not rescan the loop device to delete all partitions. Below are steps to reproduce the issue: step1 # dd if=/dev/zero of=tmp.raw bs=1M count=100 step2 # losetup -P /dev/loop0 tmp.raw step3 # parted /dev/loop0 mklabel gpt step4 # parted -a none -s /dev/loop0 mkpart primary 64s 1 step5 # losetup -d /dev/loop0 Step5 will not be able to delete /dev/loop0p1 (introduced by step4) and there is below kernel warning message: [ 464.414043] __loop_clr_fd: partition scan of loop0 failed (rc=-22) This patch sets GENHD_FL_NO_PART_SCAN after blkdev_reread_part(). Fixes: 0da03cab87e6 ("loop: Fix deadlock when calling blkdev_reread_part()") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 40853d6f	22-Feb-2019	Dongli Zhang <dongli.zhang@oracle.com>	loop: do not print warn message if partition scan is successful Do not print warn message when the partition scan returns 0. Fixes: d57f3374ba48 ("loop: Move special partition reread handling in loop_clr_fd()") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 56d18f62	15-Feb-2019	Ming Lei <ming.lei@redhat.com>	block: kill BLK_MQ_F_SG_MERGE QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 86af5952	15-Feb-2019	Ming Lei <ming.lei@redhat.com>	block: loop: pass multi-page bvec to iov_iter iov_iter is implemented on bvec itererator helpers, so it is safe to pass multi-page bvec to it, and this way is much more efficient than passing one page in each bvec. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 5db470e2	09-Jan-2019	Jaegeuk Kim <jaegeuk@kernel.org>	loop: drop caches if offset or block_size are changed If we don't drop caches used in old offset or block_size, we can get old data from new offset/block_size, which gives unexpected data to user. For example, Martijn found a loopback bug in the below scenario. 1) LOOP_SET_FD loads first two pages on loop file 2) LOOP_SET_STATUS64 changes the offset on the loop file 3) mount is failed due to the cached pages having wrong superblock Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Reported-by: Martijn Coenen <maco@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c4110369	22-Dec-2018	Chengguang Xu <cgxu519@gmx.com>	block: loop: remove redundant code Code cleanup for removing redundant break in switch case. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 38a3499f	16-Dec-2018	Chengguang Xu <cgxu519@gmx.com>	block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add() blk_mq_init_queue() will not return NULL pointer to its caller, so it's better to replace IS_ERR_OR_NULL using IS_ERR in loop_add(). If in the future things change to check NULL pointer inside loop_add(), we should return -ENOMEM as return code instead of PTR_ERR(NULL). Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# db6638d7	04-Dec-2018	Dennis Zhou <dennis@kernel.org>	blkcg: remove bio->bi_css and instead use bio->bi_blkg Prior patches ensured that any bio that interacts with a request_queue is properly associated with a blkg. This makes bio->bi_css unnecessary as blkg maintains a reference to blkcg already. This removes the bio field bi_css and transfers corresponding uses to access via bi_blkg. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 628bd859	12-Nov-2018	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	loop: Fix double mutex_unlock(&loop_ctl_mutex) in loop_control_ioctl() Commit 0a42e99b58a20883 ("loop: Get rid of loop_index_mutex") forgot to remove mutex_unlock(&loop_ctl_mutex) from loop_control_ioctl() when replacing loop_index_mutex with loop_ctl_mutex. Fixes: 0a42e99b58a20883 ("loop: Get rid of loop_index_mutex") Reported-by: syzbot <syzbot+c0138741c2290fc5e63f@syzkaller.appspotmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c28445fa	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Get rid of 'nested' acquisition of loop_ctl_mutex The nested acquisition of loop_ctl_mutex (->lo_ctl_mutex back then) has been introduced by commit f028f3b2f987e "loop: fix circular locking in loop_clr_fd()" to fix lockdep complains about bd_mutex being acquired after lo_ctl_mutex during partition rereading. Now that these are properly fixed, let's stop fooling lockdep. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1dded9ac	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Avoid circular locking dependency between loop_ctl_mutex and bd_mutex Code in loop_change_fd() drops reference to the old file (and also the new file in a failure case) under loop_ctl_mutex. Similarly to a situation in loop_set_fd() this can create a circular locking dependency if this was the last reference holding the file open. Delay dropping of the file reference until we have released loop_ctl_mutex. Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0da03cab	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Fix deadlock when calling blkdev_reread_part() Calling blkdev_reread_part() under loop_ctl_mutex causes lockdep to complain about circular lock dependency between bdev->bd_mutex and lo->lo_ctl_mutex. The problem is that on loop device open or close lo_open() and lo_release() get called with bdev->bd_mutex held and they need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is called with loop_ctl_mutex held, it will call blkdev_reread_part() which acquires bdev->bd_mutex. See syzbot report for details [1]. Move call to blkdev_reread_part() in __loop_clr_fd() from under loop_ctl_mutex to finish fixing of the lockdep warning and the possible deadlock. [1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d1588 Reported-by: syzbot <syzbot+4684a000d5abdade83fac55b1e7d1f935ef1936e@syzkaller.appspotmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 85b0a54a	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Move loop_reread_partitions() out of loop_ctl_mutex Calling loop_reread_partitions() under loop_ctl_mutex causes lockdep to complain about circular lock dependency between bdev->bd_mutex and lo->lo_ctl_mutex. The problem is that on loop device open or close lo_open() and lo_release() get called with bdev->bd_mutex held and they need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is called with loop_ctl_mutex held, it will call blkdev_reread_part() which acquires bdev->bd_mutex. See syzbot report for details [1]. Move all calls of loop_rescan_partitions() out of loop_ctl_mutex to avoid lockdep warning and fix deadlock possibility. [1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d1588 Reported-by: syzbot <syzbot+4684a000d5abdade83fac55b1e7d1f935ef1936e@syzkaller.appspotmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d57f3374	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Move special partition reread handling in loop_clr_fd() The call of __blkdev_reread_part() from loop_reread_partition() happens only when we need to invalidate partitions from loop_release(). Thus move a detection for this into loop_clr_fd() and simplify loop_reread_partition(). This makes loop_reread_partition() safe to use without loop_ctl_mutex because we use only lo->lo_number and lo->lo_file_name in case of error for reporting purposes (thus possibly reporting outdate information is not a big deal) and we are safe from 'lo' going away under us by elevated lo->lo_refcnt. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c3710770	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Push loop_ctl_mutex down to loop_change_fd() Push loop_ctl_mutex down to loop_change_fd(). We will need this to be able to call loop_reread_partitions() without loop_ctl_mutex. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 757ecf40	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Push loop_ctl_mutex down to loop_set_fd() Push lo_ctl_mutex down to loop_set_fd(). We will need this to be able to call loop_reread_partitions() without lo_ctl_mutex. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 550df5fd	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Push loop_ctl_mutex down to loop_set_status() Push loop_ctl_mutex down to loop_set_status(). We will need this to be able to call loop_reread_partitions() without loop_ctl_mutex. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4a5ce9ba	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Push loop_ctl_mutex down to loop_get_status() Push loop_ctl_mutex down to loop_get_status() to avoid the unusual convention that the function gets called with loop_ctl_mutex held and releases it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7ccd0791	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Push loop_ctl_mutex down into loop_clr_fd() loop_clr_fd() has a weird locking convention that is expects loop_ctl_mutex held, releases it on success and keeps it on failure. Untangle the mess by moving locking of loop_ctl_mutex into loop_clr_fd(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a2505b79	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Split setting of lo_state from loop_clr_fd Move setting of lo_state to Lo_rundown out into the callers. That will allow us to unlock loop_ctl_mutex while the loop device is protected from other changes by its special state. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a1316544	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Push lo_ctl_mutex down into individual ioctls Push acquisition of lo_ctl_mutex down into individual ioctl handling branches. This is a preparatory step for pushing the lock down into individual ioctl handling functions so that they can release the lock as they need it. We also factor out some simple ioctl handlers that will not need any special handling to reduce unnecessary code duplication. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0a42e99b	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Get rid of loop_index_mutex Now that loop_ctl_mutex is global, just get rid of loop_index_mutex as there is no good reason to keep these two separate and it just complicates the locking. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 967d1dc1	08-Nov-2018	Jan Kara <jack@suse.cz>	loop: Fold __loop_release into loop_release __loop_release() has a single call site. Fold it there. This is currently not a huge win but it will make following replacement of loop_index_mutex more obvious. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 310ca162	08-Nov-2018	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	block/loop: Use global lock for ioctl() operation. syzbot is reporting NULL pointer dereference [1] which is caused by race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other loop devices at loop_validate_file() without holding corresponding lo->lo_ctl_mutex locks. Since ioctl() request on loop devices is not frequent operation, we don't need fine grained locking. Let's use global lock in order to allow safe traversal at loop_validate_file(). Note that syzbot is also reporting circular locking dependency between bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling blkdev_reread_part() with lock held. This patch does not address it. [1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3 [2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+bf89c128e05dd6c62523@syzkaller.appspotmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b1ab5fa3	08-Nov-2018	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	block/loop: Don't grab "struct file" for vfs_getattr() operation. vfs_getattr() needs "struct path" rather than "struct file". Let's use path_get()/path_put() rather than get_file()/fput(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b5f2954d	01-Nov-2018	Dennis Zhou <dennis@kernel.org>	blkcg: revert blkcg cleanups series This reverts a series committed earlier due to null pointer exception bug report in [1]. It seems there are edge case interactions that I did not consider and will need some time to understand what causes the adverse interactions. The original series can be found in [2] with a follow up series in [3]. [1] https://www.spinics.net/lists/cgroups/msg20719.html [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/ [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/ This reverts the following commits: d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae, f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef, a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53 Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# aa563d7b	19-Oct-2018	David Howells <dhowells@redhat.com>	iov_iter: Separate type from direction and use accessor functions In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: David Howells <dhowells@redhat.com>
# c839e7a0	11-Sep-2018	Dennis Zhou (Facebook) <dennisszhou@gmail.com>	blkcg: remove bio->bi_css and instead use bio->bi_blkg Prior patches ensured that all bios are now associated with some blkg. This now makes bio->bi_css unnecessary as blkg maintains a reference to the blkcg already. This patch removes the field bi_css and transfers corresponding uses to access via bi_blkg. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d893ff86	01-Jul-2018	Gustavo A. R. Silva <gustavo@embeddedor.com>	block/loop: mark expected switch fall-through In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f4354a94	02-Jul-2018	Colin Ian King <colin.king@canonical.com>	loop: remove redundant pointer inode Pointer inode is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'inode' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 9fea4b39	02-Jul-2018	Evan Green <evgreen@chromium.org>	loop: Add LOOP_SET_BLOCK_SIZE in compat ioctl This change adds LOOP_SET_BLOCK_SIZE as one of the supported ioctls in lo_compat_ioctl. It only takes an unsigned long argument, and in practice a 32-bit value works fine. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Evan Green <evgreen@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6da2ec56	12-Jun-2018	Kees Cook <keescook@chromium.org>	treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kmalloc( - sizeof(u8) * COUNT + COUNT , ...) \| kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| kmalloc( - sizeof(char) * COUNT + COUNT , ...) \| kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) \| kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) \| kmalloc(sizeof(TYPE) * C2, ...) \| kmalloc(C1 * C2 * C3, ...) \| kmalloc(C1 * C2, ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) \| - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) \| - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
# d2ac838e	07-May-2018	Theodore Ts'o <tytso@mit.edu>	loop: add recursion validation to LOOP_CHANGE_FD Refactor the validation code used in LOOP_SET_FD so it is also used in LOOP_CHANGE_FD. Otherwise it is possible to construct a set of loop devices that all refer to each other. This can lead to a infinite loop in starting with "while (is_loop_device(f)) .." in loop_set_fd(). Fix this by refactoring out the validation code and using it for LOOP_CHANGE_FD as well as LOOP_SET_FD. Reported-by: syzbot+4349872271ece473a7c91190b68b4bac7c5dbc87@syzkaller.appspotmail.com Reported-by: syzbot+40bd32c4d9a3cc12a339@syzkaller.appspotmail.com Reported-by: syzbot+769c54e66f994b041be7@syzkaller.appspotmail.com Reported-by: syzbot+0a89a9ce473936c57065@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d9a08a9e	22-May-2018	Adam Manzanares <adam.manzanares@wdc.com>	fs: Add aio iopriority support This is the per-I/O equivalent of the ioprio_set system call. When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. This patch depends on block: add ioprio_check_cap function. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 5657a819	24-May-2018	Joe Perches <joe@perches.com>	block drivers/block: Use octal not symbolic permissions Convert the S_<FOO> symbolic permissions to their octal equivalents as using octal and not symbolic permissions is preferred by many as more readable. see: https://lkml.org/lkml/2016/8/2/1945 Done with automated conversion via: $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...> Miscellanea: o Wrapped modified multi-line calls to a single line where appropriate o Realign modified multi-line calls to open parenthesis Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# eedffa28	21-May-2018	Jeff Layton <jlayton@kernel.org>	loop: clear wb_err in bd_inode when detaching backing file When a loop block device encounters a writeback error, that error will get propagated to the bd_inode's wb_err field. If we then detach the backing file from it, attach another and fsync it, we'll get back the writeback error that we had from the previous backing file. This is a bit of a grey area as POSIX doesn't cover loop devices, but it is somewhat counterintuitive. If we detach a backing file from the loopdev while there are still unreported errors, take it as a sign that we're no longer interested in the previous file, and clear out the wb_err in the loop blockdev. Reported-and-Tested-by: Theodore Y. Ts'o <tytso@mit.edu> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d3349b6b	04-May-2018	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	loop: remember whether sysfs_create_group() was done syzbot is hitting WARN() triggered by memory allocation fault injection [1] because loop module is calling sysfs_remove_group() when sysfs_create_group() failed. Fix this by remembering whether sysfs_create_group() succeeded. [1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739ff29@syzkaller.appspotmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Renamed sysfs_ready -> sysfs_inited. Signed-off-by: Jens Axboe <axboe@kernel.dk>
# f9de14bc	13-Apr-2018	Jens Axboe <axboe@kernel.dk>	loop: handle short DIO reads We ran into an issue with loop and btrfs, where btrfs would complain about checksum errors. It turns out that is because we don't handle short reads at all, we just zero fill the remainder. Worse than that, we don't handle the filling properly, which results in loop trying to advance a single bio by much more than its size, since it doesn't take chaining into account. Handle short reads appropriately, by simply retrying at the new correct offset. End the remainder of the request with EIO, if we get a 0 read. Fixes: bc07c10a3603 ("block: loop: support DIO & AIO") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1894e916	13-Apr-2018	Jens Axboe <axboe@kernel.dk>	loop: remove cmd->rq member We can always get at the request from the payload, no need to store a pointer to it. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# bdac616db	06-Apr-2018	Omar Sandoval <osandov@fb.com>	loop: fix LOOP_GET_STATUS lock imbalance Commit 2d1d4c1e591f made loop_get_status() drop lo_ctx_mutex before returning, but the loop_get_status_old(), loop_get_status64(), and loop_get_status_compat() wrappers don't call loop_get_status() if the passed argument is NULL. The callers expect that the lock is dropped, so make sure we drop it in that case, too. Reported-by: syzbot+31e8daa8b3fc129e75f2@syzkaller.appspotmail.com Fixes: 2d1d4c1e591f ("loop: don't call into filesystem while holding lo_ctl_mutex") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1e047eaa	05-Apr-2018	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	block/loop: fix deadlock after loop_set_status syzbot is reporting deadlocks at __blkdev_get() [1]. ---------------------------------------- [ 92.493919] systemd-udevd D12696 525 1 0x00000000 [ 92.495891] Call Trace: [ 92.501560] schedule+0x23/0x80 [ 92.502923] schedule_preempt_disabled+0x5/0x10 [ 92.504645] __mutex_lock+0x416/0x9e0 [ 92.510760] __blkdev_get+0x73/0x4f0 [ 92.512220] blkdev_get+0x12e/0x390 [ 92.518151] do_dentry_open+0x1c3/0x2f0 [ 92.519815] path_openat+0x5d9/0xdc0 [ 92.521437] do_filp_open+0x7d/0xf0 [ 92.527365] do_sys_open+0x1b8/0x250 [ 92.528831] do_syscall_64+0x6e/0x270 [ 92.530341] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 92.931922] 1 lock held by systemd-udevd/525: [ 92.933642] #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x73/0x4f0 ---------------------------------------- The reason of deadlock turned out that wait_event_interruptible() in blk_queue_enter() got stuck with bdev->bd_mutex held at __blkdev_put() due to q->mq_freeze_depth == 1. ---------------------------------------- [ 92.787172] a.out S12584 634 633 0x80000002 [ 92.789120] Call Trace: [ 92.796693] schedule+0x23/0x80 [ 92.797994] blk_queue_enter+0x3cb/0x540 [ 92.803272] generic_make_request+0xf0/0x3d0 [ 92.807970] submit_bio+0x67/0x130 [ 92.810928] submit_bh_wbc+0x15e/0x190 [ 92.812461] __block_write_full_page+0x218/0x460 [ 92.815792] __writepage+0x11/0x50 [ 92.817209] write_cache_pages+0x1ae/0x3d0 [ 92.825585] generic_writepages+0x5a/0x90 [ 92.831865] do_writepages+0x43/0xd0 [ 92.836972] __filemap_fdatawrite_range+0xc1/0x100 [ 92.838788] filemap_write_and_wait+0x24/0x70 [ 92.840491] __blkdev_put+0x69/0x1e0 [ 92.841949] blkdev_close+0x16/0x20 [ 92.843418] __fput+0xda/0x1f0 [ 92.844740] task_work_run+0x87/0xb0 [ 92.846215] do_exit+0x2f5/0xba0 [ 92.850528] do_group_exit+0x34/0xb0 [ 92.852018] SyS_exit_group+0xb/0x10 [ 92.853449] do_syscall_64+0x6e/0x270 [ 92.854944] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 92.943530] 1 lock held by a.out/634: [ 92.945105] #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x3c/0x1e0 ---------------------------------------- The reason of q->mq_freeze_depth == 1 turned out that loop_set_status() forgot to call blk_mq_unfreeze_queue() at error paths for info->lo_encrypt_type != NULL case. ---------------------------------------- [ 37.509497] CPU: 2 PID: 634 Comm: a.out Tainted: G W 4.16.0+ #457 [ 37.513608] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 37.518832] RIP: 0010:blk_freeze_queue_start+0x17/0x40 [ 37.521778] RSP: 0018:ffffb0c2013e7c60 EFLAGS: 00010246 [ 37.524078] RAX: 0000000000000000 RBX: ffff8b07b1519798 RCX: 0000000000000000 [ 37.527015] RDX: 0000000000000002 RSI: ffffb0c2013e7cc0 RDI: ffff8b07b1519798 [ 37.529934] RBP: ffffb0c2013e7cc0 R08: 0000000000000008 R09: 47a189966239b898 [ 37.532684] R10: dad78b99b278552f R11: 9332dca72259d5ef R12: ffff8b07acd73678 [ 37.535452] R13: 0000000000004c04 R14: 0000000000000000 R15: ffff8b07b841e940 [ 37.538186] FS: 00007fede33b9740(0000) GS:ffff8b07b8e80000(0000) knlGS:0000000000000000 [ 37.541168] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 37.543590] CR2: 00000000206fdf18 CR3: 0000000130b30006 CR4: 00000000000606e0 [ 37.546410] Call Trace: [ 37.547902] blk_freeze_queue+0x9/0x30 [ 37.549968] loop_set_status+0x67/0x3c0 [loop] [ 37.549975] loop_set_status64+0x3b/0x70 [loop] [ 37.549986] lo_ioctl+0x223/0x810 [loop] [ 37.549995] blkdev_ioctl+0x572/0x980 [ 37.550003] block_ioctl+0x34/0x40 [ 37.550006] do_vfs_ioctl+0xa7/0x6d0 [ 37.550017] ksys_ioctl+0x6b/0x80 [ 37.573076] SyS_ioctl+0x5/0x10 [ 37.574831] do_syscall_64+0x6e/0x270 [ 37.576769] entry_SYSCALL_64_after_hwframe+0x42/0xb7 ---------------------------------------- [1] https://syzkaller.appspot.com/bug?id=cd662bc3f6022c0979d01a262c318fab2ee9b56f Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <bot+48594378e9851eab70bcd6f99327c7db58c5a28a@syzkaller.appspotmail.com> Fixes: ecdd09597a572513 ("block/loop: fix race between I/O and set_status") Cc: Ming Lei <tom.leiming@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: stable <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3148ffbd	26-Mar-2018	Omar Sandoval <osandov@fb.com>	loop: use killable lock in ioctls Even after the previous patch to drop lo_ctl_mutex while calling vfs_getattr(), there are other cases where we can end up sleeping for a long time while holding lo_ctl_mutex. Let's avoid the uninterruptible sleep from the ioctls. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 2d1d4c1e	26-Mar-2018	Omar Sandoval <osandov@fb.com>	loop: don't call into filesystem while holding lo_ctl_mutex We hit an issue where a loop device on NFS was stuck in loop_get_status() doing vfs_getattr() after the NFS server died, which caused a pile-up of uninterruptible processes waiting on lo_ctl_mutex. There's no reason to hold this lock while we wait on the filesystem; let's drop it so that other processes can do their thing. We need to grab a reference on lo_backing_file while we use it, and we can get rid of the check on lo_device, which has been unnecessary since commit a34c0ae9ebd6 ("[PATCH] loop: remove the bio remapping capability") in the linux-history tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1d037577	09-Mar-2018	Ross Zwisler <zwisler@kernel.org>	loop: Fix lost writes caused by missing flag The following commit: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") replaced __do_lo_send_write(), which used ITER_KVEC iterators, with lo_write_bvec() which uses ITER_BVEC iterators. In this change, though, the WRITE flag was lost: - iov_iter_kvec(&from, ITER_KVEC \| WRITE, &kvec, 1, len); + iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len); This flag is necessary for the DAX case because we make decisions based on whether or not the iterator is a READ or a WRITE in dax_iomap_actor() and in dax_iomap_rw(). We end up going through this path in configurations where we combine a PMEM device with 4k sectors, a loopback device and DAX. The consequence of this missed flag is that what we intend as a write actually turns into a read in the DAX code, so no data is ever written. The very simplest test case is to create a loopback device and try and write a small string to it, then hexdump a few bytes of the device to see if the write took. Without this patch you read back all zeros, with this you read back the string you wrote. For XFS this causes us to fail or panic during the following xfstests: xfs/074 xfs/078 xfs/216 xfs/217 xfs/250 For ext4 we have a similar issue where writes never happen, but we don't currently have any xfstests that use loopback and show this issue. Fix this by restoring the WRITE flag argument to iov_iter_bvec(). This causes the xfstests to all pass. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8b904b5b	07-Mar-2018	Bart Van Assche <bvanassche@acm.org>	block: Use blk_queue_flag_() in drivers instead of queue_flag_() This patch has been generated as follows: for verb in set_unlocked clear_unlocked set clear; do replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \ $(git grep -lw queue_flag_${verb} drivers block/bsg*) done Except for protecting all queue flag changes with the queue lock this patch does not change any functionality. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0fa8ebdd	28-Feb-2018	Bart Van Assche <bvanassche@acm.org>	block/loop: Delete gendisk before cleaning up the request queue Remove the disk, partition and bdi sysfs attributes before cleaning up the request queue associated with the disk. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Shaohua Li <shli@fb.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3079c22e	26-Feb-2018	Jan Kara <jack@suse.cz>	genhd: Rename get_disk() to get_disk_and_module() Rename get_disk() to get_disk_and_module() to make sure what the function does. It's not a great name but at least it is now clear that put_disk() is not it's counterpart. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ae665016	05-Jan-2018	Linus Torvalds <torvalds@linux-foundation.org>	loop: fix concurrent lo_open/lo_release 范龙飞 reports that KASAN can report a use-after-free in __lock_acquire. The reason is due to insufficient serialization in lo_release(), which will continue to use the loop device even after it has decremented the lo_refcnt to zero. In the meantime, another process can come in, open the loop device again as it is being shut down. Confusion ensues. Reported-by: 范龙飞 <long7573@126.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 0b508bc9	26-Sep-2017	Shaohua Li <shli@fb.com>	block: fix a build error The code is only for blkcg not for all cgroups Fixes: d4478e92d618 ("block/loop: make loop cgroup aware") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d4478e92	25-Sep-2017	Shaohua Li <shli@fb.com>	block/loop: make loop cgroup aware loop block device handles IO in a separate thread. The actual IO dispatched isn't cloned from the IO loop device received, so the dispatched IO loses the cgroup context. I'm ignoring buffer IO case now, which is quite complicated. Making the loop thread aware cgroup context doesn't really help. The loop device only writes to a single file. In current writeback cgroup implementation, the file can only belong to one cgroup. For direct IO case, we could workaround the issue in theory. For example, say we assign cgroup1 5M/s BW for loop device and cgroup2 10M/s. We can create a special cgroup for loop thread and assign at least 15M/s for the underlayer disk. In this way, we correctly throttle the two cgroups. But this is tricky to setup. This patch tries to address the issue. We record bio's css in loop command. When loop thread is handling the command, we then use the API provided in patch 1 to set the css for current task. The bio layer will use the css for new IO (from patch 3). Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# bf093753	05-Sep-2017	Omar Sandoval <osandov@fb.com>	loop: set physical block size to logical block size Commit 6c6b6f28b333 ("loop: set physical block size to PAGE_SIZE") caused mkfs.xfs to barf on ppc64 [1]. Always using PAGE_SIZE as the physical block size still makes the most sense semantically, but let's just lie and always set it to the same value as the logical block size (same goes for io_min). In the future we might want to at least bump up io_min to PAGE_SIZE but I'm sick of these stupid changes so let's play it safe. 1: https://marc.info/?l=linux-xfs&m=150459024723753&w=2 Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 92d77332	01-Sep-2017	Shaohua Li <shli@fb.com>	block/loop: fix use after free lo_rw_aio->call_read_iter-> 1 aops->direct_IO 2 iov_iter_revert lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could be freed before 2, which accesses bvec. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 40326d8a	31-Aug-2017	Shaohua Li <shli@fb.com>	block/loop: allow request merge for directio mode Currently loop disables merge. While it makes sense for buffer IO mode, directio mode can benefit from request merge. Without merge, loop could send small size IO to underlayer disk and harm performance. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 54bb0ade	31-Aug-2017	Shaohua Li <shli@fb.com>	block/loop: set hw_sectors Loop can handle any size of request. Limiting it to 255 sectors just burns the CPU for bio split and request merge for underlayer disk and also cause bad fs block allocation in directio mode. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 43cade80	24-Aug-2017	Omar Sandoval <osandov@fb.com>	loop: fold loop_switch() into callers The comments here are really outdated, and blk-mq made flushing much simpler, so just fold the two cases into the callers. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 89e4fdec	24-Aug-2017	Omar Sandoval <osandov@fb.com>	loop: add ioctl for changing logical block size This is a different approach from the first attempt in f2c6df7dbf9a ("loop: support 4k physical blocksize"). Rather than extending LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block size. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6c6b6f28	24-Aug-2017	Omar Sandoval <osandov@fb.com>	loop: set physical block size to PAGE_SIZE The physical block size is "the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations" (from the comment on blk_queue_physical_block_size()). Since loop does buffered I/O on the backing file by default, the RMW unit is a page. This isn't the case for direct I/O mode, but let's keep it simple. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8a0740c4	24-Aug-2017	Omar Sandoval <osandov@fb.com>	loop: get rid of lo_blocksize This is only used for setting the soft block size on the struct block_device once and then never used again. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1e6ec9ea	23-Aug-2017	Omar Sandoval <osandov@fb.com>	Revert "loop: support 4k physical blocksize" There's some stuff still up in the air, let's not get stuck with a subpar ABI. I'll follow up with something better for 4.14. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a8c1d064	07-Aug-2017	Anton Volkov <avolkov@ispras.ru>	loop: fix to a race condition due to the early registration of device The early device registration made possible a race leading to allocations of disks with wrong minors. This patch moves the device registration further down the loop_init function to make the race infeasible. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Anton Volkov <avolkov@ispras.ru> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# abbb6589	27-May-2017	Christoph Hellwig <hch@lst.de>	fs: implement vfs_iter_write using do_iter_write De-dupliate some code and allow for passing the flags argument to vfs_iter_write. Additionally it now properly updates timestamps. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 18e9710e	27-May-2017	Christoph Hellwig <hch@lst.de>	fs: implement vfs_iter_read using do_iter_read De-dupliate some code and allow for passing the flags argument to vfs_iter_read. Additional it properly updates atime now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# b2ee7d46	15-Jun-2017	NeilBrown <neilb@suse.com>	loop: Add PF_LESS_THROTTLE to block/loop device thread. When a filesystem is mounted from a loop device, writes are throttled by balance_dirty_pages() twice: once when writing to the filesystem and once when the loop_handle_cmd() writes to the backing file. This double-throttling can trigger positive feedback loops that create significant delays. The throttling at the lower level is seen by the upper level as a slow device, so it throttles extra hard. The PF_LESS_THROTTLE flag was created to handle exactly this circumstance, though with an NFS filesystem mounted from a local NFS server. It reduces the throttling on the lower layer so that it can proceed largely unthrottled. To demonstrate this, create a filesystem on a loop device and write (e.g. with dd) several large files which combine to consume significantly more than the limit set by /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total time taken. When I do this directly on a device (no loop device) the total time for several runs (mkfs, mount, write 200 files, umount) is fairly stable: 28-35 seconds. When I do this over a loop device the times are much worse and less stable. 52-460 seconds. Half below 100seconds, half above. When I apply this patch, the times become stable again, though not as fast as the no-loop-back case: 53-72 seconds. There may be room for further improvement as the total overhead still seems too high, but this is a big improvement. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# fc17b653	03-Jun-2017	Christoph Hellwig <hch@lst.de>	blk-mq: switch ->queue_rq return value to blk_status_t Use the same values for use for request completion errors as the return value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause a requeue, and all the others are completed as-is. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 2a842aca	03-Jun-2017	Christoph Hellwig <hch@lst.de>	block: introduce new block status code type Currently we use nornal Linux errno values in the block layer, and while we accept any error a few have overloaded magic meanings. This patch instead introduces a new blk_status_t value that holds block layer specific status codes and explicitly explains their meaning. Helpers to convert from and to the previous special meanings are provided for now, but I suspect we want to get rid of them in the long run - those drivers that have a errno input (e.g. networking) usually get errnos that don't know about the special block layer overloads, and similarly returning them to userspace will usually return somethings that strictly speaking isn't correct for file system operations, but that's left as an exercise for later. For now the set of errors is a very limited set that closely corresponds to the previous overloaded errno values, but there is some low hanging fruite to improve it. blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse typechecking, so that we can easily catch places passing the wrong values. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# b040ad9c	08-Jun-2017	Arnd Bergmann <arnd@arndb.de>	loop: fix error handling regression gcc points out an unusual indentation: drivers/block/loop.c: In function 'loop_set_status': drivers/block/loop.c:1149:3: error: this 'if' clause does not guard... [-Werror=misleading-indentation] if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit, ^~ drivers/block/loop.c:1152:4: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if' goto exit; This was introduced by a new feature that accidentally moved the opening braces from one condition to another. Adding a second pair of braces makes it work correctly again and also more readable. Fixes: f2c6df7dbf9a ("loop: support 4k physical blocksize") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# f2c6df7d	08-Jun-2017	Hannes Reinecke <hare@suse.de>	loop: support 4k physical blocksize When generating bootable VM images certain systems (most notably s390x) require devices with 4k blocksize. This patch implements a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical blocksize to that of the underlying device, and allow to change the logical blocksize for up to the physical blocksize. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 51001b7d	08-Jun-2017	Hannes Reinecke <hare@suse.de>	loop: Remove unused 'bdev' argument from loop_set_capacity Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 64604957	08-Jun-2017	James Wang <jnwang@suse.com>	Fix loop device flush before configure v3 While installing SLES-12 (based on v4.4), I found that the installer will stall for 60+ seconds during LVM disk scan. The root cause was determined to be the removal of a bound device check in loop_flush() by commit b5dd2f6047ca ("block: loop: improve performance via blk-mq"). Restoring this check, examining ->lo_state as set by loop_set_fd() eliminates the bad behavior. Test method: modprobe loop max_loop=64 dd if=/dev/zero of=disk bs=512 count=200K for((i=0;i<4;i++))do losetup -f disk; done mkfs.ext4 -F /dev/loop0 for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done for f in `ls /dev/loop[0-9]*\|sort`; do \ echo $f; dd if=$f of=/dev/null bs=512 count=1; \ done Test output: stock patched /dev/loop0 18.1217e-05 8.3842e-05 /dev/loop1 6.1114e-05 0.000147979 /dev/loop10 0.414701 0.000116564 /dev/loop11 0.7474 6.7942e-05 /dev/loop12 0.747986 8.9082e-05 /dev/loop13 0.746532 7.4799e-05 /dev/loop14 0.480041 9.3926e-05 /dev/loop15 1.26453 7.2522e-05 Note that from loop10 onward, the device is not mounted, yet the stock kernel consumes several orders of magnitude more wall time than it does for a mounted device. (Thanks for Mike Galbraith <efault@gmx.de>, give a changelog review.) Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: James Wang <jnwang@suse.com> Fixes: b5dd2f6047ca ("block: loop: improve performance via blk-mq") Signed-off-by: Jens Axboe <axboe@fb.com>
# d6296d39	01-May-2017	Christoph Hellwig <hch@lst.de>	blk-mq: update ->init_request and ->exit_request prototypes Remove the request_idx parameter, which can't be used safely now that we support I/O schedulers with blk-mq. Except for a superflous check in mtip32xx it was unused anyway. Also pass the tag_set instead of just the driver data - this allows drivers to avoid some code duplication in a follow on cleanup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 08e0029a	20-Apr-2017	Christoph Hellwig <hch@lst.de>	blk-mq: remove the error argument to blk_mq_complete_request Now that all drivers that call blk_mq_complete_requests have a ->complete callback we can remove the direct call to blk_mq_end_request, as well as the error argument to blk_mq_complete_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# fe2cb290	20-Apr-2017	Christoph Hellwig <hch@lst.de>	loop: zero-fill bio on the submitting cpu In thruth I've just audited which blk-mq drivers don't currently have a complete callback, but I think this change is at least borderline useful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 48920ff2	05-Apr-2017	Christoph Hellwig <hch@lst.de>	block: remove the discard_zeroes_data flag Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 19372e27	05-Apr-2017	Christoph Hellwig <hch@lst.de>	loop: implement REQ_OP_WRITE_ZEROES It's identical to discard as hole punches will always leave us with zeroes on reads. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# f363b089	30-Mar-2017	Eric Biggers <ebiggers@google.com>	blk-mq: constify struct blk_mq_ops Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# a528d35e	31-Jan-2017	David Howells <dhowells@redhat.com>	statx: Add a system call to make enhanced file info available Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char filename, unsigned int flags, unsigned int mask, struct statx buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS__FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# e02898b4	01-Mar-2017	Omar Sandoval <osandov@fb.com>	loop: fix LO_FLAGS_PARTSCAN hang loop_reread_partitions() needs to do I/O, but we just froze the queue, so we end up waiting forever. This can easily be reproduced with losetup -P. Fix it by moving the reread to after we unfreeze the queue. Fixes: ecdd09597a57 ("block/loop: fix race between I/O and set_status") Reported-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 89d790ab	27-Feb-2017	Masahiro Yamada <yamada.masahiro@socionext.com>	scripts/spelling.txt: add "algined" pattern and fix typo instances Fix typos and add the following to the scripts/spelling.txt: algined\|\|aligned While we are here, fix the "appplication" in the touched line in drivers/block/loop.c. Also, fix the "may not naturally ..." to "may not be naturally ..." in the touched line in mm/page_alloc. Link: http://lkml.kernel.org/r/1481573103-11329-9-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# bb7462b6	20-Feb-2017	Miklos Szeredi <mszeredi@redhat.com>	vfs: use helpers for calling f_op->{read,write}_iter() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
# ecdd0959	10-Feb-2017	Ming Lei <tom.leiming@gmail.com>	block/loop: fix race between I/O and set_status Inside set_status, transfer need to setup again, so we have to drain IO before the transition, otherwise oops may be triggered like the following: divide error: 0000 [#1] SMP KASAN CPU: 0 PID: 2935 Comm: loop7 Not tainted 4.10.0-rc7+ #213 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006ba1e840 task.stack: ffff880067338000 RIP: 0010:transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP: 0018:ffff88006733f108 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8800688d7000 RCX: 0000000000000059 RDX: 0000000000000000 RSI: 1ffff1000d743f43 RDI: ffff880068891c08 RBP: ffff88006733f160 R08: ffff8800688d7001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800688d7000 R13: ffff880067b7d000 R14: dffffc0000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006c17e0 CR3: 0000000066e3b000 CR4: 00000000001406f0 Call Trace: lo_do_transfer drivers/block/loop.c:251 [inline] lo_read_transfer drivers/block/loop.c:392 [inline] do_req_filebacked drivers/block/loop.c:541 [inline] loop_handle_cmd drivers/block/loop.c:1677 [inline] loop_queue_work+0xda0/0x49b0 drivers/block/loop.c:1689 kthread_worker_fn+0x4c3/0xa30 kernel/kthread.c:630 kthread+0x326/0x3f0 kernel/kthread.c:227 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 Code: 03 83 e2 07 41 29 df 42 0f b6 04 30 4d 8d 44 24 01 38 d0 7f 08 84 c0 0f 85 62 02 00 00 44 89 f8 41 0f b6 48 ff 25 ff 01 00 00 99 <f7> 7d c8 48 63 d2 48 03 55 d0 48 89 d0 48 89 d7 48 c1 e8 03 83 RIP: transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP: ffff88006733f108 ---[ end trace 0166f7bd3b0c0933 ]--- Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 7c0f6ba6	24-Dec-2016	Linus Torvalds <torvalds@linux-foundation.org>	Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]#[[:blank:]]include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"\|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# b4a567e8	14-Nov-2016	Omar Sandoval <osandov@fb.com>	loop: return proper error from loop_queue_rq() ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not an errno. f4aa4c7bbac6 ("block: loop: convert to per-device workqueue") Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
# 3989144f	11-Oct-2016	Petr Mladek <pmladek@suse.com>	kthread: kthread worker API cleanup A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that process queued works. It is implemented as part of the kthread subsystem. This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_: __init_kthread_worker() -> __kthread_init_worker() init_kthread_worker() -> kthread_init_worker() init_kthread_work() -> kthread_init_work() insert_kthread_work() -> kthread_insert_work() queue_kthread_work() -> kthread_queue_work() flush_kthread_work() -> kthread_flush_work() flush_kthread_worker() -> kthread_flush_worker() Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names. Note that INIT() macros and init() functions use different naming scheme. There is no good solution. There are several reasons for this solution: + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer". + INIT() macros are used only in DEFINE() macros + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme. + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme. + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(), + It is not an argument but it was inconsistent even before. [arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 7d7e0f90	14-Sep-2016	Christoph Hellwig <hch@lst.de>	blk-mq: remove ->map_queue All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# c1c87c2b	04-Aug-2016	Christoph Hellwig <hch@lst.de>	loop: make do_req_filebacked more robust Use a switch statement to iterate over the possible operations and error out if it's an incorrect one. Signed-off-by: Jens Axboe <axboe@fb.com>
# f0225cac	04-Aug-2016	Christoph Hellwig <hch@lst.de>	loop: don't try to use AIO for discards Fix a fat-fingered conversion to the req_op accessors, and also use a switch statement to make it more obvious what is being checked. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: c2df40 ("drivers: use req op accessor"); Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 7a649737	06-Jun-2016	Minfei Huang <mnghuan@gmail.com>	loop: Make user notify for adding loop device failed There is no error number returned if loop driver fails in function alloc_disk to add new loop device. Add a correct error number to make user notify in this case. Signed-off-by: Minfei Huang <mnghuan@gmail.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 3a5e02ce	05-Jun-2016	Mike Christie <mchristi@redhat.com>	block, drivers: add REQ_OP_FLUSH operation This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# c2df40df	05-Jun-2016	Mike Christie <mchristi@redhat.com>	drivers: use req op accessor The req operation REQ_OP is separated from the rq_flag_bits definition. This converts the block layer drivers to use req_op to get the op from the request struct. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# a8ebb056	05-Jun-2016	Mike Christie <mchristi@redhat.com>	block, drivers, cgroup: use op_is_write helper instead of checking for REQ_WRITE We currently set REQ_WRITE/WRITE for all non READ IOs like discard, flush, writesame, etc. In the next patches where we no longer set up the op as a bitmap, we will not be able to detect a operation direction like writesame by testing if REQ_WRITE is set. This patch converts the drivers and cgroup to use the op_is_write helper. This should just cover the simple cases. I did dm, md and bcache in their own patches because they were more involved. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# a7297a6a	15-Apr-2016	Ming Lei <ming.lei@canonical.com>	block: loop: fix filesystem corruption in case of aio/dio Starting from commit e36f620428(block: split bios to max possible length), block core starts to split bio in the middle of bvec. Unfortunately loop dio/aio doesn't consider this situation, and always treat 'iter.iov_offset' as zero. Then filesystem corruption is observed. This patch figures out the offset of the base bvevc via 'bio->bi_iter.bi_bvec_done' and fixes the issue by passing the offset to iov iterator. Fixes: e36f6204288088f (block: split bios to max possible length) Cc: Keith Busch <keith.busch@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org (4.5) Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 21d0727f	30-Mar-2016	Jens Axboe <axboe@fb.com>	loop: switch to using blk_queue_write_cache() Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
# f4829a9b	27-Sep-2015	Christoph Hellwig <hch@lst.de>	blk-mq: fix racy updates of rq->errors blk_mq_complete_request may be a no-op if the request has already been completed by others means (e.g. a timeout or cancellation), but currently drivers have to set rq->errors before calling blk_mq_complete_request, which might leave us with the wrong error value. Add an error parameter to blk_mq_complete_request so that we can defer setting rq->errors until we known we won the race to complete the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# bc07c10a	16-Aug-2015	Ming Lei <ming.lei@canonical.com>	block: loop: support DIO & AIO There are at least 3 advantages to use direct I/O and AIO on read/write loop's backing file: 1) double cache can be avoided, then memory usage gets decreased a lot 2) not like user space direct I/O, there isn't cost of pinning pages 3) avoid context switch for obtaining good throughput - in buffered file read, random I/O top throughput is often obtained only if they are submitted concurrently from lots of tasks; but for sequential I/O, most of times they can be hit from page cache, so concurrent submissions often introduce unnecessary context switch and can't improve throughput much. There was such discussion[1] to use non-blocking I/O to improve the problem for application. - with direct I/O and AIO, concurrent submissions can be avoided and random read throughput can't be affected meantime xfstests(-g auto, ext4) is basically passed when running with direct I/O(aio), one exception is generic/232, but it failed in loop buffered I/O(4.2-rc6-next-20150814) too. Follows the fio test result for performance purpose: 4 jobs fio test inside ext4 file system over loop block 1) How to run - KVM: 4 VCPUs, 2G RAM - linux kernel: 4.2-rc6-next-20150814(base) with the patchset - the loop block is over one image on SSD. - linux psync, 4 jobs, size 1500M, ext4 over loop block - test result: IOPS from fio output 2) Throughput(IOPS) becomes a bit better with direct I/O(aio) ------------------------------------------------------------- test cases \|randread \|read \|randwrite \|write \| ------------------------------------------------------------- base \|8015 \|113811 \|67442 \|106978 ------------------------------------------------------------- base+loop aio \|8136 \|125040 \|67811 \|111376 ------------------------------------------------------------- - somehow, it should be caused by more page cache avaiable for application or one extra page copy is avoided in case of direct I/O 3) context switch - context switch decreased by ~50% with loop direct I/O(aio) compared with loop buffered I/O(4.2-rc6-next-20150814) 4) memory usage from /proc/meminfo ------------------------------------------------------------- \| Buffers \| Cached ------------------------------------------------------------- base \| > 760MB \| ~950MB ------------------------------------------------------------- base+loop direct I/O(aio) \| < 5MB \| ~1.6GB ------------------------------------------------------------- - so there are much more page caches available for application with direct I/O [1] https://lwn.net/Articles/612483/ Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# ab1cb278	16-Aug-2015	Ming Lei <ming.lei@canonical.com>	block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO If loop block is mounted via 'mount -o loop', it isn't easy to pass file descriptor opened as O_DIRECT, so this patch introduces a new command to support direct IO for this case. Cc: linux-api@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 2e5ab5f3	16-Aug-2015	Ming Lei <ming.lei@canonical.com>	block: loop: prepare for supporing direct IO This patches provides one interface for enabling direct IO from user space: - userspace(such as losetup) can pass 'file' which is opened/fcntl as O_DIRECT Also __loop_update_dio() is introduced to check if direct I/O can be used on current loop setting. The last big change is to introduce LO_FLAGS_DIRECT_IO flag for userspace to know if direct IO is used to access backing file. Cc: linux-api@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# e03a3d7a	16-Aug-2015	Ming Lei <ming.lei@canonical.com>	block: loop: use kthread_work The following patch will use dio/aio to submit IO to backing file, then it needn't to schedule IO concurrently from work, so use kthread_work for decreasing context switch cost a lot. For non-AIO case, single thread has been used for long long time, and it was just converted to work in v4.0, which has caused performance regression for fedora live booting already. In discussion[1], even though submitting I/O via work concurrently can improve random read IO throughput, meantime it might hurt sequential read IO performance, so better to restore to single thread behaviour. For the following AIO support, it is better to use multi hw-queue with per-hwq kthread than current work approach suppose there is so high performance requirement for loop. [1] http://marc.info/?t=143082678400002&r=1&w=2 Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 5b5e20f4	16-Aug-2015	Ming Lei <ming.lei@canonical.com>	block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop It doesn't make sense to enable merge because the I/O submitted to backing file is handled page by page. Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 2bb4cd5c	14-Jul-2015	Jens Axboe <axboe@fb.com>	block: have drivers use blk_queue_max_discard_sectors() Some drivers use it now, others just set the limits field manually. But in preparation for splitting this into a hard and soft limit, ensure that they all call the proper function for setting the hw limit for discards. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 9bf39ab2	19-Jun-2015	Miklos Szeredi <mszeredi@suse.cz>	vfs: add file_path() helper Turn d_path(&file->f_path, ...); into file_path(file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 6a927007	20-May-2015	Jens Axboe <axboe@fb.com>	loop: remove (now) unused 'out' label gcc, righfully, complains: drivers/block/loop.c:1369:1: warning: label 'out' defined but not used [-Wunused-label] Kill it. Signed-off-by: Jens Axboe <axboe@fb.com>
# 06f0e9e6	05-May-2015	Ming Lei <ming.lei@canonical.com>	block: loop: fix another reread part failure loop_clr_fd() can be run piggyback with lo_release(), and under this situation, reread partition may always fail because bd_mutex has been held already. This patch detects the situation by the reference count, and call __blkdev_reread_part() to avoid acquiring the lock again. In the meantime, this patch switches to new kernel APIs of blkdev_reread_part() and __blkdev_reread_part(). Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# f8933667	05-May-2015	Ming Lei <ming.lei@canonical.com>	block: loop: don't hold lo_ctl_mutex in lo_open The lo_ctl_mutex is held for running all ioctl handlers, and in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for rereading partitions, which requires bd_mutex. So it is easy to cause failure because trylock(bd_mutex) may fail inside blkdev_reread_part(), and follows the lock context: blkid or other application: ->open() ->mutex_lock(bd_mutex) ->lo_open() ->mutex_lock(lo_ctl_mutex) losetup(set fd ioctl): ->mutex_lock(lo_ctl_mutex) ->ioctl_by_bdev(BLKRRPART) ->trylock(bd_mutex) This patch trys to eliminate the ABBA lock dependency by removing lo_ctl_mutext in lo_open() with the following approach: 1) make lo_refcnt as atomic_t and avoid acquiring lo_ctl_mutex in lo_open(): - for open vs. add/del loop, no any problem because of loop_index_mutex - freeze request queue during clr_fd, so I/O can't come until clearing fd is completed, like the effect of holding lo_ctl_mutex in lo_open - both open() and release() have been serialized by bd_mutex already 2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in lo_release(), then lo_ctl_mutex is only required for the last release. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 4d4e41ae	05-May-2015	Ming Lei <ming.lei@canonical.com>	block: loop: avoiding too many pending per work I/O If there are too many pending per work I/O, too many high priority work thread can be generated so that system performance can be effected. This patch limits the max_active parameter of workqueue as 16. This patch fixes Fedora 22 live booting performance regression when it is booted from squashfs over dm based on loop, and looks the following reasons are related with the problem: - not like other filesyststems(such as ext4), squashfs is a bit special, and I observed that increasing I/O jobs to access file in squashfs only improve I/O performance a little, but it can make big difference for ext4 - nested loop: both squashfs.img and ext3fs.img are mounted as loop block, and ext3fs.img is inside the squashfs - during booting, lots of tasks may run concurrently Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778 Cc: stable@vger.kernel.org (v4.0) Cc: Justin M. Forbes <jforbes@fedoraproject.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
# f4aa4c7b	05-May-2015	Ming Lei <ming.lei@canonical.com>	block: loop: convert to per-device workqueue Documentation/workqueue.txt: If there is dependency among multiple work items used during memory reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM. Loop devices can be stacked, so we have to convert to per-device workqueue. One example is Fedora live CD. Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778 Cc: stable@vger.kernel.org (v4.0) Cc: Justin M. Forbes <jforbes@fedoraproject.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
# 6cd18e71	26-Apr-2015	NeilBrown <neilb@suse.de>	block: destroy bdi before blockdev is unregistered. Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit c4db59d31e39ea067c32163ac961e9c80198fd37 Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37 Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# aa4d8616	07-Apr-2015	Christoph Hellwig <hch@lst.de>	block: loop: switch to VFS ITER_BVEC Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 283e7e5d	03-Apr-2015	Al Viro <viro@zeniv.linux.org.uk>	switch /dev/loop to vfs_iter_write() all writable files that might be used as backing store for /dev/loop already support ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 78e367a3	02-Jan-2015	Jens Axboe <axboe@fb.com>	loop: add blk-mq.h include Looks like we pull it in through other ways on x86, but we fail on sparc: In file included from drivers/block/cryptoloop.c:30:0: drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type struct blk_mq_tag_set tag_set; Add the include to loop.h, kill it from loop.c. Signed-off-by: Jens Axboe <axboe@fb.com>
# af65aa8e	31-Dec-2014	Ming Lei <ming.lei@canonical.com>	block: loop: don't handle REQ_FUA explicitly block core handles REQ_FUA by its flush state machine, so won't do it in loop explicitly. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# cf655d95	31-Dec-2014	Ming Lei <ming.lei@canonical.com>	block: loop: introduce lo_discard() and lo_req_flush() No behaviour change, just move the handling for REQ_DISCARD and REQ_FLUSH in these two functions. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 30112013	31-Dec-2014	Ming Lei <ming.lei@canonical.com>	block: loop: say goodby to bio Switch to block request completely. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# b5dd2f60	31-Dec-2014	Ming Lei <ming.lei@canonical.com>	block: loop: improve performance via blk-mq The conversion is a bit straightforward, and use work queue to dispatch requests of loop block, and one big change is that requests is submitted to backend file/device concurrently with work queue, so throughput may get improved much. Given write requests over same file are often run exclusively, so don't handle them concurrently for avoiding extra context switch cost, possible lock contention and work schedule cost. Also with blk-mq, there is opportunity to get loop I/O merged before submitting to backend file/device. In the following test: - base: v3.19-rc2-2041231 - loop over file in ext4 file system on SSD disk - bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1 - throughput: IOPS ------------------------------------------------------ \| \| base \| base with loop-mq \| delta \| ------------------------------------------------------ \| randread \| 1740 \| 25318 \| +1355%\| ------------------------------------------------------ \| read \| 42196 \| 51771 \| +22.6%\| ----------------------------------------------------- \| randwrite \| 35709 \| 34624 \| -3% \| ----------------------------------------------------- \| write \| 39137 \| 40326 \| +3% \| ----------------------------------------------------- So loop-mq can improve throughput for both read and randread, meantime, performance of write and randwrite isn't hurted basically. Another benefit is that loop driver code gets simplified much after blk-mq conversion, and the patch can be thought as cleanup too. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 8698a745	11-Mar-2014	Dongsheng Yang <yangds.fnst@cn.fujitsu.com>	sched, treewide: Replace hardcoded nice values with MIN_NICE/MAX_NICE Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com Cc: devel@driverdev.osuosl.org Cc: devicetree@vger.kernel.org Cc: fcoe-devel@open-fcoe.org Cc: linux390@de.ibm.com Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-s390@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: nbd-general@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: openipmi-developer@lists.sourceforge.net Cc: qla2xxx-upstream@qlogic.com Cc: linux-arch@vger.kernel.org [ Consolidated the patches, twiddled the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
# 44bd70c3	08-Apr-2014	Mike Galbraith <bitbucket@online.de>	drivers/block/loop.c: ratelimit error messages Metric tons of high speed spew is not helpful when things go pear shaped. systemd lost its mind, forgot how to stop services it insists on being sole manager of, massive printk() flood ensued, box eventually died. [16206.684000] loop: Write error at byte offset 11412291584, length 4096. [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost. [16206.684000] loop: Write error at byte offset 13155434496, length 4096. [16206.684000] loop: Write error at byte offset 13155438592, length 4096. [16206.684000] loop: Write error at byte offset 13155442688, length 4096. [16206.684000] loop: Write error at byte offset 13960736768, length 4096. [16206.684000] loop: Write error at byte offset 14229172224, length 4096. [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost. [16206.684000] loop: Write error at byte offset 14766043136, length 4096. [16206.684000] loop: Write error at byte offset 15034478592, length 4096. [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost. Signed-off-by: Mike Galbraith <bitbucket@online.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>
# 12a64d2f	21-Jan-2014	Olaf Hering <olaf@aepfle.de>	drivers/block/loop.c: fix comment typo in loop_config_discard Discard requests are ignored if the encryption is enabled for the given loop device. Update comment to match the code, and similar comments elsewhere in the file. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7988613b	23-Nov-2013	Kent Overstreet <kmo@daterainc.com>	block: Convert bio_for_each_segment() to bvec_iter More prep work for immutable biovecs - with immutable bvecs drivers won't be able to use the biovec directly, they'll need to use helpers that take into account bio->bi_iter.bi_bvec_done. This updates callers for the new usage without changing the implementation yet. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Jim Paris <jim@jtan.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: support@lsi.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Cc: linux-m68k@lists.linux-m68k.org Cc: linuxppc-dev@lists.ozlabs.org Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net Cc: cbe-oss-dev@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: DL-MPTFusionLinux@lsi.com Cc: linux-scsi@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: linux-mm@kvack.org Acked-by: Geoff Levand <geoff@infradead.org>
# 4f024f37	11-Oct-2013	Kent Overstreet <kmo@daterainc.com>	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
# ef7e7c82	15-Oct-2013	Mikulas Patocka <mpatocka@redhat.com>	loop: fix crash when using unassigned loop device When the loop module is loaded, it creates 8 loop devices /dev/loop[0-7]. The devices have no request routine and thus, when they are used without being assigned, a crash happens. For example, these commands cause crash (assuming there are no used loop devices): Kernel Fault: Code=26 regs=000000007f420980 (Addr=0000000000000010) CPU: 1 PID: 50 Comm: kworker/1:1 Not tainted 3.11.0 #1 Workqueue: ksnaphd do_metadata [dm_snapshot] task: 000000007fcf4078 ti: 000000007f420000 task.ti: 000000007f420000 [ 116.319988] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001001111111100001111 Not tainted r00-03 000000ff0804ff0f 00000000408bf5d0 00000000402d8204 000000007b7ff6c0 r04-07 00000000408a95d0 000000007f420950 000000007b7ff6c0 000000007d06c930 r08-11 000000007f4205c0 0000000000000001 000000007f4205c0 000000007f4204b8 r12-15 0000000000000010 0000000000000000 0000000000000000 0000000000000000 r16-19 000000001108dd48 000000004061cd7c 000000007d859800 000000000800000f r20-23 0000000000000000 0000000000000008 0000000000000000 0000000000000000 r24-27 00000000ffffffff 000000007b7ff6c0 000000007d859800 00000000408a95d0 r28-31 0000000000000000 000000007f420950 000000007f420980 000000007f4208e8 sr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000303000 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 117.549988] IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d82fc 00000000402d8300 IIR: 53820020 ISR: 0000000000000000 IOR: 0000000000000010 CPU: 1 CR30: 000000007f420000 CR31: ffffffffffffffff ORIG_R28: 0000000000000001 IAOQ[0]: generic_make_request+0x11c/0x1a0 IAOQ[1]: generic_make_request+0x120/0x1a0 RP(r2): generic_make_request+0x24/0x1a0 Backtrace: [<00000000402d83f0>] submit_bio+0x70/0x140 [<0000000011087c4c>] dispatch_io+0x234/0x478 [dm_mod] [<0000000011087f44>] sync_io+0xb4/0x190 [dm_mod] [<00000000110883bc>] dm_io+0x2c4/0x310 [dm_mod] [<00000000110bfcd0>] do_metadata+0x28/0xb0 [dm_snapshot] [<00000000401591d8>] process_one_work+0x160/0x460 [<0000000040159bc0>] worker_thread+0x300/0x478 [<0000000040161a70>] kthread+0x118/0x128 [<0000000040104020>] end_fault_vector+0x20/0x28 [<0000000040177220>] task_tick_fair+0x420/0x4d0 [<00000000401aa048>] invoke_rcu_core+0x50/0x60 [<00000000401ad5b8>] rcu_check_callbacks+0x210/0x8d8 [<000000004014aaa0>] update_process_times+0xa8/0xc0 [<00000000401ab86c>] rcu_process_callbacks+0x4b4/0x598 [<0000000040142408>] __do_softirq+0x250/0x2c0 [<00000000401789d0>] find_busiest_group+0x3c0/0xc70 [ 119.379988] Kernel panic - not syncing: Kernel Fault Rebooting in 1 seconds.. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a207f593	13-Oct-2013	Mikulas Patocka <mpatocka@redhat.com>	block: fix a probe argument to blk_register_region The probe function is supposed to return NULL on failure (as we can see in kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj; However, in loop and brd, it returns negative error from ERR_PTR. This causes a crash if we simulate disk allocation failure and run less -f /dev/loop0 because the negative number is interpreted as a pointer: BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4 IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450 PGD 23c677067 PUD 23d6d1067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia] CPU: 1 PID: 6831 Comm: less Tainted: P W O 3.10.15-devel #18 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000 RIP: 0010:[<ffffffff8118b188>] [<ffffffff8118b188>] __blkdev_get+0x28/0x450 RSP: 0018:ffff88023e47dbd8 EFLAGS: 00010286 RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658 R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640 FS: 00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: 0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640 ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60 ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292 Call Trace: [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60 [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60 [<ffffffff8118c365>] blkdev_open+0x65/0x80 [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0 [<ffffffff8114d214>] finish_open+0x34/0x50 [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50 [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0 [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0 [<ffffffff8115f4f0>] do_filp_open+0x40/0x90 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50 [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0 [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0 [<ffffffff8114e559>] SyS_open+0x19/0x20 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5 a4 07 00 44 89 RIP [<ffffffff8118b188>] __blkdev_get+0x28/0x450 RSP <ffff88023e47dbd8> CR2: 00000000000002b4 ---[ end trace bb7f32dbf02398dc ]--- The brd change should be backported to stable kernels starting with 2.6.25. The loop change should be backported to stable kernels starting with 2.6.22. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org # 2.6.22+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3ec981e3	13-Oct-2013	Mikulas Patocka <mpatocka@redhat.com>	loop: fix crash if blk_alloc_queue fails loop: fix crash if blk_alloc_queue fails If blk_alloc_queue fails, loop_add cleans up, but it doesn't clean up the identifier allocated with idr_alloc. That causes crash on module unload in idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); where we attempt to remove non-existed device with that id. BUG: unable to handle kernel NULL pointer dereference at 0000000000000380 IP: [<ffffffff812057c9>] del_gendisk+0x19/0x2d0 PGD 43d399067 PUD 43d0ad067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: loop(-) dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_powersave spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc lm85 hwmon_vid snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq ohci_hcd freq_table tg3 ehci_pci mperf ehci_hcd kvm_amd kvm sata_svw serverworks libphy libata ide_core k10temp usbcore hwmon microcode ptp pcspkr pps_core e100 skge mii usb_common i2c_piix4 floppy evdev rtc_cmos i2c_core processor but! ton unix CPU: 7 PID: 2735 Comm: rmmod Tainted: G W 3.10.15-devel #15 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 task: ffff88043d38e780 ti: ffff88043d21e000 task.ti: ffff88043d21e000 RIP: 0010:[<ffffffff812057c9>] [<ffffffff812057c9>] del_gendisk+0x19/0x2d0 RSP: 0018:ffff88043d21fe10 EFLAGS: 00010282 RAX: ffffffffa05102e0 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88043ea82800 RDI: 0000000000000000 RBP: ffff88043d21fe48 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000ff R13: 0000000000000080 R14: 0000000000000000 R15: ffff88043ea82800 FS: 00007ff646534700(0000) GS:ffff880447000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000380 CR3: 000000043e9bf000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffff8100aba4 0000000000000092 ffff88043d21fe48 ffff88043ea82800 00000000000000ff ffff88043d21fe98 0000000000000000 ffff88043d21fe60 ffffffffa05102b4 0000000000000000 ffff88043d21fe70 ffffffffa05102ec Call Trace: [<ffffffff8100aba4>] ? native_sched_clock+0x24/0x80 [<ffffffffa05102b4>] loop_remove+0x14/0x40 [loop] [<ffffffffa05102ec>] loop_exit_cb+0xc/0x10 [loop] [<ffffffff81217b74>] idr_for_each+0x104/0x190 [<ffffffffa05102e0>] ? loop_remove+0x40/0x40 [loop] [<ffffffff8109adc5>] ? trace_hardirqs_on_caller+0x105/0x1d0 [<ffffffffa05135dc>] loop_exit+0x34/0xa58 [loop] [<ffffffff810a98ea>] SyS_delete_module+0x13a/0x260 [<ffffffff81221d5e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f Code: f0 4c 8b 6d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 4c 8d af 80 00 00 00 41 54 53 48 89 fb 48 83 ec 18 <48> 83 bf 80 03 00 00 00 74 4d e8 98 fe ff ff 31 f6 48 c7 c7 20 RIP [<ffffffff812057c9>] del_gendisk+0x19/0x2d0 RSP <ffff88043d21fe10> CR2: 0000000000000380 ---[ end trace 64ec069ec70f1309 ]--- Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org # 3.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 83a87611	12-May-2013	Al Viro <viro@zeniv.linux.org.uk>	move linux/loop.h to drivers/block Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# db2a144b	05-May-2013	Al Viro <viro@zeniv.linux.org.uk>	block_device_operations->release() should return void The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) and it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 03d95eb2	20-Mar-2013	Al Viro <viro@zeniv.linux.org.uk>	lift sb_start_write() out of ->write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# c2fccc1c	08-Apr-2013	Jens Axboe <axboe@kernel.dk>	Revert "loop: cleanup partitions when detaching loop device" This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107. There are situations where the destruction path is called with the bdev->bd_mutex already held, which then deadlocks in loop_clr_fd(). The normal partition cleanup does a trylock() on the mutex, but it'd be nice to have a more bullet proof method in loop. So punt this more involved fix to the next merge window, and just back out this buggy fix for now. Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c1681bf8	01-Apr-2013	Anatol Pomozov <anatol.pomozov@gmail.com>	loop: prevent bdev freeing while device in use struct block_device lifecycle is defined by its inode (see fs/block_dev.c) - block_device allocated first time we access /dev/loopXX and deallocated on bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile" we want that block_device stay alive until we destroy the loop device with "losetup -d". But because we do not hold /dev/loopXX inode its counter goes 0, and inode/bdev can be destroyed at any moment. Usually it happens at memory pressure or when user drops inode cache (like in the test below). When later in loop_clr_fd() we want to use bdev we have use-after-free error with following stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000280 bd_set_size+0x10/0xa0 loop_clr_fd+0x1f8/0x420 [loop] lo_ioctl+0x200/0x7e0 [loop] lo_compat_ioctl+0x47/0xe0 [loop] compat_blkdev_ioctl+0x341/0x1290 do_filp_open+0x42/0xa0 compat_sys_ioctl+0xc1/0xf20 do_sys_open+0x16e/0x1d0 sysenter_dispatch+0x7/0x1a To prevent use-after-free we need to grab the device in loop_set_fd() and put it later in loop_clr_fd(). The issue is reprodusible on current Linus head and v3.3. Here is the test: dd if=/dev/zero of=loop.file bs=1M count=1 while [ true ]; do losetup /dev/loop0 loop.file echo 2 > /proc/sys/vm/drop_caches losetup -d /dev/loop0 done [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every time we call loop_set_fd() we check that loop_device->lo_state is Lo_unbound and set it to Lo_bound If somebody will try to set_fd again it will get EBUSY. And if we try to loop_clr_fd() on unbound loop device we'll get ENXIO. loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under loop_device->lo_ctl_mutex. ] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 8761a3dc	22-Mar-2013	Phillip Susi <psusi@ubuntu.com>	loop: cleanup partitions when detaching loop device Any partitions added by user space to the loop device were being left in place after detaching the loop device. This was because the detach path issued a BLKRRPART to clean up partitions if LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto scanned on attach. Replace this BLKRRPART with code that unconditionally cleans up partitions on detach instead. Signed-off-by: Phillip Susi <psusi@ubuntu.com> Modified by Jens to export delete_partition(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 183cfb57	22-Mar-2013	Wei Yongjun <yongjun_wei@trendmicro.com.cn>	loop: fix error return code in loop_add() Fix to return a negative error code from the error handling case, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c718aa65	27-Feb-2013	Tejun Heo <tj@kernel.org>	block/loop: convert to idr_alloc() Convert to the much saner new idr interface. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 9d609166	27-Feb-2013	Tejun Heo <tj@kernel.org>	block/loop: don't use idr_remove_all() idr_destroy() can destroy idr by itself and idr_remove_all() is being deprecated. Drop its usage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 3dadecce	24-Jan-2013	Al Viro <viro@zeniv.linux.org.uk>	switch vfs_getattr() to struct path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# b7a1da69	21-Feb-2013	Guo Chao <yan@linux.vnet.ibm.com>	loopdev: ignore negative offset when calculate loop device size Negative offset may cause loop device size larger than backing file size. $ fallocate -l 1M a $ losetup --offset 0xffffffffffff0000 /dev/loop0 a $ blockdev --getsize64 /dev/loop0 1114112 $ ls -l a -rw-r--r-- 1 root root 1048576 Jan 23 12:46 a $ cat /dev/loop0 cat: /dev/loop0: Input/output error It makes no sense to do that. Only apply offset when it's positive. Fix a typo in the comment by the way. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b1a66504	21-Feb-2013	Guo Chao <yan@linux.vnet.ibm.com>	loopdev: remove an user triggerable oops When loopdev is built as module and we pass an invalid parameter, loop_init() will return directly without deregister misc device, which will cause an oops when insert loop module next time because we left some garbage in the misc device list. Test case: sudo modprobe loop max_part=1024 (failed due to invalid parameter) sudo modprobe loop (oops) Clean up nicely to avoid such oops. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7b0576a3	21-Feb-2013	Guo Chao <yan@linux.vnet.ibm.com>	loopdev: move common code into loop_figure_size() Update block device size in accord with gendisk size and let userspace know the change in loop_figure_size(). This is a clean up to remove common code of loop_figure_size()'s two callers. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 541c742a	21-Feb-2013	Guo Chao <yan@linux.vnet.ibm.com>	loopdev: update block device size in loop_set_status() Loop device driver sometimes fails to impose the size limit on the device. Keep issuing following two commands: losetup --offset 7517244416 --sizelimit 3224971264 /dev/loop0 backed_file blockdev --getsize64 /dev/loop0 blockdev reports file size instead of sizelimit several out of 100 times. The problems are: - losetup set up the device in two ioctl: LOOP_SET_FD and LOOP_SET_STATUS64. - LOOP_SET_STATUS64 only update size of gendisk. Block device size will be updated lazily when device comes to use. If udev rushes in between the two ioctl, it will bring in a block device whose size is backing file size. If the device is not released after LOOP_SET_STATUS64 ioctl, blockdev will not see the updated size. Update block size in LOOP_SET_STATUS64 ioctl. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Reported-by: M. Hindess <hindessm@uk.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 5370019d	21-Feb-2013	Guo Chao <yan@linux.vnet.ibm.com>	loopdev: fix a deadlock bd_mutex and lo_ctl_mutex can be held in different order. Path #1: blkdev_open blkdev_get __blkdev_get (hold bd_mutex) lo_open (hold lo_ctl_mutex) Path #2: blkdev_ioctl lo_ioctl (hold lo_ctl_mutex) lo_set_capacity (hold bd_mutex) Lockdep does not report it, because path #2 actually holds a subclass of lo_ctl_mutex. This subclass seems creep into the code by mistake. The patch author actually just mentioned it in the changelog, see commit f028f3b2 ("loop: fix circular locking in loop_clr_fd()"), also see: http://marc.info/?l=linux-kernel&m=123806169129727&w=2 Path #2 hold bd_mutex to call bd_set_size(), I've protected it with i_mutex in a previous patch, so drop bd_mutex at this site. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7b5a3522	30-Nov-2012	Lukas Czerner <lczerner@redhat.com>	loop: Limit the number of requests in the bio list Currently there is not limitation of number of requests in the loop bio list. This can lead into some nasty situations when the caller spawns tons of bio requests taking huge amount of memory. This is even more obvious with discard where blkdev_issue_discard() will submit all bios for the range and wait for them to finish afterwards. On really big loop devices and slow backing file system this can lead to OOM situation as reported by Dave Chinner. With this patch we will wait in loop_make_request() if the number of bios in the loop bio list would exceed 'nr_congestion_on'. We'll wake up the process as we process the bios form the list. Some threshold hysteresis is in place to avoid high frequency oscillation. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a1ecac3b	28-Sep-2012	Dave Chinner <dchinner@redhat.com>	loop: Make explicit loop device destruction lazy xfstests has always had random failures of tests due to loop devices failing to be torn down and hence leaving filesytems that cannot be unmounted. This causes test runs to immediately stop. Over the past 6 or 7 years we've added hacks like explicit unmount -d commands for loop mounts, losetup -d after unmount -d fails, etc, but still the problems persist. Recently, the frequency of loop related failures increased again to the point that xfstests 259 will reliably fail with a stray loop device that was not torn down. That is despite the fact the test is above as simple as it gets - loop 5 or 6 times running mkfs.xfs with different paramters: lofile=$(losetup -f) losetup $lofile "$testfile" "$MKFS_XFS_PROG" -b size=512 $lofile >/dev/null \|\| echo "mkfs failed!" sync losetup -d $lofile And losteup -d $lofile is failing with EBUSY on 1-3 of these loops every time the test is run. Turns out that blkid is running simultaneously with losetup -d, and so it sees an elevated reference count and returns EBUSY. But why is blkid running? It's obvious, isn't it? udev has decided to try and find out what is on the block device as a result of a creation notification. And it is racing with mkfs, so might still be scanning the device when mkfs finishes and we try to tear it down. So, make losetup -d force autoremove behaviour. That is, when the last reference goes away, tear down the device. xfstests wants it gone, not causing random teardown failures when we know that all the operations the tests have specifically run on the device have completed and are no longer referencing the loop device. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# e4849737	11-Feb-2012	Eric W. Biederman <ebiederm@xmission.com>	userns: Convert loop to use kuid_t instead of uid_t Cc: Jens Axboe <jaxboe@fusionio.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
# 68d740d7	14-Jul-2012	Silva Paulo <psdasilva@yahoo.com>	blk: fix wrong idr_pre_get() error check in loop.c The idr_pre_get() function never returns a value < 0. It returns 0 (no memory) or 1 (OK). Reported-by: Silva Paulo <psdasilva@yahoo.com> [ Rewrote Silva's patch, but attributing it to Silva anyway - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# cfd8005c	25-Nov-2011	Cong Wang <amwang@redhat.com>	block: remove the second argument of k[un]map_atomic() Signed-off-by: Cong Wang <amwang@redhat.com>
# 306df071	08-Feb-2012	Dave Young <dyoung@redhat.com>	loop: zero fill bio instead of return -EIO for partial read commit 8268f5a741 ("deny partial write for loop dev fd") tried to fix the loop device partial read information leak problem. But it changed the semantics of read behavior. When we read beyond the end of the device we should get 0 bytes, which is normal behavior, we should not just return -EIO Instead of returning -EIO, zero out the bio to avoid information leak in case of partail read. Signed-off-by: Dave Young <dyoung@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Jeff Moyer <jmoyer@redhat.com> Cc: Dmitry Monakhov <dmonakhov@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ff01bb48	16-Sep-2011	Al Viro <viro@zeniv.linux.org.uk>	fs: move code out of buffer.c Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# dfaf3c03	02-Dec-2011	Lukas Czerner <lczerner@redhat.com>	loop: Fix discard_alignment default setting discard_alignment is not relevant to the loop driver since it is supposed to be set as a workaround for the old sector 63 alignments. So set it to zero rather than block size. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ae95757a	25-Nov-2011	Dave Young <dyoung@redhat.com>	loop: fix loop block driver discard and encryption comment The loop driver does not support discard if encryption is enabled, fix the comment. Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7035b5df	16-Nov-2011	Dmitry Monakhov <dmonakhov@openvz.org>	loop: cleanup set_status interface 1) Anyone who has read access to loopdev has permission to call set_status and may change important parameters such as lo_offset, lo_sizelimit and so on, which contradicts to read access pattern and definitely equals to write access pattern. 2) Add lo_offset over i_size check to prevent blkdev_size overflow. ##Testcase_bagin #dd if=/dev/zero of=./file bs=1k count=1 #losetup /dev/loop0 ./file /* userspace_application / struct loop_info64 loinf; fd = open("/dev/loop0", O_RDONLY); ioctl(fd, LOOP_GET_STATUS64, &loinf); / Set offset to any value which is bigger than i_size, and sizelimit * to nonzero value/ loinf.lo_offset = 40961024; loinf.lo_sizelimit = 1024; ioctl(fd, LOOP_SET_STATUS64, &loinf); /* After this loop device will have size similar to 0x7fffffffffxxxx */ #blockdev --getsz /dev/loop0 ##OUTPUT: 36028797018955968 ##Testcase_end [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3bb90682	16-Nov-2011	Dmitry Monakhov <dmonakhov@openvz.org>	loop: prevent information leak after failed read If read was not fully successful we have to fail whole bio to prevent information leak of old pages ##Testcase_begin dd if=/dev/zero of=./file bs=1M count=1 losetup /dev/loop0 ./file -o 4096 truncate -s 0 ./file # OOps loop offset is now beyond i_size, so read will silently fail. # So bio's pages would not be cleared, may which result in information leak. hexdump -C /dev/loop0 ##testcase_end Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 456be148	16-Oct-2011	Christoph Hellwig <hch@infradead.org>	loop: remove the incorrect write_begin/write_end shortcut Currently the loop device tries to call directly into write_begin/write_end instead of going through ->write if it can. This is a fairly nasty shortcut as write_begin and write_end are only callbacks for the generic write code and expect to be called with filesystem specific locks held. This code currently causes various issues for clustered filesystems as it doesn't take the required cluster locks, and it also causes issues for XFS as it doesn't properly lock against the swapext ioctl as called by the defragmentation tools. This in case causes data corruption if defragmentation hits a busy loop device in the wrong time window, as reported by RH QA. The reason why we have this shortcut is that it saves a data copy when doing a transformation on the loop device, which is the technical term for using cryptoloop (or an XOR transformation). Given that cryptoloop has been deprecated in favour of dm-crypt my opinion is that we should simply drop this shortcut instead of finding complicated ways to to introduce a formal interface for this shortcut. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4c823cc3	21-Sep-2011	Ayan George <ayan@ayan.net>	drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd() If the loop device is associated (lo->lo_state == Lo_bound), it will have a valid bdev pointed to by lo->lo_device. There is no reason to ever pass an additional block_device pointer. Signed-off-by: Ayan George <ayan.george@canonical.com> Cc: Phillip Susi <psusi@cfl.rr.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 8a9c5944	21-Sep-2011	Phillip Susi <psusi@cfl.rr.com>	drivers/block/loop.c: emit uevent on auto release The loopback driver failed to emit the change uevent when auto releasing the device. Fixed lo_release() to pass the bdev to loop_clr_fd() so it can emit the event. Signed-off-by: Phillip Susi <psusi@cfl.rr.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ayan George <ayan@ayan.net> Signed-off-by: Andrew Morton <akpm@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 5a7bbad2	11-Sep-2011	Christoph Hellwig <hch@infradead.org>	block: remove support for bio remapping from ->make_request There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# e03c8dd1	23-Aug-2011	Kay Sievers <kay.sievers@vrfy.org>	loop: always allow userspace partitions and optionally support automatic scanning Automatic partition scanning can be requested individually per loop device during its setup by setting LO_FLAGS_PARTSCAN. By default, no partition tables are scanned. Userspace can now always add and remove partitions from all loop devices, regardless if the in-kernel partition scanner is enabled or not. The needed partition minor numbers are allocated from the extended minors space, the main loop device numbers will continue to match the loop minors, regardless of the number of partitions used. # grep . /sys/class/block/loop1/loop/* /sys/block/loop1/loop/autoclear:0 /sys/block/loop1/loop/backing_file:/home/kay/data/stuff/part.img /sys/block/loop1/loop/offset:0 /sys/block/loop1/loop/partscan:1 /sys/block/loop1/loop/sizelimit:0 # ls -l /dev/loop* brw-rw---- 1 root disk 7, 0 Aug 14 20:22 /dev/loop0 brw-rw---- 1 root disk 7, 1 Aug 14 20:23 /dev/loop1 brw-rw---- 1 root disk 259, 0 Aug 14 20:23 /dev/loop1p1 brw-rw---- 1 root disk 259, 1 Aug 14 20:23 /dev/loop1p2 brw-rw---- 1 root disk 7, 99 Aug 14 20:23 /dev/loop99 brw-rw---- 1 root disk 259, 2 Aug 14 20:23 /dev/loop99p1 brw-rw---- 1 root disk 259, 3 Aug 14 20:23 /dev/loop99p2 crw------T 1 root root 10, 237 Aug 14 20:22 /dev/loop-control Cc: Karel Zak <kzak@redhat.com> Cc: Davidlohr Bueso <dave@gnu.org> Acked-By: Tejun Heo <tj@kernel.org> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# dfaa2ef6	19-Aug-2011	Lukas Czerner <lczerner@redhat.com>	loop: add discard support for loop devices This commit adds discard support for loop devices. Discard is usually supported by SSD and thinly provisioned devices as a method for reclaiming unused space. This is no different than trying to reclaim back space which is not used by the file system on the image, but it still occupies space on the host file system. We can do the reclamation on file system which does support hole punching. So when discard request gets to the loop driver we can translate that to punch a hole to the underlying file, hence reclaim the free space. This is very useful for trimming down the size of the image to only what is really used by the file system on that image. Fstrim may be used for that purpose. It has been tested on ext4, xfs and btrfs with the image file systems ext4, ext3, xfs and btrfs. ext4, or ext6 image on ext4 file system has some problems but it seems that ext4 punch hole implementation is somewhat flawed and it is unrelated to this commit. Also this is a very good method of validating file systems punch hole implementation. Note that when encryption is used, discard support is disabled, because using it might leak some information useful for possible attacker. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 05eb0f25	31-Jul-2011	Kay Sievers <kay.sievers@vrfy.org>	loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other LOOP_CLR_FD takes lo->lo_ctl_mutex and tries to remove the loop sysfs files. Sysfs calls show() and waits for lo->lo_ctl_mutex. LOOP_CLR_FD waits for show() to finish to remove the sysfs file. cat /sys/class/block/loop0/loop/backing_file mutex_lock_nested+0x176/0x350 ? loop_attr_do_show_backing_file+0x2f/0xd0 [loop] ? loop_attr_do_show_backing_file+0x2f/0xd0 [loop] loop_attr_do_show_backing_file+0x2f/0xd0 [loop] dev_attr_show+0x1b/0x60 ? sysfs_read_file+0x86/0x1a0 ? __get_free_pages+0x12/0x50 sysfs_read_file+0xaf/0x1a0 ioctl(LOOP_CLR_FD): wait_for_common+0x12c/0x180 ? try_to_wake_up+0x2a0/0x2a0 wait_for_completion+0x18/0x20 sysfs_deactivate+0x178/0x180 ? sysfs_addrm_finish+0x43/0x70 ? sysfs_addrm_start+0x1d/0x20 sysfs_addrm_finish+0x43/0x70 sysfs_hash_and_remove+0x85/0xa0 sysfs_remove_group+0x59/0x100 loop_clr_fd+0x1dc/0x3f0 [loop] lo_ioctl+0x223/0x7a0 [loop] Instead of taking the lo_ctl_mutex from sysfs code, take the inner lo->lo_lock, to protect the access to the backing_file data. Thanks to Tejun for help debugging and finding a solution. Cc: Milan Broz <mbroz@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# d134b00b	31-Jul-2011	Kay Sievers <kay.sievers@vrfy.org>	loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices Instead of unconditionally creating a fixed number of dead loop devices which need to be investigated by storage handling services, even when they are never used, we allow distros start with 0 loop devices and have losetup(8) and similar switch to the dynamic /dev/loop-control interface instead of searching /dev/loop%i for free devices. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 770fe30a	31-Jul-2011	Kay Sievers <kay.sievers@vrfy.org>	loop: add management interface for on-demand device allocation Loop devices today have a fixed pre-allocated number of usually 8. The number can only be changed at module init time. To find a free device to use, /dev/loop%i needs to be scanned, and all devices need to be opened until a free one is possibly found. This adds a new /dev/loop-control device node, that allows to dynamically find or allocate a free device, and to add and remove loop devices from the running system: LOOP_CTL_ADD adds a specific device. Arg is the number of the device. It returns the device i or a negative error code. LOOP_CTL_REMOVE removes a specific device, Arg is the number the device. It returns the device i or a negative error code. LOOP_CTL_GET_FREE finds the next unbound device or allocates a new one. No arg is given. It returns the device i or a negative error code. The loop kernel module gets automatically loaded when /dev/loop-control is accessed the first time. The alias specified in the module, instructs udev to create this 'dead' device node, even when the module is not loaded. Example: cfd = open("/dev/loop-control", O_RDWR); # add a new specific loop device err = ioctl(cfd, LOOP_CTL_ADD, devnr); # remove a specific loop device err = ioctl(cfd, LOOP_CTL_REMOVE, devnr); # find or allocate a free loop device to use devnr = ioctl(cfd, LOOP_CTL_GET_FREE); sprintf(loopname, "/dev/loop%i", devnr); ffd = open("backing-file", O_RDWR); lfd = open(loopname, O_RDWR); err = ioctl(lfd, LOOP_SET_FD, ffd); Cc: Tejun Heo <tj@kernel.org> Cc: Karel Zak <kzak@redhat.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 34dd82af	31-Jul-2011	Kay Sievers <kay.sievers@vrfy.org>	loop: replace linked list of allocated devices with an idr index Replace the linked list, that keeps track of allocated devices, with an idr index to allow a more efficient lookup of devices. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# ac04fee0	26-May-2011	Namhyung Kim <namhyung@gmail.com>	loop: export module parameters Export 'max_loop' and 'max_part' parameters to sysfs so user can know that how many devices are allowed and how many partitions are supported. If 'max_loop' is 0, there is no restriction on the number of loop devices. User can create/use the devices as many as minor numbers available. If 'max_part' is 0, it means simply the device doesn't support partitioning. Also note that 'max_part' can be adjusted to power of 2 minus 1 form if needed. User should check this value after the module loading if he/she want to use that number correctly (i.e. fdisk, mknod, etc.). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# a1c15c59	24-May-2011	Namhyung Kim <namhyung@gmail.com>	loop: handle on-demand devices correctly When finding or allocating a loop device, loop_probe() did not take partition numbers into account so that it can result to a different device. Consider following example: $ sudo modprobe loop max_part=15 $ ls -l /dev/loop* brw-rw---- 1 root disk 7, 0 2011-05-24 22:16 /dev/loop0 brw-rw---- 1 root disk 7, 16 2011-05-24 22:16 /dev/loop1 brw-rw---- 1 root disk 7, 32 2011-05-24 22:16 /dev/loop2 brw-rw---- 1 root disk 7, 48 2011-05-24 22:16 /dev/loop3 brw-rw---- 1 root disk 7, 64 2011-05-24 22:16 /dev/loop4 brw-rw---- 1 root disk 7, 80 2011-05-24 22:16 /dev/loop5 brw-rw---- 1 root disk 7, 96 2011-05-24 22:16 /dev/loop6 brw-rw---- 1 root disk 7, 112 2011-05-24 22:16 /dev/loop7 $ sudo mknod /dev/loop8 b 7 128 $ sudo losetup /dev/loop8 ~/temp/disk-with-3-parts.img $ sudo losetup -a /dev/loop128: [0805]:278201 (/home/namhyung/temp/disk-with-3-parts.img) $ ls -l /dev/loop* brw-rw---- 1 root disk 7, 0 2011-05-24 22:16 /dev/loop0 brw-rw---- 1 root disk 7, 16 2011-05-24 22:16 /dev/loop1 brw-rw---- 1 root disk 7, 2048 2011-05-24 22:18 /dev/loop128 brw-rw---- 1 root disk 7, 2049 2011-05-24 22:18 /dev/loop128p1 brw-rw---- 1 root disk 7, 2050 2011-05-24 22:18 /dev/loop128p2 brw-rw---- 1 root disk 7, 2051 2011-05-24 22:18 /dev/loop128p3 brw-rw---- 1 root disk 7, 32 2011-05-24 22:16 /dev/loop2 brw-rw---- 1 root disk 7, 48 2011-05-24 22:16 /dev/loop3 brw-rw---- 1 root disk 7, 64 2011-05-24 22:16 /dev/loop4 brw-rw---- 1 root disk 7, 80 2011-05-24 22:16 /dev/loop5 brw-rw---- 1 root disk 7, 96 2011-05-24 22:16 /dev/loop6 brw-rw---- 1 root disk 7, 112 2011-05-24 22:16 /dev/loop7 brw-r--r-- 1 root root 7, 128 2011-05-24 22:17 /dev/loop8 After this patch, /dev/loop8 - instead of /dev/loop128 - was accessed correctly. In addition, 'range' passed to blk_register_region() should include all range of dev_t that LOOP_MAJOR can address. It does not need to be limited by partition numbers unless 'max_loop' param was specified. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Laurent Vivier <Laurent.Vivier@bull.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 78f4bb36	24-May-2011	Namhyung Kim <namhyung@gmail.com>	loop: limit 'max_part' module param to DISK_MAX_PARTS The 'max_part' parameter controls the number of maximum partition a loop block device can have. However if a user specifies very large value it would exceed the limitation of device minor number and can cause a kernel panic (or, at least, produce invalid device nodes in some cases). On my desktop system, following command kills the kernel. On qemu, it triggers similar oops but the kernel was alive: $ sudo modprobe loop max_part0000 ------------[ cut here ]------------ kernel BUG at /media/Linux_Data/project/linux/fs/sysfs/group.c:65! invalid opcode: 0000 [#1] SMP last sysfs file: CPU 0 Modules linked in: loop(+) Pid: 43, comm: insmod Tainted: G W 2.6.39-qemu+ #155 Bochs Bochs RIP: 0010:[<ffffffff8113ce61>] [<ffffffff8113ce61>] internal_create_group= +0x2a/0x170 RSP: 0018:ffff880007b3fde8 EFLAGS: 00000246 RAX: 00000000ffffffef RBX: ffff880007b3d878 RCX: 00000000000007b4 RDX: ffffffff8152da50 RSI: 0000000000000000 RDI: ffff880007b3d878 RBP: ffff880007b3fe38 R08: ffff880007b3fde8 R09: 0000000000000000 R10: ffff88000783b4a8 R11: ffff880007b3d878 R12: ffffffff8152da50 R13: ffff880007b3d868 R14: 0000000000000000 R15: ffff880007b3d800 FS: 0000000002137880(0063) GS:ffff880007c00000(0000) knlGS:00000000000000= 00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000422680 CR3: 0000000007b50000 CR4: 00000000000006b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000 Process insmod (pid: 43, threadinfo ffff880007b3e000, task ffff880007afb9c= 0) Stack: ffff880007b3fe58 ffffffff811e66dd ffff880007b3fe58 ffffffff811e570b 0000000000000010 ffff880007b3d800 ffff880007a7b390 ffff880007b3d868 0000000000400920 ffff880007b3d800 ffff880007b3fe48 ffffffff8113cfc8 Call Trace: [<ffffffff811e66dd>] ? device_add+0x4bc/0x5af [<ffffffff811e570b>] ? dev_set_name+0x3c/0x3e [<ffffffff8113cfc8>] sysfs_create_group+0xe/0x12 [<ffffffff810b420e>] blk_trace_init_sysfs+0x14/0x16 [<ffffffff8116a090>] blk_register_queue+0x47/0xf7 [<ffffffff8116f527>] add_disk+0xdf/0x290 [<ffffffffa00060eb>] loop_init+0xeb/0x1b8 [loop] [<ffffffffa0006000>] ? 0xffffffffa0005fff [<ffffffff8100020a>] do_one_initcall+0x7a/0x12e [<ffffffff81096804>] sys_init_module+0x9c/0x1e0 [<ffffffff813329bb>] system_call_fastpath+0x16/0x1b Code: c3 55 48 89 e5 41 57 41 56 41 89 f6 41 55 41 54 49 89 d4 53 48 89 fb= 48 83 ec 28 48 85 ff 74 0b 85 f6 75 0b 48 83 7f 30 00 75 14 <0f> 0b eb fe = 48 83 7f 30 00 b9 ea ff ff ff 0f 84 18 01 00 00 49 RIP [<ffffffff8113ce61>] internal_create_group+0x2a/0x170 RSP <ffff880007b3fde8> ---[ end trace a123eb592043acad ]--- Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Laurent Vivier <Laurent.Vivier@bull.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 7eaceacc	10-Mar-2011	Jens Axboe <jaxboe@fusionio.com>	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# fd51469f	03-Mar-2011	Petr Uzel <petr.uzel@suse.cz>	block: kill loop_mutex Following steps lead to deadlock in kernel: dd if=/dev/zero of=img bs=512 count=1000 losetup -f img mkfs.ext2 /dev/loop0 mount -t ext2 -o loop /dev/loop0 mnt umount mnt/ Stacktrace: [<c102ec04>] irq_exit+0x36/0x59 [<c101502c>] smp_apic_timer_interrupt+0x6b/0x75 [<c127f639>] apic_timer_interrupt+0x31/0x38 [<c101df88>] mutex_spin_on_owner+0x54/0x5b [<fe2250e9>] lo_release+0x12/0x67 [loop] [<c10c4eae>] __blkdev_put+0x7c/0x10c [<c10a4da5>] fput+0xd5/0x1aa [<fe2250cf>] loop_clr_fd+0x1a9/0x1b1 [loop] [<fe225110>] lo_release+0x39/0x67 [loop] [<c10c4eae>] __blkdev_put+0x7c/0x10c [<c10a59d9>] deactivate_locked_super+0x17/0x36 [<c10b6f37>] sys_umount+0x27e/0x2a5 [<c10b6f69>] sys_oldumount+0xb/0xe [<c1002897>] sysenter_do_call+0x12/0x26 [<ffffffff>] 0xffffffff Regression since 2a48fc0ab24241755dc9, which introduced the private loop_mutex as part of the BKL removal process. As per [1], the mutex can be safely removed. [1] http://www.gossamer-threads.com/lists/linux/kernel/1341930 Addresses: https://bugzilla.novell.com/show_bug.cgi?id=669394 Addresses: https://bugzilla.kernel.org/show_bug.cgi?id=29172 Signed-off-by: Petr Uzel <petr.uzel@suse.cz> Cc: stable@kernel.org Reviewed-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# cd25f549	02-Mar-2011	Vivek Goyal <vgoyal@redhat.com>	loop: No need to initialize ->queue_lock explicitly before calling blk_cleanup_queue() Now we initialize ->queue_lock at queue allocation time so driver does not have to worry about initializing it before calling blk_cleanup_queue(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# ee71a968	19-Jan-2011	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>	loop: queue_lock NULL pointer derefence in blk_throtl_exit Performing $ sudo mount -o loop -o umask=0 /dev/sdb1 /mnt/ mount: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg \| tail or so $ sudo modprobe -r loop results in oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [<ffffffff812479d4>] do_raw_spin_lock+0x14/0x122 Process modprobe (pid: 6189, threadinfo ffff88009a898000, task ffff880154a88000) Call Trace: [<ffffffff81486788>] _raw_spin_lock_irq+0x4a/0x51 [<ffffffff8123404b>] ? blk_throtl_exit+0x3b/0xa0 [<ffffffff8105b120>] ? cancel_delayed_work_sync+0xd/0xf [<ffffffff8123404b>] blk_throtl_exit+0x3b/0xa0 [<ffffffff81229bc8>] blk_release_queue+0x21/0x65 [<ffffffff8123bb06>] kobject_release+0x51/0x66 [<ffffffff8123bab5>] ? kobject_release+0x0/0x66 [<ffffffff8123ce1e>] kref_put+0x43/0x4d [<ffffffff8123ba27>] kobject_put+0x47/0x4b [<ffffffff8122717c>] blk_cleanup_queue+0x56/0x5b [<ffffffffa01c3824>] loop_exit+0x68/0x844 [loop] [<ffffffff8107cccc>] sys_delete_module+0x1e8/0x25b [<ffffffff814864c9>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81002112>] system_call_fastpath+0x16/0x1b because of an attempt to acquire NULL queue_lock. I added the same lines as in blk_queue_make_request - index 44e18c0..49e6a54 100644`fall back to embedded per-queue lock'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 3603b8ea	20-Dec-2010	Jens Axboe <jaxboe@fusionio.com>	Fix compile warnings due to missing removal of a 'ret' variable Commit a8adbe3 forgot to remove the return variable, kill it. drivers/block/loop.c: In function 'lo_splice_actor': drivers/block/loop.c:398: warning: unused variable 'ret' [...] fs/nfsd/vfs.c: In function 'nfsd_splice_actor': fs/nfsd/vfs.c:848: warning: unused variable 'ret' Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# a8adbe37	17-Dec-2010	Michał Mirosław <mirq-linux@rere.qmqm.pl>	fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors This patch pulls calls to buf->ops->confirm() from all actors passed (also indirectly) to splice_from_pipe_feed(). Is avoiding the call to buf->ops->confirm() while splice()ing to /dev/null is an intentional optimization? No other user does that and this will remove this special case. Against current linux.git 6313e3c21743cc88bb5bd8aa72948ee1e83937b6. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 02e031cb	10-Nov-2010	Christoph Hellwig <hch@lst.de>	block: remove REQ_HARDBARRIER REQ_HARDBARRIER is dead now, so remove the leftovers. What's left at this point is: - various checks inside the block layer. - sanity checks in bio based drivers. - now unused bio_empty_barrier helper. - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while, but Xen really needs to sort out it's barrier situaton. - setting of ordered tags in uas - dead code copied from old scsi drivers. - scsi different retry for barriers - it's dead and should have been removed when flushes were converted to FS requests. - blktrace handling of barriers - removed. Someone who knows blktrace better should add support for REQ_FLUSH and REQ_FUA, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 51a0bb0c	27-Oct-2010	Milan Broz <mbroz@redhat.com>	loop: Properly clear sysfs in autoclear mode In autoclear mode bdev is NULL but the sysfs entry should be destroyed otherwise this warning appears: WARNING: at fs/sysfs/dir.c:451 sysfs_add_one+0x82/0x95() sysfs: cannot create duplicate filename '/devices/virtual/block/loop0/loop' Fixes commit ee86273062cbb310665fe49e1f1937d2cf85b0b9 Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 61ecdb80	26-Oct-2010	Peter Zijlstra <a.p.zijlstra@chello.nl>	mm: strictly nested kmap_atomic() Ensure kmap_atomic() usage is strictly nested Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 2a48fc0a	02-Jun-2010	Arnd Bergmann <arnd@arndb.de>	block: autoconvert trivial BKL users to private mutex The block device drivers have all gained new lock_kernel calls from a recent pushdown, and some of the drivers were already using the BKL before. This turns the BKL into a set of per-driver mutexes. Still need to check whether this is safe to do. file=$1 name=$2 if grep -q lock_kernel ${file} ; then if grep -q 'include.linux.mutex.h' ${file} ; then sed -i '/include.<linux\/smp_lock.h>/d' ${file} else sed -i 's/include.<linux\/smp_lock.h>.$/include <linux\/mutex.h>/g' ${file} fi sed -i ${file} \ -e "/^#include.linux.mutex.h/,$ { 1,/^$static\\|int\\|long$/ { /^$static\\|int\\|long$/istatic DEFINE_MUTEX(${name}_mutex); } }" \ -e "s/$un$lock_kernel\>[ ]()/mutex_\1lock(\&${name}_mutex)/g" \ -e '/[ ]cycle_kernel_lock();/d' else sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \ -e '/cycle_kernel_lock()/d' fi Signed-off-by: Arnd Bergmann <arnd@arndb.de>
# 6259f284	03-Sep-2010	Tejun Heo <tj@kernel.org>	block/loop: implement REQ_FLUSH/FUA support Deprecate REQ_HARDBARRIER and implement REQ_FLUSH/FUA instead. Also, instead of checking file->f_op->fsync() directly, look at the value of vfs_fsync() and ignore -EINVAL return. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 4913efe4	03-Sep-2010	Tejun Heo <tj@kernel.org>	block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler blk_queue_flush(). blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a device has write cache and can flush it, it should set REQ_FLUSH. If the device can handle FUA writes, it should also set REQ_FUA. All blk_queue_ordered() users are converted. * ORDERED_DRAIN is mapped to 0 which is the default value. * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH. * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH \| REQ_FUA. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Boaz Harrosh <bharrosh@panasas.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Cc: David S. Miller <davem@davemloft.net> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Pierre Ossman <drzeus@drzeus.cx> Cc: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 589d7ed0	03-Sep-2010	Tejun Heo <tj@kernel.org>	block/loop: queue ordered mode should be DRAIN_FLUSH loop implements FLUSH using fsync but was incorrectly setting its ordered mode to DRAIN. Change it to DRAIN_FLUSH. In practice, this doesn't change anything as loop doesn't make use of the block layer ordered implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# ee862730	23-Aug-2010	Milan Broz <mbroz@redhat.com>	loop: add some basic read-only sysfs attributes Create /sys/block/loopX/loop directory and provide these attributes: - backing_file - autoclear - offset - sizelimit This loop directory is present only if loop device is configured. To be used in util-linux-ng (and possibly elsewhere like udev rules) where code need to get loop attributes from kernel (and not store duplicate info in userspace). Moreover loop ioctls are not even able to provide full backing file info because of buffer limits. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 5e00d1b5	12-Aug-2010	Jiri Slaby <jirislaby@kernel.org>	BLOCK: fix bio.bi_rw handling Return of the bi_rw tests is no longer bool after commit 74450be1. But results of such tests are stored in bools. This doesn't fit in there for some compilers (gcc 4.5 here), so either use !! magic to get real bools or use ulong where the result is assigned somewhere. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 6e9624b8	07-Aug-2010	Arnd Bergmann <arnd@arndb.de>	block: push down BKL into .open and .release The open and release block_device_operations are currently called with the BKL held. In order to change that, we must first make sure that all drivers that currently rely on this have no regressions. This blindly pushes the BKL into all .open and .release operations for all block drivers to prepare for the next step. The drivers can subsequently replace the BKL with their own locks or remove it completely when it can be shown that it is not needed. The functions blkdev_get and blkdev_put are the only remaining users of the big kernel lock in the block layer, besides a few uses in the ioctl code, none of which need to serialize with blkdev_{get,put}. Most of these two functions is also under the protection of bdev->bd_mutex, including the actual calls to ->open and ->release, and the common code does not access any global data structures that need the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 00fff265	03-Jul-2010	FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>	block: remove q->prepare_flush_fn completely This removes q->prepare_flush_fn completely (changes the blk_queue_ordered API). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 7b6d91da	07-Aug-2010	Christoph Hellwig <hch@lst.de>	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
# 8018ab05	22-Mar-2010	Christoph Hellwig <hch@lst.de>	sanitize vfs_fsync calling conventions Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsynmc_range. The next step will be removig the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# c3473c63	03-May-2010	David Zeuthen <davidz@redhat.com>	generate "change" uevent for loop device Recent udev versions probe loop devices for filesystems meaning that the /dev/disk hierarchy may contain useful entries such as $ ls -l /dev/disk/by-label/Fedora-12-x86_64-Live lrwxrwxrwx 1 root root 11 Mar 11 13:41 /dev/disk/by-label/Fedora-12-x86_64-Live -> ../../loop0 Unfortunately, no "change" uevent is generated when the loop device is detached so the symlink persists. Additionally, no "change" uevent is guaranteed to be generated when attaching an fd or changing capacity. For example, user space could open the loop device O_RDONLY (in fact, recent util-linux-ng does this) so udev's OPTIONS+="watch" machinery may not trigger the "change" uevent. This patch ensures that the "change" uevent is generated in all of these cases. As a result, the /dev/disk hierarchy works as expected for loop devices. Signed-off-by: David Zeuthen <davidz@redhat.com> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
# 02246c41	08-Apr-2010	Nikanth Karthikesan <knikanth@suse.de>	loop: Update mtime when writing using aops Update mtime when writing to backing filesystem using the address space operations write_begin and write_end. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 5a0e3ad6	24-Mar-2010	Tejun Heo <tj@kernel.org>	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
# cf6e6932	26-Oct-2009	Alexey Dobriyan <adobriyan@gmail.com>	loop: fix NULL dereference if mount fails Commit bb21488482bd36eae6b30b014d93619063773fd4 ("[PATCH] switch loop") started to pass NULL bdev to ioctl hook. Steps to reproduce: [boot with loop.max_part=1] [mount -o loop something so mount fails] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8 IP: [<ffffffff811486ee>] blkdev_ioctl+0x2e/0xa30 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:35/ACPI0003:00/power_supply/ACAD/online CPU 0 Modules linked in: zfs nvidia(P) [last unloaded: zfs] Pid: 15177, comm: mount Tainted: P 2.6.32-rc4-zfs #2 Satellite X200 RIP: 0010:[<ffffffff811486ee>] [<ffffffff811486ee>] blkdev_ioctl+0x2e/0xa30 RSP: 0018:ffff88003b3d5bb8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 000000000000125f RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88003b3d5ce8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffffffff000 R13: 0000000000000000 R14: ffff880071cef280 R15: 00000000000200da FS: 00007fd77cfe7740(0000) GS:ffff880001600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000000b8 CR3: 0000000001001000 CR4: 00000000000026f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mount (pid: 15177, threadinfo ffff88003b3d4000, task ffff88007572f920) Stack: ffff88003b3d5c38 ffffffff812f95f5 ffff88007eeb6600 0000000000000000 <0> 0000000000000000 ffff88003b3d5c18 ffffffff811547d9 ffff88001bf11ef0 <0> 7fffffffffffffff ffff88001bf11ee8 ffff88001bf11ef0 0000000000000000 Call Trace: [<ffffffff812f95f5>] ? schedule_timeout+0x1f5/0x250 [<ffffffff811547d9>] ? rb_insert_color+0x109/0x140 [<ffffffff812fb754>] ? _spin_unlock_irq+0x14/0x40 [<ffffffff812f84c6>] ? wait_for_common+0x66/0x170 [<ffffffff8105a280>] ? default_wake_function+0x0/0x10 [<ffffffff810f8258>] ioctl_by_bdev+0x38/0x50 [<ffffffff811d2481>] loop_clr_fd+0x1e1/0x210 [<ffffffff811d2522>] lo_release+0x72/0x80 [<ffffffff810f934c>] __blkdev_put+0x1ac/0x1d0 [<ffffffff810f937b>] blkdev_put+0xb/0x10 [<ffffffff810f93b9>] blkdev_close+0x39/0x60 [<ffffffff810ccef3>] __fput+0xd3/0x230 [<ffffffff810cd06d>] fput+0x1d/0x30 [<ffffffff810c9680>] filp_close+0x50/0x80 [<ffffffff81061f11>] put_files_struct+0x81/0x100 [<ffffffff81061fde>] exit_files+0x4e/0x60 [<ffffffff81063ec5>] do_exit+0x6b5/0x730 [<ffffffff8107b279>] ? up_read+0x9/0x10 [<ffffffff8104c86e>] ? do_page_fault+0x18e/0x2a0 [<ffffffff81063f81>] do_group_exit+0x41/0xc0 [<ffffffff81064012>] sys_exit_group+0x12/0x20 [<ffffffff81030deb>] system_call_fastpath+0x16/0x1b Code: f8 48 89 e5 48 81 ec 30 01 00 00 48 89 5d d8 4c 89 6d e8 4c 89 65 e0 4c 89 75 f0 4c 89 7d f8 48 89 bd e8 fe ff ff 49 89 cd 89 f3 <49> 8b 88 b8 00 00 00 81 fa 68 12 00 00 0f 84 57 05 00 00 0f 86 RIP [<ffffffff811486ee>] blkdev_ioctl+0x2e/0xa30 RSP <ffff88003b3d5bb8> CR2: 00000000000000b8 ---[ end trace c0b4d3c3118d1427 ]--- Fixing recursive fault but reboot is needed! Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 83d5cde4	21-Sep-2009	Alexey Dobriyan <adobriyan@gmail.com>	const: make block_device_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 1f98a13f	11-Sep-2009	Jens Axboe <jens.axboe@oracle.com>	bio: first step in sanitizing the bio->bi_rw flag testing Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 405f5571	11-Jul-2009	Alexey Dobriyan <adobriyan@gmail.com>	headers: smp_lock.h redux * Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 6818173b	07-May-2009	Miklos Szeredi <miklos@szeredi.hu>	splice: implement default splice_read method If f_op->splice_read() is not implemented, fall back to a plain read. Use vfs_readv() to read into previously allocated pages. This will allow splice and functions using splice, such as the loop device, to work on all filesystems. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# e686307f	17-Apr-2009	Akinobu Mita <akinobu.mita@gmail.com>	loop: use BIO list management functions Now that the bio list management stuff is generic, convert loop to use bio lists instead of its own private bio list implementation. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# ffcd7dca	07-Apr-2009	Alexander Beregalov <a.beregalov@gmail.com>	loop: mutex already unlocked in loop_clr_fd() mount/1865 is trying to release lock (&lo->lo_ctl_mutex) at: but there are no more locks to release! mutex is already unlocked in loop_clr_fd(), we should not try to unlock it in lo_release() again. Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 53d66608	31-Mar-2009	J. R. Okajima <hooanon05@yahoo.co.jp>	loop: add ioctl to resize a loop device Add the ability to 'resize' the loop device on the fly. One practical application is a loop file with XFS filesystem, already mounted: You can easily enlarge the file (append some bytes) and then call ioctl(fd, LOOP_SET_CAPACITY, new); The loop driver will learn about the new size and you can use xfs_growfs later on, which will allow you to use full capacity of the loop file without the need to unmount. Test app: #include <linux/fs.h> #include <linux/loop.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <assert.h> #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #define _GNU_SOURCE #include <getopt.h> char me; void usage(FILE f) { fprintf(f, "%s [options] loop_dev [backend_file]\n" "-s, --set new_size_in_bytes\n" "\twhen backend_file is given, " "it will be expanded too while keeping the original contents\n", me); } struct option opts[] = { { .name = "set", .has_arg = 1, .flag = NULL, .val = 's' }, { .name = "help", .has_arg = 0, .flag = NULL, .val = 'h' } }; void err_size(char name, __u64 old) { fprintf(stderr, "size must be larger than current %s (%llu)\n", name, old); } int main(int argc, char argv[]) { int fd, err, c, i, bfd; ssize_t ssz; size_t sz; __u64 old, new, append; char a[BUFSIZ]; struct stat st; FILE out; char backend, *dev; err = EINVAL; out = stderr; me = argv[0]; new = 0; while ((c = getopt_long(argc, argv, "s:h", opts, &i)) != -1) { switch (c) { case 's': errno = 0; new = strtoull(optarg, NULL, 0); if (errno) { err = errno; perror(argv[i]); goto out; } break; case 'h': err = 0; out = stdout; goto err; default: perror(argv[i]); goto err; } } if (optind < argc) dev = argv[optind++]; else goto err; fd = open(dev, O_RDONLY); if (fd < 0) { err = errno; perror(dev); goto out; } err = ioctl(fd, BLKGETSIZE64, &old); if (err) { err = errno; perror("ioctl BLKGETSIZE64"); goto out; } if (!new) { printf("%llu\n", old); goto out; } if (new < old) { err = EINVAL; err_size(dev, old); goto out; } if (optind < argc) { backend = argv[optind++]; bfd = open(backend, O_WRONLY\|O_APPEND); if (bfd < 0) { err = errno; perror(backend); goto out; } err = fstat(bfd, &st); if (err) { err = errno; perror(backend); goto out; } if (new < st.st_size) { err = EINVAL; err_size(backend, st.st_size); goto out; } append = new - st.st_size; sz = sizeof(a); while (append > 0) { if (append < sz) sz = append; ssz = write(bfd, a, sz); if (ssz != sz) { err = errno; perror(backend); goto out; } append -= sz; } err = fsync(bfd); if (err) { err = errno; perror(backend); goto out; } } err = ioctl(fd, LOOP_SET_CAPACITY, new); if (err) { err = errno; perror("ioctl LOOP_SET_CAPACITY"); } goto out; err: usage(out); out: return err; } Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Signed-off-by: Tomas Matejicek <tomas@slax.org> Cc: <util-linux-ng@vger.kernel.org> Cc: Karel Zak <kzak@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# f028f3b2	23-Mar-2009	Nikanth Karthikesan <knikanth@suse.de>	loop: fix circular locking in loop_clr_fd() With CONFIG_PROVE_LOCKING enabled $ losetup /dev/loop0 file $ losetup -o 32256 /dev/loop1 /dev/loop0 $ losetup -d /dev/loop1 $ losetup -d /dev/loop0 triggers a [ INFO: possible circular locking dependency detected ] I think this warning is a false positive. Open/close on a loop device acquires bd_mutex of the device before acquiring lo_ctl_mutex of the same device. For ioctl(LOOP_CLR_FD) after acquiring lo_ctl_mutex, fput on the backing_file might acquire the bd_mutex of a device, if backing file is a device and this is the last reference to the file being dropped . But it is guaranteed that it is impossible to have a circular list of backing devices.(say loop2->loop1->loop0->loop2 is not possible), which guarantees that this can never deadlock. So this warning should be suppressed. It is very difficult to annotate lockdep not to warn here in the correct way. A simple way to silence lockdep could be to mark the lo_ctl_mutex in ioctl to be a sub class, but this might mask some other real bugs. @@ -1164,7 +1164,7 @@ static int lo_ioctl(struct block_device bdev, fmode_t mode, struct loop_device lo = bdev->bd_disk->private_data; int err; - mutex_lock(&lo->lo_ctl_mutex); + mutex_lock_nested(&lo->lo_ctl_mutex, 1); switch (cmd) { case LOOP_SET_FD: err = loop_set_fd(lo, mode, bdev, arg); Or actually marking the bd_mutex after lo_ctl_mutex as a sub class could be a better solution. Luckily it is easy to avoid calling fput on backing file with lo_ctl_mutex held, so no lockdep annotation is required. If you do not like the special handling of the lo_ctl_mutex just for the LOOP_CLR_FD ioctl in lo_ioctl(), the mutex handling could be moved inside each of the individual ioctl handlers and I could send you another patch. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 68db1961	23-Mar-2009	Nikanth Karthikesan <knikanth@suse.de>	loop: support barrier writes Honour barrier requests in the loop back block device driver. In case of barrier bios, flush the backing file once before processing the barrier and once after to guarantee ordering. In case of filesystems that does not support fsync, barrier bios would be failed with -EOPNOTSUPP. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# a3941ec1	05-Mar-2009	Roel Kluin <roel.kluin@gmail.com>	loop: don't increment p->offset with (size_t) -EINVAL Upon a 'transfer error block' size is set to -EINVAL, but this becomes positive since size is unsigned: p->offset still gets incremented. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 8ae30b89	12-Dec-2008	Milan Broz <mbroz@redhat.com>	loop: Do not call loop_unplug for not configured loop device. In loop_unplug() function is expected that mapping is set and lo->lo_backing_file is not NULL. Unfortunately loop_set_fd() set the request queue unplug function, but loop_clr_fd() doesn't clear that. Loop device allows open of non-configured loop in some situations. If the unplug on request queue is called, loop module oopses because of missing lo_backing_file. Simple reproducer: losetup /dev/loop0 /xxx losetup -d /dev/loop0 dmsetup create x --table "0 1 linear /dev/loop0 0" EIP is at loop_unplug+0x1d/0x3b ... Call Trace: blk_unplug+0x57/0x5e dm_table_unplug_all+0x34/0x77 [dm_mod] destroy_inode+0x27/0x38 generic_delete_inode+0xd5/0xd9 iput+0x4b/0x4e dm_resume+0xca/0xfe [dm_mod] dev_suspend+0x143/0x165 [dm_mod] dm_ctl_ioctl+0x18e/0x1cf [dm_mod] dev_suspend+0x0/0x165 [dm_mod] dm_ctl_ioctl+0x0/0x1cf [dm_mod] vfs_ioctl+0x22/0x69 do_vfs_ioctl+0x39d/0x3c7 trace_hardirqs_on+0xb/0xd remove_vma+0x50/0x56 do_munmap+0x21c/0x237 sys_ioctl+0x2c/0x45 sysenter_do_call+0x12/0x31 Several reports here http://www.kerneloops.org/search.php?search=loop_unplug Fix it by simply clear unplug function together with removing of backing file. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 14f27939	12-Dec-2008	Milan Broz <mbroz@redhat.com>	loop: Flush possible running bios when loop device is released. When there are still queued bios and reference count drops to zero, loop device must flush all queued bios. Otherwise it can lead to situation that caller closes the device, but some bios are still running and endio() function call later OOpses when uses unallocated mempool. This happens for example when running dm-crypt over loop, here is typical oops backtrace: Oops: 0000 [#1] PREEMPT SMP EIP is at mempool_free+0x12/0x6b ... crypt_dec_pending+0x50/0x54 [dm_crypt] crypt_endio+0x9f/0xa7 [dm_crypt] crypt_endio+0x0/0xa7 [dm_crypt] bio_endio+0x2b/0x2e loop_thread+0x37a/0x3b1 do_lo_send_aops+0x0/0x165 autoremove_wake_function+0x0/0x33 loop_thread+0x0/0x3b1 kthread+0x3b/0x61 kthread+0x0/0x61 kernel_thread_helper+0x7/0x10 (But crash is reproducible with different dm targets running over loop device too.) Patch fixes it by flushing the bios in release call, reusing the flush mechanism for switching backing store. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# b0fafa81	13-Nov-2008	David Howells <dhowells@redhat.com>	CRED: Wrap task credential accesses in the block loopback driver Wrap access to task credentials so that they can be separated more easily from the task_struct during the introduction of COW creds. Change most current->(\|e\|s\|fs)[ug]id to current_(\|e\|s\|fs)[ug]id(). Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more sense to use RCU directly rather than a convenient wrapper; these will be addressed by later patches. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: James Morris <jmorris@namei.org>
# 4e02ed4b	29-Oct-2008	Nick Piggin <npiggin@suse.de>	fs: remove prepare_write/commit_write Nothing uses prepare_write or commit_write. Remove them from the tree completely. [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 511de73f	07-Oct-2007	Al Viro <viro@zeniv.linux.org.uk>	[PATCH] kill the unused bsize on the send side of /dev/loop Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# bb214884	02-Mar-2008	Al Viro <viro@zeniv.linux.org.uk>	[PATCH] switch loop ioctl doesn't need BKL here Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# d4430d62	02-Mar-2008	Al Viro <viro@zeniv.linux.org.uk>	[PATCH] beginning of methods conversion To keep the size of changesets sane we split the switch by drivers; to keep the damn thing bisectable we do the following: 1) rename the affected methods, add ones with correct prototypes, make (few) callers handle both. That's this changeset. 2) for each driver convert to new methods. ALL drivers are converted in this series. 3) kill the old (renamed) methods. Note that it _is_ a flagday; all in-tree drivers are converted and by the end of this series no trace of old methods remain. The only reason why we do that this way is to keep the damn thing bisectable and allow per-driver debugging if anything goes wrong. New methods: open(bdev, mode) release(disk, mode) ioctl(bdev, mode, cmd, arg) /* Called without BKL / compat_ioctl(bdev, mode, cmd, arg) locked_ioctl(bdev, mode, cmd, arg) / Called with BKL, legacy */ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 75ad23bc	29-Apr-2008	Nick Piggin <npiggin@suse.de>	block: make queue flags non-atomic We can save some atomic ops in the IO path, if we clearly define the rules of how to modify the queue flags. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 476a4813	25-Mar-2008	Laurent Vivier <Laurent.Vivier@bull.net>	loop: manage partitions in disk image This patch allows to use loop device with partitionned disk image. Original behavior of loop is not modified. A new parameter is introduced to define how many partition we want to be able to manage per loop device. This parameter is "max_part". For instance, to manage 63 partitions / loop device, we will do: # modprobe loop max_part=63 # ls -l /dev/loop?* brw-rw---- 1 root disk 7, 0 2008-03-05 14:55 /dev/loop0 brw-rw---- 1 root disk 7, 64 2008-03-05 14:55 /dev/loop1 brw-rw---- 1 root disk 7, 128 2008-03-05 14:55 /dev/loop2 brw-rw---- 1 root disk 7, 192 2008-03-05 14:55 /dev/loop3 brw-rw---- 1 root disk 7, 256 2008-03-05 14:55 /dev/loop4 brw-rw---- 1 root disk 7, 320 2008-03-05 14:55 /dev/loop5 brw-rw---- 1 root disk 7, 384 2008-03-05 14:55 /dev/loop6 brw-rw---- 1 root disk 7, 448 2008-03-05 14:55 /dev/loop7 And to attach a raw partitionned disk image, the original losetup is used: # losetup -f etch.img # ls -l /dev/loop?* brw-rw---- 1 root disk 7, 0 2008-03-05 14:55 /dev/loop0 brw-rw---- 1 root disk 7, 1 2008-03-05 14:57 /dev/loop0p1 brw-rw---- 1 root disk 7, 2 2008-03-05 14:57 /dev/loop0p2 brw-rw---- 1 root disk 7, 5 2008-03-05 14:57 /dev/loop0p5 brw-rw---- 1 root disk 7, 64 2008-03-05 14:55 /dev/loop1 brw-rw---- 1 root disk 7, 128 2008-03-05 14:55 /dev/loop2 brw-rw---- 1 root disk 7, 192 2008-03-05 14:55 /dev/loop3 brw-rw---- 1 root disk 7, 256 2008-03-05 14:55 /dev/loop4 brw-rw---- 1 root disk 7, 320 2008-03-05 14:55 /dev/loop5 brw-rw---- 1 root disk 7, 384 2008-03-05 14:55 /dev/loop6 brw-rw---- 1 root disk 7, 448 2008-03-05 14:55 /dev/loop7 # mount /dev/loop0p1 /mnt # ls /mnt bench cdrom home lib mnt root srv usr bin dev initrd lost+found opt sbin sys var boot etc initrd.img media proc selinux tmp vmlinuz # umount /mnt # losetup -d /dev/loop0 Of course, the same behavior can be done using kpartx on a loop device, but modifying loop avoids to stack several layers of block device (loop + device mapper), this is a very light modification (40% of modifications are to manage the new parameter). Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 96c58655	06-Feb-2008	David Woodhouse <dwmw2@infradead.org>	Allow auto-destruction of loop devices This allows a flag to be set on loop devices so that when they are closed for the last time, they'll self-destruct. In general, so that we can automatically allocate loop devices (as with losetup -f) and have them disappear when we're done with them. In particular, right now, so that we can stop relying on the hackish special-case in umount(8) which kills off loop devices which were set up by 'mount -oloop'. That means we can stop putting crap in /etc/mtab which doesn't belong there, which means it can be a symlink to /proc/mounts, which means yet another writable file on the root filesystem is eliminated and the 'stateless' folks get happier... and OLPC trac #356 can be closed. The mount(8) side of that is at http://marc.info/?l=util-linux-ng&m=119362955431694&w=2 [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: David Woodhouse <dwmw2@infradead.org> Cc: Bernardo Innocenti <bernie@codewiz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# a24eab1e	11-Jan-2008	Jens Axboe <jens.axboe@oracle.com>	loop: fix bad bio_alloc() nr_iovec request Don't allocate room for an iovec when it is not needed. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 96de0e25	19-Oct-2007	Jan Engelhardt <jengelh@gmx.de>	Convert files to UTF-8 and some cleanups * Convert files to UTF-8. * Also correct some people's names (one example is Eißfeldt, which was found in a source file. Given that the author used an ß at all in a source file indicates that the real name has in fact a 'ß' and not an 'ss', which is commonly used as a substitute for 'ß' when limited to 7bit.) * Correct town names (Goettingen -> Göttingen) * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313) Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
# 759d7c6c	17-Oct-2007	Diego Woitasen <diego@woitasen.com.ar>	Remove unneeded lock_kernel() in driver/block/loop.c Signed-off-by: Diego Woitasen <diego@woitasen.com.ar> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 8268f5a7	16-Oct-2007	Dmitry Monakhov <dmonakhov@sw.ru>	deny partial write for loop dev fd Partial write can be easily supported by LO_CRYPT_NONE mode, but it is not easy in LO_CRYPT_CRYPTOAPI case, because of its block nature. I don't know who still used cryptoapi, but theoretically it is possible. So let's leave things as they are. Loop device doesn't support partial write before Nick's "write_begin/write_end" patch set, and let's it behave the same way after. Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# afddba49	16-Oct-2007	Nick Piggin <npiggin@suse.de>	fs: introduce write_begin, write_end, and perform_write aops These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do). [mark.fasheh@oracle.com: API design contributions, code review and fixes] [akpm@linux-foundation.org: various fixes] [dmonakhov@sw.ru: new aop block_write_begin fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 6712ecf8	26-Sep-2007	NeilBrown <neilb@suse.de>	Drop 'size' argument from bio_endio and bi_end_io As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 165125e1	24-Jul-2007	Jens Axboe <jens.axboe@oracle.com>	[BLOCK] Get rid of request_queue_t typedef Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweet of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# 00d59405	17-Jul-2007	Akinobu Mita <akinobu.mita@gmail.com>	unregister_blkdev() delete redundant messages in callers No need to warn unregister_blkdev() failure by the callers. (The previous patch makes unregister_blkdev() print error message in error case) Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 83144186	17-Jul-2007	Rafael J. Wysocki <rjw@rjwysocki.net>	Freezer: make kernel threads nonfreezable by default Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# cac36bb0	14-Jun-2007	Jens Axboe <jens.axboe@oracle.com>	pipe: change the ->pin() operation to ->confirm() The name 'pin' was badly chosen, it doesn't pin a pipe buffer in the most commonly used sense in the kernel. So change the name to 'confirm', after debating this issue with Hugh Dickins a bit. A good return from ->confirm() means that the buffer is really there, and that the contents are good. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# d6b29d7c	04-Jun-2007	Jens Axboe <jens.axboe@oracle.com>	splice: divorce the splice structure/function definitions from the pipe header We need to move even more stuff into the header so that folks can use the splice_to_pipe() implementation instead of open-coding a lot of pipe knowledge (see relay implementation), so move to our own header file finally. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# fd582140	12-Jun-2007	Jens Axboe <jens.axboe@oracle.com>	loop: convert to using splice_direct_to_actor() instead of sendfile() This gets rid of the dependency on ->sendfile() for receiving data and converts loop to ->splice_read() instead. Also includes an IV offset fix from Hugh Dickins. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
# a47653fc	08-Jun-2007	Ken Chen <kenchen@google.com>	loop: preallocate eight loop devices The kernel on-demand loop device instantiation breaks several user space tools as the tools are not ready to cope with the "on-demand feature". Fix it by instantiate default 8 loop devices and also reinstate max_loop module parameter. Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 705962cc	13-May-2007	Al Viro <viro@zeniv.linux.org.uk>	fix deadlock in loop.c ... doh Jeremy Fitzhardinge noted that the recent loop.c cleanups worked, but cause lockdep to complain. Ouch. OK, the deadlock is real and yes, I'm an idiot. Speaking of which, we probably want to s/lock/pin/ in drivers/base/map.c to avoid such brainos again. And yes, this stuff needs clear documentation. Will try to put one together once I get some sleep... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 07002e99	12-May-2007	Al Viro <viro@zeniv.linux.org.uk>	fix the dynamic allocation and probe in loop.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Ken Chen <kenchen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 01f2705d	09-May-2007	Nate Diller <nate.diller@gmail.com>	fs: convert core functions to zero_user_page It's very common for file systems to need to zero part or all of a page, the simplist way is just to use kmap_atomic() and memset(). There's actually a library function in include/linux/highmem.h that does exactly that, but it's confusingly named memclear_highpage_flush(), which is descriptive of how it does the work rather than what the purpose is. So this patchset renames the function to zero_user_page(), and calls it from the various places that currently open code it. This first patch introduces the new function call, and converts all the core kernel callsites, both the open-coded ones and the old memclear_highpage_flush() ones. Following this patch is a series of conversions for each file system individually, per AKPM, and finally a patch deprecating the old call. The diffstat below shows the entire patchset. [akpm@linux-foundation.org: fix a few things] Signed-off-by: Nate Diller <nate.diller@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 73285082	08-May-2007	Ken Chen <kenchen@google.com>	remove artificial software max_loop limit Remove artificial maximum 256 loop device that can be created due to a legacy device number limit. Searching through lkml archive, there are several instances where users complained about the artificial limit that the loop driver impose. There is no reason to have such limit. This patch rid the limit entirely and make loop device and associated block queue instantiation on demand. With on-demand instantiation, it also gives the benefit of not wasting memory if these devices are not in use (compare to current implementation that always create 8 loop devices), a net improvement in both areas. This version is both tested with creation of large number of loop devices and is compatible with existing losetup/mount user land tools. There are a number of people who worked on this and provided valuable suggestions, in no particular order, by: Jens Axboe Jan Engelhardt Christoph Hellwig Thomas M Signed-off-by: Ken Chen <kenchen@google.com> Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# f98393a6	06-May-2007	Peter Zijlstra <a.p.zijlstra@chello.nl>	mm: remove destroy_dirty_buffers from invalidate_bdev() Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't been used in 6 years (so akpm says). find * -name \.[ch] \| xargs grep -l invalidate_bdev \| while read file; do quilt add $file; sed -ie 's/invalidate_bdev($[^,]$,[^)]*)/invalidate_bdev(\1)/g' $file; done Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 6c648be6	08-Dec-2006	Josef Sipek <jsipek@fsl.cs.sunysb.edu>	[PATCH] struct path: convert block_drivers Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# ba674cfc	10-Oct-2006	Al Viro <viro@ftp.linux.org.uk>	[PATCH] __user annotations: loop.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 98ae6ccd	10-Oct-2006	Al Viro <viro@ftp.linux.org.uk>	[PATCH] fix misannotations in loop.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 863d5b82	29-Aug-2006	David Howells <dhowells@redhat.com>	[PATCH] BLOCK: Move the loop device ioctl compat stuff to the loop driver [try #6] Move the loop device ioctl compat stuff from fs/compat_ioctl.c to the loop driver so that the loop header file doesn't need to be included. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a7422bf8	29-Sep-2006	Serge E. Hallyn <serue@us.ibm.com>	[PATCH] loop: forward-port resource leak checks from Solar Forward port of the patch by Solar and ported by Julio. Compiles, boots, and passes my looptorturetest.sh. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Julio Auto <mindvortex@gmail.com> Cc: Solar Designer <solar@openwall.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 6c997918	29-Sep-2006	Serge E. Hallyn <serue@us.ibm.com>	[PATCH] kthread: convert loop.c to kthread Convert loop.c from the deprecated kernel_thread to kthread. This patch simplifies the code quite a bit and passes similar testing to the previous submission on both emulated x86 and s390. Changes since last submission: switched to using a rather simple loop based on wait_event_interruptible. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# ba52de12	27-Sep-2006	Theodore Ts'o <tytso@mit.edu>	[PATCH] inode-diet: Eliminate i_blksize from the inode structure This eliminates the i_blksize field from struct inode. Filesystems that want to provide a per-inode st_blksize can do so by providing their own getattr routine instead of using the generic_fillattr() function. Note that some filesystems were providing pretty much random (and incorrect) values for i_blksize. [bunk@stusta.de: cleanup] [akpm@osdl.org: generic_fillattr() fix] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 6ab3d562	30-Jun-2006	Jörn Engel <joern@wohnheim.fh-wedel.de>	Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
# f5e54d6e	28-Jun-2006	Christoph Hellwig <hch@lst.de>	[PATCH] mark address_space_operations const Same as with already do with the file operations: keep them in .rodata and prevents people from doing runtime patching. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# ce7b0f46	20-Jun-2005	Greg Kroah-Hartman <gregkh@suse.de>	[PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed And remove the now unneeded number field. Also fixes all drivers that set these fields. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
# ff23eca3	20-Jun-2005	Greg Kroah-Hartman <gregkh@suse.de>	[PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree Also fixes up all files that #include it. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
# 8ab5e4c1	20-Jun-2005	Greg Kroah-Hartman <gregkh@suse.de>	[PATCH] devfs: Remove devfs_remove() function from the kernel tree Removes the devfs_remove() function and all callers of it. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
# 95dc112a	20-Jun-2005	Greg Kroah-Hartman <gregkh@suse.de>	[PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree Removes the devfs_mk_dir() function and all callers of it. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
# 09c0dc68	26-Jun-2006	Linus Torvalds <torvalds@g5.osdl.org>	Revert "[PATCH] kthread: update loop.c to use kthread" This reverts commit c7b2eff059fcc2d1b7085ee3d84b79fd657a537b. Hugh Dickins explains: "It seems too little tested: "losetup -d /dev/loop0" fails with EINVAL because nothing sets lo_thread; but even when you patch loop_thread() to set lo->lo_thread = current, it can't survive more than a few dozen iterations of the loop below (with a tmpfs mounted on /tst): j=0 cp /dev/zero /tst while : do let j=j+1 echo "Doing pass $j" losetup /dev/loop0 /tst/zero mkfs -t ext2 -b 1024 /dev/loop0 >/dev/null 2>&1 mount -t ext2 /dev/loop0 /mnt umount /mnt losetup -d /dev/loop0 done it collapses with failed ioctl then BUG_ON(!bio). I think the original lo_done completion was more subtle and safe than the kthread conversion has allowed for." Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# c7b2eff0	25-Jun-2006	Serge E. Hallyn <serue@us.ibm.com>	[PATCH] kthread: update loop.c to use kthread Update loop.c to use a kthread instead of a deprecated kernel_thread for loop devices. [akpm@osdl.org: don't change the thread's name] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# eefe85ee	23-Jun-2006	Constantine Sapuntzakis <csapuntz@gmail.com>	[PATCH] drivers/block/loop.c: don't return garbage if LOOP_SET_STATUS not called While writing a version of losetup, I ran into the problem that the loop device was returning total garbage. It turns out the problem was that this losetup was only issuing the LOOP_SET_FD ioctl and not issuing a subsequent LOOP_SET_STATUS ioctl. This losetup didn't have any special status to set, so it left out the call. The deeper cause is that loop_set_fd sets the transfer function to NULL, which causes no transfer to happen lo_do_transfer. This patch fixes the problem by setting transfer to transfer_none in loop_set_fd. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 3e88c17d	26-Mar-2006	Herbert Poetzl <herbert@13thfloor.at>	[PATCH] loop: potential kernel hang waiting for kthread Check that kernel_thread() succeeded, so we don't wait for something which cannot happen. Signed-off-by: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# f85221dd	23-Mar-2006	Ingo Molnar <mingo@elte.hu>	[PATCH] sem2mutex: drivers/block/loop.c Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 1312f40e	12-Mar-2006	Al Viro <viro@zeniv.linux.org.uk>	[PATCH] regularize blk_cleanup_queue() use Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# 858119e1	14-Jan-2006	Arjan van de Ven <arjan@infradead.org>	[PATCH] Unlinline a bunch of other functions Remove the "inline" keyword from a bunch of big functions in the kernel with the goal of shrinking it by 30kb to 40kb Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 11b751ae	09-Jan-2006	Ingo Molnar <mingo@elte.hu>	[PATCH] mutex subsystem, semaphore to completion: drivers/block/loop.c convert the block loop device from semaphores to completions. Signed-off-by: Ingo Molnar <mingo@elte.hu>
# 1b1dcc1b	09-Jan-2006	Jes Sorensen <jes@sgi.com>	[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
# 994fc28c	15-Dec-2005	Zach Brown <zach.brown@oracle.com>	[PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE readpage(), prepare_write(), and commit_write() callers are updated to understand the special return code AOP_TRUNCATED_PAGE in the style of writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that the callee has unlocked the page and that the operation should be tried again with a new page. OCFS2 uses this to detect and work around a lock inversion in its aop methods. There should be no change in behaviour for methods that don't return AOP_TRUNCATED_PAGE. WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are made enums so that kerneldoc can be used to document their semantics. Signed-off-by: Zach Brown <zach.brown@oracle.com>
# b4e3ca1a	21-Oct-2005	Al Viro <viro@zeniv.linux.org.uk>	[PATCH] gfp_t: remaining bits of drivers/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 35a82d1a	23-Jun-2005	Nick Piggin <nickpiggin@yahoo.com.au>	[PATCH] optimise loop driver a bit Looks like locking can be optimised quite a lot. Increase lock widths slightly so lo_lock is taken fewer times per request. Also it was quite trivial to cover lo_pending with that lock, and remove the atomic requirement. This also makes memory ordering explicitly correct, which is nice (not that I particularly saw any mem ordering bugs). Test was reading 4 250MB files in parallel on ext2-on-tmpfs filesystem (1K block size, 4K page size). System is 2 socket Xeon with HT (4 thread). intel:/home/npiggin# umount /dev/loop0 ; mount /dev/loop0 /mnt/loop ; /usr/bin/time ./mtloop.sh Before: 0.24user 5.51system 0:02.84elapsed 202%CPU (0avgtext+0avgdata 0maxresident)k 0.19user 5.52system 0:02.88elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k 0.19user 5.57system 0:02.89elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k 0.22user 5.51system 0:02.90elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k 0.19user 5.44system 0:02.91elapsed 193%CPU (0avgtext+0avgdata 0maxresident)k After: 0.07user 2.34system 0:01.68elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k 0.06user 2.37system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k 0.06user 2.39system 0:01.68elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k 0.06user 2.36system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k 0.06user 2.42system 0:01.68elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
# 1da177e4	16-Apr-2005	Linus Torvalds <torvalds@ppc970.osdl.org>	Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!