History log of /linux-master/drivers/md/md.c
Revision Date Author Comments
# 396799eb 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: remove mddev->queue

Just use the request_queue from the gendisk pointer in the relatively
few places that still need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de


# 81a16e19 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: don't initialize queue limits

Initial queue limits are now set from ->run. Remove the superfluous
initialization in md_alloc and level_store.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-10-hch@lst.de


# e305fce1 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add queue limit helpers

Add a few helpers that wrap the block queue limits API for use in MD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-5-hch@lst.de


# 176df894 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add a mddev_is_dm helper

Add a helper to check for a DM-mapped MD device instead of using
the obfuscated ->gendisk or ->queue NULL checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de
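
The helper described above can be sketched in user space as follows. This is
only an illustration: the struct and function names are hypothetical
stand-ins, and the assumption (taken from the commit text) is that a
DM-mapped MD device is the one without a gendisk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the checked field is kept. */
struct gendisk_sketch {
	int unused;
};

struct mddev_sketch {
	/* Assumption: NULL when the array is driven by dm-raid. */
	struct gendisk_sketch *gendisk;
};

/* One named predicate instead of open-coded NULL checks at every call site. */
static bool mddev_is_dm_sketch(const struct mddev_sketch *mddev)
{
	return mddev->gendisk == NULL;
}
```

Call sites then read as a question about the device ("is this DM-mapped?")
rather than as a pointer test whose meaning has to be remembered.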


# 28be4fd3 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add a mddev_add_trace_msg helper

Add a small wrapper around blk_add_trace_msg that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de


# c396b90e 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add a mddev_trace_remap helper

Add a helper to trace bio remapping that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de


# 41425f96 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fixed the problem that IO across
the reshape position did not wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh started to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because registering a new
sync_thread no longer relies on the IO being done:

1) If the array is read-only, it can be switched to read-write by ioctl/sysfs;
2) md/raid never sets MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue via
the sysfs api 'sync_action'.

However, I'm not yet sure how to avoid the problem in dm-raid. On the one
hand, this patch makes sure raid_message() can't change the sync_thread
state after presuspend(); on the other hand, it detects the above 3 cases
before waiting for IO to be done in dm_suspend(), and lets dm-raid requeue
that IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
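
The three "reshape will never make progress" cases listed in this commit can
be sketched as a single predicate. This is a user-space model under stated
assumptions: the struct, the flag values, and both function names are
hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits modelling MD_RECOVERY_WAIT and MD_RECOVERY_FROZEN. */
enum {
	SK_RECOVERY_WAIT   = 1 << 0,
	SK_RECOVERY_FROZEN = 1 << 1,
};

struct array_sketch {
	bool read_only;
	unsigned int recovery;
};

/* False in exactly the three cases the commit message enumerates. */
static bool reshape_can_progress(const struct array_sketch *a)
{
	if (a->read_only)
		return false;
	if (a->recovery & (SK_RECOVERY_WAIT | SK_RECOVERY_FROZEN))
		return false;
	return true;
}

/* IO that crosses the reshape position is requeued instead of waiting forever. */
static bool io_should_requeue(const struct array_sketch *a, bool crosses_reshape_pos)
{
	return crosses_reshape_pos && !reshape_can_progress(a);
}
```

IO that does not cross the reshape position is unaffected in this model; only
IO that would otherwise wait on a stalled reshape is handed back to the caller.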


# 16c4770c 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid: really frozen sync_thread during suspend

1) Commit f52f5c71f3d4 ("md: fix stopping sync thread") removed
MD_RECOVERY_FROZEN from __md_stop_writes() without realizing that
dm-raid relies on __md_stop_writes() to freeze sync_thread
indirectly. Fix this problem by setting MD_RECOVERY_FROZEN in
md_stop_writes(), and since stop_sync_thread() is only used for
dm-raid in this case, also move stop_sync_thread() to
md_stop_writes().
2) The flag MD_RECOVERY_FROZEN doesn't mean that the sync thread is frozen;
it only prevents a new sync_thread from starting, and it can't stop the
running sync thread. In order to freeze sync_thread, stop_sync_thread()
should be called after setting the flag.
3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, so using
it as the condition for md_stop_writes() in raid_postsuspend() doesn't
look correct. Since a reentrant stop_sync_thread() does nothing,
always call md_stop_writes() in raid_postsuspend().
4) raid_message() can set/clear the flag MD_RECOVERY_FROZEN at any time,
and if MD_RECOVERY_FROZEN is cleared while the array is suspended,
a new sync_thread can start unexpectedly. Fix this by disallowing
raid_message() to change the sync_thread status during suspend.

Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the
test shell/lvconvert-raid-reshape.sh started to hang in stop_sync_thread().
With the previous fixes, the test no longer hangs there, but it still
fails and complains that ext4 is corrupted. With this patch, the test
neither hangs in stop_sync_thread() nor fails due to ext4 corruption.
However, there is still a deadlock related to dm-raid456 that will be
fixed in following patches.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/
Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com
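
Point 2 of this commit (the frozen flag blocks new sync threads but does not
stop a running one) can be modelled in a few lines. This is a simplified
sketch, not kernel code: the struct and all function names are hypothetical,
and the "sync thread" is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

struct sync_sketch {
	bool frozen;   /* models MD_RECOVERY_FROZEN */
	bool running;  /* models a registered, running sync thread */
};

/* A new sync thread may only start when not frozen and none is running. */
static bool try_start_sync(struct sync_sketch *s)
{
	if (s->frozen || s->running)
		return false;
	s->running = true;
	return true;
}

/* Setting the flag alone leaves an already running thread untouched. */
static void set_frozen_flag(struct sync_sketch *s)
{
	s->frozen = true;
}

/* Models stopping the running thread. */
static void stop_running_sync(struct sync_sketch *s)
{
	s->running = false;
}

/* "Really" freezing means both steps: set the flag, then stop the thread. */
static void freeze_sync(struct sync_sketch *s)
{
	set_frozen_flag(s);
	stop_running_sync(s);
}
```

The ordering in freeze_sync() matters in this model: setting the flag first
closes the window in which a new thread could start right after the old one
has been stopped.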


# 314e9af0 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md: export helper md_is_rdwr()

There are no functional changes for now; this prepares to fix a deadlock
for dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com


# 7a2347e2 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md: export helpers to stop sync_thread

Add new helpers:

void md_idle_sync_thread(struct mddev *mddev);
void md_frozen_sync_thread(struct mddev *mddev);
void md_unfrozen_sync_thread(struct mddev *mddev);

The helpers will be used in dm-raid in later patches to fix regressions
and prevent calling md_reap_sync_thread() directly.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com


# 2f03d0c2 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume

After commit 9dbd1aa3a81c ("dm raid: add reshaping support to the
target"), raid_ctr() sets MD_RECOVERY_FROZEN before md_run() and
expects the array to stay frozen until resume. However, md_run() clears
the flag by setting mddev->recovery to 0.

Before commit 1baae052cccd ("md: Don't ignore suspended array in
md_check_recovery()"), dm-raid actually relied on suspending to prevent
starting new sync_thread.

Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in
md_run().

Fixes: 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com
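
The effect of the fix can be sketched as a small pure function. The exact
shape of the change in md_run() is not shown in this log, so treat this as one
possible reading: the function name and flag values are hypothetical, and the
assumption is that the recovery flags are left untouched for a DM-mapped array
and reset to 0 otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits modelling two MD_RECOVERY_* flags. */
enum {
	SK_RECOVERY_FROZEN = 1 << 0,
	SK_RECOVERY_NEEDED = 1 << 1,
};

/*
 * Value of the recovery flags after "run":
 * a native md array starts from a clean slate, while a dm-raid array
 * keeps what the constructor set (notably the frozen flag) until resume.
 */
static unsigned long recovery_after_run(unsigned long recovery, bool is_dm)
{
	if (!is_dm)
		return 0;
	return recovery;
}
```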


# e9b0a155 25-Feb-2024 Li Nan <linan122@huawei.com>

md: check mddev->pers before calling md_set_readonly()

If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly().
Except for md_ioctl(), the other two callers of md_set_readonly() have
already checked 'mddev->pers'. To simplify the code, move the check of
'mddev->pers' to the caller.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-10-linan666@huaweicloud.com


# 650b2e69 25-Feb-2024 Li Nan <linan122@huawei.com>

md: clean up openers check in do_md_stop() and md_set_readonly()

Before stopping or setting readonly, mddev_set_closing_and_sync_blockdev()
is always called to check the openers, so there is no need to check them
again in do_md_stop() and md_set_readonly(). Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-9-linan666@huaweicloud.com


# 99b902ac 25-Feb-2024 Li Nan <linan122@huawei.com>

md: sync blockdev before stopping raid or setting readonly

Commit a05b7ea03d72 ("md: avoid crash when stopping md array races
with closing other open fds.") added a block device sync before stopping
the raid and setting it readonly. Later, commit 260fa034ef7a ("md: avoid
deadlock when dirty buffers during md_stop.") moved it to the ioctl path,
but array_state_store() was overlooked. Add sync blockdev to
array_state_store() now.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-8-linan666@huaweicloud.com


# f74aaf61 25-Feb-2024 Li Nan <linan122@huawei.com>

md: factor out a helper to sync mddev

There are no functional changes; this prepares to sync mddev in
array_state_store().

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-7-linan666@huaweicloud.com


# 9674f54e 25-Feb-2024 Li Nan <linan122@huawei.com>

md: Don't clear MD_CLOSING when the raid is about to stop

The raid should not be opened anymore when it is about to be stopped.
However, other processes can open it again if the flag MD_CLOSING is
cleared before exiting. From now on, this flag will not be cleared when
the raid will be stopped.

Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-6-linan666@huaweicloud.com


# 91b26a39 25-Feb-2024 Li Nan <linan122@huawei.com>

md: return directly before setting did_set_md_closing

There is nothing to do at 'out' before setting 'did_set_md_closing'
in md_ioctl(). Return directly, and it will help us to remove
'did_set_md_closing' later.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-5-linan666@huaweicloud.com


# 9dd8702e 25-Feb-2024 Li Nan <linan122@huawei.com>

md: clean up invalid BUG_ON in md_ioctl

'disk->private_data' is set to mddev in md_alloc() and never set to NULL,
and users need to open mddev before submitting ioctl. So mddev must not
have been freed during ioctl, and there is no need to check mddev here.
Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-4-linan666@huaweicloud.com


# 4e265939 25-Feb-2024 Li Nan <linan122@huawei.com>

md: changed the switch of RAID_VERSION to if

There is only one case of this 'switch'. Change it to 'if'.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-3-linan666@huaweicloud.com


# 2fe4ffc3 25-Feb-2024 Li Nan <linan122@huawei.com>

md: merge the check of capabilities into md_ioctl_valid()

There is no functional change. Just to make code cleaner.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-2-linan666@huaweicloud.com


# a28d893e 23-Jan-2024 Christian Brauner <brauner@kernel.org>

md: port block device access to file

Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-4-adbd023e19cc@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 74fa8f9c 15-Feb-2024 Christoph Hellwig <hch@lst.de>

block: pass a queue_limits argument to blk_alloc_disk

Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
which can't distinguish errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
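
The difference between returning NULL and returning an ERR_PTR can be shown
with a user-space imitation of the kernel's error-pointer convention. The
`_sk` helpers below are simplified stand-ins written for this sketch (they are
not the kernel macros), and alloc_disk_sk() is a hypothetical allocator, not
blk_alloc_disk().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Error codes are encoded in the top 4095 values of the address space. */
#define MAX_ERRNO_SK 4095

static inline void *err_ptr_sk(long err)
{
	return (void *)(intptr_t)err;
}

static inline long ptr_err_sk(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int is_err_sk(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SK;
}

struct disk_sk {
	int node;
};

static struct disk_sk the_disk;

/* Hypothetical allocator: fail_with is a positive errno value, or 0 for success. */
static struct disk_sk *alloc_disk_sk(int fail_with)
{
	if (fail_with)
		return err_ptr_sk(-fail_with);
	return &the_disk;
}
```

With this convention the caller can tell -ENOMEM from, say, -EINVAL, which a
bare NULL return cannot express.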


# 6cf35065 08-Feb-2024 Li Nan <linan122@huawei.com>

md: fix kmemleak of rdev->serial

If kobject_add() fails in bind_rdev_to_array(), 'rdev->serial' will have
been allocated but not freed, and kmemleak reports a leak.

unreferenced object 0xffff88815a350000 (size 49152):
comm "mdadm", pid 789, jiffies 4294716910
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc f773277a):
[<0000000058b0a453>] kmemleak_alloc+0x61/0xe0
[<00000000366adf14>] __kmalloc_large_node+0x15e/0x270
[<000000002e82961b>] __kmalloc_node.cold+0x11/0x7f
[<00000000f206d60a>] kvmalloc_node+0x74/0x150
[<0000000034bf3363>] rdev_init_serial+0x67/0x170
[<0000000010e08fe9>] mddev_create_serial_pool+0x62/0x220
[<00000000c3837bf0>] bind_rdev_to_array+0x2af/0x630
[<0000000073c28560>] md_add_new_disk+0x400/0x9f0
[<00000000770e30ff>] md_ioctl+0x15bf/0x1c10
[<000000006cfab718>] blkdev_ioctl+0x191/0x3f0
[<0000000085086a11>] vfs_ioctl+0x22/0x60
[<0000000018b656fe>] __x64_sys_ioctl+0xba/0xe0
[<00000000e54e675e>] do_syscall_64+0x71/0x150
[<000000008b0ad622>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

Fixes: 963c555e75b0 ("md: introduce mddev_create/destroy_wb_pool for the change of member device")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240208085556.2412922-1-linan666@huaweicloud.com
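
The leak pattern and its fix can be sketched in user space. Everything below
is a hypothetical model: malloc() stands in for the serial pool allocation,
the kobject_add() result is passed in as a parameter, and live_allocs is a
counter added only so the sketch can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct rdev_sk {
	void *serial;
};

/* Counts allocations that have not been freed yet (for illustration only). */
static int live_allocs;

static int create_serial(struct rdev_sk *rdev)
{
	rdev->serial = malloc(64);
	if (!rdev->serial)
		return -1;
	live_allocs++;
	return 0;
}

static void destroy_serial(struct rdev_sk *rdev)
{
	if (rdev->serial) {
		free(rdev->serial);
		rdev->serial = NULL;
		live_allocs--;
	}
}

/* kobject_add_result models the return value of the step that can fail. */
static int bind_rdev_sk(struct rdev_sk *rdev, int kobject_add_result)
{
	if (create_serial(rdev))
		return -1;
	if (kobject_add_result) {
		/* Error path: undo the allocation made earlier in this function. */
		destroy_serial(rdev);
		return kobject_add_result;
	}
	return 0;
}
```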


# 570b9147 04-Jan-2024 Li Lingfeng <lilingfeng3@huawei.com>

md: use RCU lock to protect traversal in md_spares_need_change()

Since md_start_sync() will be called without the protection of mddev_lock,
and it can run concurrently with array reconfiguration, traversal of rdev
in it should be protected by RCU lock.
Commit bc08041b32ab ("md: suspend array in md_start_sync() if array need
reconfiguration") added md_spares_need_change() to md_start_sync(),
causing use of rdev without any protection.
Fix this by adding RCU lock in md_spares_need_change().

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org # 6.7+
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240104133629.1277517-1-lilingfeng@huaweicloud.com


# 9cfcf99e 29-Dec-2023 Li Lingfeng <lilingfeng3@huawei.com>

md: get rdev->mddev with READ_ONCE()

Users may get rdev->mddev via sysfs while rdev is being released.
So use both READ_ONCE() and WRITE_ONCE() to prevent load/store tearing
and to read/write mddev atomically.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231229070500.3602712-1-lilingfeng@huaweicloud.com
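
The pattern can be sketched in user space with volatile-cast macros. The
`_SK` macros are simplified stand-ins for the kernel's READ_ONCE() and
WRITE_ONCE() (they rely on the GCC/Clang `__typeof__` extension), and the
structs and functions are hypothetical. The point of the sketch is that the
reader loads the pointer exactly once and then works only on that snapshot.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: force a single, untorn access through a volatile lvalue. */
#define READ_ONCE_SK(x)      (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SK(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

struct mddev_sk {
	int id;
};

struct rdev_sk {
	struct mddev_sk *mddev;
};

/* Models a sysfs "show" handler; returns -1 when the rdev has no owner. */
static int show_owner_id(struct rdev_sk *rdev)
{
	struct mddev_sk *mddev = READ_ONCE_SK(rdev->mddev); /* one load only */

	if (!mddev)
		return -1;
	return mddev->id; /* uses the snapshot, never re-reads rdev->mddev */
}

/* Models the release path clearing the back pointer. */
static void release_rdev_sk(struct rdev_sk *rdev)
{
	WRITE_ONCE_SK(rdev->mddev, NULL);
}
```

Without the snapshot, a handler that tests rdev->mddev and then dereferences
it again could see NULL on the second read if the release path runs in between.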


# faeaf210 28-Dec-2023 Yu Kuai <yukuai3@huawei.com>

md: remove redundant md_wakeup_thread()

On the one hand, mddev_unlock() will call md_wakeup_thread()
unconditionally; on the other hand, md_check_recovery() can't make
progress if 'reconfig_mutex' can't be grabbed. Hence, it really doesn't
make sense to wake up daemon thread while 'reconfig_mutex' is still
grabbed.

Remove all the md_wakeup_thread() calls for 'mddev->thread' made while
'reconfig_mutex' is still grabbed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231228125553.2697765-3-yukuai1@huaweicloud.com


# 61c90765 28-Dec-2023 Yu Kuai <yukuai3@huawei.com>

md: remove redundant check of 'mddev->sync_thread'

The lifetime of sync_thread:

1) Set MD_RECOVERY_NEEDED and wake up daemon thread (by ioctl/sysfs or
other events);
2) Daemon thread woke up, md_check_recovery() found that
MD_RECOVERY_NEEDED is set:
a) try to grab reconfig_mutex;
b) set MD_RECOVERY_RUNNING;
c) clear MD_RECOVERY_NEEDED, and then queue sync_work;
3) md_start_sync() choose sync_action, then register sync_thread;
4) md_do_sync() is done, set MD_RECOVERY_DONE and wake up daemon thread;
5) Daemon thread woke up, md_check_recovery() found that
MD_RECOVERY_DONE is set:
a) try to grab reconfig_mutex;
b) unregister sync_thread;
c) clear MD_RECOVERY_RUNNING and MD_RECOVERY_DONE;

Hence there is no such case that MD_RECOVERY_RUNNING is not set, while
sync_thread is registered.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231228125553.2697765-2-yukuai1@huaweicloud.com
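
The lifetime described in steps 1 to 5 can be modelled as a small state
machine, which makes the stated invariant (a registered sync_thread implies
MD_RECOVERY_RUNNING) easy to check. This is a user-space sketch: the flag
values, struct, and function names are hypothetical, and locking and the
work queue are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bits modelling MD_RECOVERY_NEEDED, _RUNNING and _DONE. */
enum {
	SK_NEEDED  = 1 << 0,
	SK_RUNNING = 1 << 1,
	SK_DONE    = 1 << 2,
};

struct life_sketch {
	unsigned int recovery;
	bool sync_thread_registered;
};

/* Step 1: something asks for a sync. */
static void request_sync(struct life_sketch *m)
{
	m->recovery |= SK_NEEDED;
}

/* Steps 2 and 5: the daemon thread's periodic check. */
static void check_recovery(struct life_sketch *m)
{
	if (m->recovery & SK_DONE) {
		m->sync_thread_registered = false;
		m->recovery &= ~(SK_RUNNING | SK_DONE);
		return;
	}
	if ((m->recovery & SK_NEEDED) && !(m->recovery & SK_RUNNING)) {
		m->recovery |= SK_RUNNING;
		m->recovery &= ~SK_NEEDED;
		/* the real code queues sync_work here; start_sync() models it */
	}
}

/* Step 3: the queued work registers the sync thread. */
static void start_sync(struct life_sketch *m)
{
	if (m->recovery & SK_RUNNING)
		m->sync_thread_registered = true;
}

/* Step 4: the sync thread finishes. */
static void sync_done(struct life_sketch *m)
{
	m->recovery |= SK_DONE;
}

/* The invariant from the commit message. */
static bool invariant_holds(const struct life_sketch *m)
{
	return !m->sync_thread_registered || (m->recovery & SK_RUNNING);
}
```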


# 9e46c70e 01-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: Don't suspend the array for interrupted reshape

md_start_sync() will suspend the array if there are spares that can be
added or removed from conf. However, if reshape is still in progress,
this won't happen at all, or data would be corrupted (remove_and_add_spares()
won't be called from md_choose_sync_action() for reshape). Hence there is
no need to suspend the array if reshape is not done yet.

Meanwhile, there is a potential deadlock for raid456:

1) reshape is interrupted;

2) set one of the disks WantReplacement, and add a new disk to the array,
however, recovery won't start until the reshape is finished;

3) then issue an IO across the reshape position, this IO will wait for
reshape to make progress;

4) continue to reshape, then md_start_sync() finds there is a spare disk
that can be added to conf, and mddev_suspend() is called;

Step 4 and step 3 are waiting for each other, so a deadlock is triggered.
Note that this problem was found by code review, and it has not been
reproduced yet.

Fix this problem by not suspending the array for an interrupted reshape;
this is safe because conf won't be changed until reshape is done.

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-6-yukuai1@huaweicloud.com
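
The decision this commit describes can be sketched as a predicate. This is a
hedged model, not the patch itself: the struct and function names are
hypothetical, and it assumes that "no reshape pending" is represented by a
sentinel reshape position (MAX_SECTOR_SK here, standing in for the kernel's
MaxSector).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Stand-in sentinel meaning "no reshape is pending or interrupted". */
#define MAX_SECTOR_SK ULLONG_MAX

struct array_sketch {
	unsigned long long reshape_position;
	bool spares_need_change;
};

/* Suspend only when spares must change AND no reshape is outstanding. */
static bool should_suspend_for_spares(const struct array_sketch *a)
{
	if (a->reshape_position != MAX_SECTOR_SK)
		return false; /* conf won't change until the reshape is done */
	return a->spares_need_change;
}
```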


# ad39c081 01-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: Don't register sync_thread for reshape directly

Currently, if reshape is interrupted, then reassembling the array will
register sync_thread directly from pers->run(). In this case
'MD_RECOVERY_RUNNING' is set directly; however, there is no guarantee
that md_do_sync() will be executed, hence stop_sync_thread() will hang
because 'MD_RECOVERY_RUNNING' can't be cleared.

The previous patch makes sure that md_do_sync() will set MD_RECOVERY_DONE;
however, the following hang can still be triggered occasionally by the
dm-raid test shell/lvconvert-raid-reshape.sh:

[root@fedora ~]# cat /proc/1982/stack
[<0>] stop_sync_thread+0x1ab/0x270 [md_mod]
[<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod]
[<0>] raid_presuspend+0x1e/0x70 [dm_raid]
[<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod]
[<0>] __dm_destroy+0x2a5/0x310 [dm_mod]
[<0>] dm_destroy+0x16/0x30 [dm_mod]
[<0>] dev_remove+0x165/0x290 [dm_mod]
[<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod]
[<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod]
[<0>] vfs_ioctl+0x21/0x60
[<0>] __x64_sys_ioctl+0xb9/0xe0
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

Meanwhile mddev->recovery is:
MD_RECOVERY_RUNNING |
MD_RECOVERY_INTR |
MD_RECOVERY_RESHAPE |
MD_RECOVERY_FROZEN

Fix this problem by removing the code that registers sync_thread directly
from raid10 and raid5, and let md_check_recovery() register
sync_thread.

Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-5-yukuai1@huaweicloud.com


# 82ec0ae5 01-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: Make sure md_do_sync() will set MD_RECOVERY_DONE

stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must
set MD_RECOVERY_DONE, so that follow up md_check_recovery() will
unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up
stop_sync_thread().

If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will
return without setting MD_RECOVERY_DONE, and after commit f52f5c71f3d4
("md: fix stopping sync thread"), dm-raid switched from
md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread
from md_stop() and md_stop_writes(), causing the test
shell/lvconvert-raid-reshape.sh to hang.

We shouldn't switch back to md_reap_sync_thread() because it's
problematic in the first place. Fix the problem by making sure
md_do_sync() will set MD_RECOVERY_DONE.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/ece2b06f-d647-6613-a534-ff4c9bec1142@redhat.com/
Fixes: d5d885fd514f ("md: introduce new personality funciton start()")
Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-4-yukuai1@huaweicloud.com


# 55a48ad2 01-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: Don't ignore read-only array in md_check_recovery()

Usually if the array is not read-write, md_check_recovery() won't
register a new sync_thread in the first place. And if the array is
read-write and sync_thread is registered, md_set_readonly() will
unregister sync_thread before setting the array read-only. md/raid
follows this behavior, hence there is no problem.

After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
hang can be triggered by test shell/integrity-caching.sh:

1) array is read-only. dm-raid update super block:
rs_update_sbs
 ro = mddev->ro
 mddev->ro = 0
  -> set array read-write
 md_update_sb

2) register new sync thread concurrently.

3) dm-raid set array back to read-only:
rs_update_sbs
 mddev->ro = ro

4) stop the array:
raid_dtr
 md_stop
  stop_sync_thread
   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
   md_wakeup_thread_directly(mddev->sync_thread);
   wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

5) sync thread done:
md_do_sync
 set_bit(MD_RECOVERY_DONE, &mddev->recovery);
 md_wakeup_thread(mddev->thread);

6) daemon thread can't unregister sync thread:
md_check_recovery
 if (!md_is_rdwr(mddev) &&
     !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
  return;
 -> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hang;

The root cause is that dm-raid manipulates 'mddev->ro' by itself;
however, dm-raid really should stop the sync thread before setting the
array read-only. Unfortunately, I need to read more code before I
can refactor the handling of 'mddev->ro' in dm-raid, hence let's fix
the problem the easy way for now to prevent a dm-raid regression.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/9801e40-8ac7-e225-6a71-309dcf9dc9aa@redhat.com/
Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-3-yukuai1@huaweicloud.com
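
Step 6 of the scenario above, and one way to avoid it, can be modelled in
user space. The sketch assumes the fix is to stop taking the early return
once the sync thread has reported completion; that is a reading of the commit
title, not a copy of the patch. All names and flag values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bits modelling MD_RECOVERY_NEEDED, _RUNNING and _DONE. */
enum {
	SK_NEEDED  = 1 << 0,
	SK_RUNNING = 1 << 1,
	SK_DONE    = 1 << 2,
};

struct md_sk {
	bool rdwr;
	unsigned int recovery;
	bool sync_thread_registered;
};

/*
 * honor_done == false models the old early return for read-only arrays;
 * honor_done == true models also looking at the DONE flag before returning.
 */
static void check_recovery_sk(struct md_sk *m, bool honor_done)
{
	bool skip = !m->rdwr && !(m->recovery & SK_NEEDED);

	if (honor_done && (m->recovery & SK_DONE))
		skip = false;
	if (skip)
		return;

	if (m->recovery & SK_DONE) {
		/* unregister the finished sync thread and clear its flags */
		m->sync_thread_registered = false;
		m->recovery &= ~(SK_RUNNING | SK_DONE);
	}
}
```

In the old behaviour a waiter that sleeps until RUNNING is cleared never wakes
up, which corresponds to the hang in step 4.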


# 1baae052 01-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: Don't ignore suspended array in md_check_recovery()

mddev_suspend() never stops sync_thread, hence it doesn't make sense to
ignore a suspended array in md_check_recovery(), which might leave
sync_thread unable to be unregistered.

After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
hang can be triggered by test shell/integrity-caching.sh:

1) suspend the array:
raid_postsuspend
 mddev_suspend

2) stop the array:
raid_dtr
 md_stop
  __md_stop_writes
   stop_sync_thread
    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
    md_wakeup_thread_directly(mddev->sync_thread);
    wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

3) sync thread done:
md_do_sync
 set_bit(MD_RECOVERY_DONE, &mddev->recovery);
 md_wakeup_thread(mddev->thread);

4) daemon thread can't unregister sync thread:
md_check_recovery
 if (mddev->suspended)
  return; -> return directly
 md_reap_sync_thread
  clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
  -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hang;

This problem is not just related to dm-raid; fix it by not ignoring
suspended arrays in md_check_recovery(). Follow-up patches will
improve dm-raid to better freeze the sync thread during suspend.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@redhat.com/
Fixes: 68866e425be2 ("MD: no sync IO while suspended")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-2-yukuai1@huaweicloud.com


# 855678ed 01-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: Fix missing release of 'active_io' for flush

submit_flushes
 atomic_set(&mddev->flush_pending, 1);
 rdev_for_each_rcu(rdev, mddev)
  atomic_inc(&mddev->flush_pending);
  bi->bi_end_io = md_end_flush
  submit_bio(bi);
                        /* flush io is done first */
                        md_end_flush
                         if (atomic_dec_and_test(&mddev->flush_pending))
                          percpu_ref_put(&mddev->active_io)
                          -> active_io is not released

 if (atomic_dec_and_test(&mddev->flush_pending))
  -> missing release of active_io

As a consequence, mddev_suspend() will wait forever for 'active_io' to
drop to zero.

Fix this problem by releasing 'active_io' in submit_flushes() if
'flush_pending' is decreased to zero.

Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
Cc: stable@vger.kernel.org # v6.1+
Reported-by: Blazej Kucman <blazej.kucman@linux.intel.com>
Closes: https://lore.kernel.org/lkml/20240130172524.0000417b@linux.intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-7-yukuai1@huaweicloud.com
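
The counting problem can be reproduced with plain integers. This is a
single-threaded user-space model with hypothetical names: the member flushes
complete immediately inside the loop (the "flush io is done first" ordering
from the trace), and the active_io reference is taken inside the function for
brevity, which is a simplification of where the kernel takes it.

```c
#include <assert.h>
#include <stdbool.h>

struct flush_sketch {
	int flush_pending; /* models mddev->flush_pending */
	int active_io;     /* models the active_io reference count */
};

/* Models atomic_dec_and_test(): true when the counter reaches zero. */
static bool dec_and_test(int *v)
{
	return --(*v) == 0;
}

/* Completion handler for one member device's flush. */
static void end_flush_sk(struct flush_sketch *m)
{
	if (dec_and_test(&m->flush_pending))
		m->active_io--;
}

/* "fixed" selects whether the submitter releases active_io on the last put. */
static void submit_flushes_sk(struct flush_sketch *m, int ndevs, bool fixed)
{
	int i;

	m->active_io++;       /* reference held for the duration of the flush */
	m->flush_pending = 1; /* bias held by the submitter */

	for (i = 0; i < ndevs; i++) {
		m->flush_pending++;
		end_flush_sk(m); /* completes before the submitter drops its bias */
	}

	/* The submitter's own put is the last one in this ordering. */
	if (dec_and_test(&m->flush_pending)) {
		if (fixed)
			m->active_io--; /* the release this commit adds */
	}
}
```

In the unfixed variant the counter ends at 1, which models the reference that
mddev_suspend() would wait on forever.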


# f9cfe7e7 09-Jan-2024 Yu Kuai <yukuai3@huawei.com>

md: Fix md_seq_ops() regressions

Commit cf1b6d4441ff ("md: simplify md_seq_ops") introduced the following
regressions:

1) If the list all_mddevs is empty, personalities and unused devices are
no longer shown to the user.
2) If the seq_file buffer overflows in md_seq_show(), then md_seq_start()
will be called again, hence personalities will be shown to the user
again.
3) If the seq_file buffer overflows in md_seq_stop(), seq_read_iter()
doesn't handle this, hence unused devices won't be shown to the user.

Fix the above problems by printing personalities and unused devices in
md_seq_show().

Fixes: cf1b6d4441ff ("md: simplify md_seq_ops")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240109133957.2975272-1-yukuai1@huaweicloud.com


# d8730f0c 14-Dec-2023 Song Liu <song@kernel.org>

md: Remove deprecated CONFIG_MD_MULTIPATH

md-multipath has been marked as deprecated for 2.5 years. Remove it.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-3-song@kernel.org


# 849d18e2 14-Dec-2023 Song Liu <song@kernel.org>

md: Remove deprecated CONFIG_MD_LINEAR

md-linear has been marked as deprecated for 2.5 years. Remove it.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org


# dc1cc22e 13-Dec-2023 Alex Lyakas <alex.lyakas@zadara.com>

md: When assembling the array, consult the superblock of the freshest device

Upon assembling the array, both kernel and mdadm allow the devices to have event
counter difference of 1, and still consider them as up-to-date.
However, a device whose event count is behind by 1, may in fact not be up-to-date,
and array resync with such a device may cause data corruption.
To avoid this, consult the superblock of the freshest device about the status
of a device, whose event counter is behind by 1.

Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zadara.com
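
The rule described in this commit can be sketched as follows. This is an
illustrative model only: the superblock layout, the member_active array, and
the function name are hypothetical, and the real superblock formats record
device roles differently.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MAX_DEVS 4

/* Hypothetical view of the freshest device's superblock. */
struct sb_sketch {
	unsigned long long events;     /* highest event count seen */
	bool member_active[SK_MAX_DEVS]; /* what it records about each member */
};

/*
 * A device at the same event count is up to date. A device exactly one
 * event behind is trusted only if the freshest superblock still lists it
 * as an active member. Anything older is stale.
 */
static bool device_is_current(const struct sb_sketch *freshest, int idx,
			      unsigned long long dev_events)
{
	if (dev_events == freshest->events)
		return true;
	if (dev_events + 1 == freshest->events)
		return freshest->member_active[idx];
	return false;
}
```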


# fa2bbff7 28-Nov-2023 Yu Kuai <yukuai3@huawei.com>

md: synchronize flush io with array reconfiguration

Currently RCU is used to protect iterating over rdevs in submit_flushes():

submit_flushes                          remove_and_add_spares
                                         synchronize_rcu
                                         pers->hot_remove_disk()
 rcu_read_lock()
 rdev_for_each_rcu
  if (rdev->raid_disk >= 0)
                                          rdev->raid_disk = -1;
   atomic_inc(&rdev->nr_pending)
   rcu_read_unlock()
   bi = bio_alloc_bioset()
   bi->bi_end_io = md_end_flush
   bi->bi_private = rdev
   submit_bio
   // issue io for removed rdev

Fix this problem by grabbing 'active_io' before iterating over rdevs, which
makes sure that remove_and_add_spares() can't run concurrently with
submit_flushes().

Fixes: a2826aa92e2e ("md: support barrier requests on all personalities.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231129020234.1586910-1-yukuai1@huaweicloud.com


# c891f1fd 25-Nov-2023 Yu Kuai <yukuai3@huawei.com>

md: remove flag RemoveSynchronized

RCU is not used correctly here, because synchronize_rcu() is called
before the old value is replaced, for example:

remove_and_add_spares                   // other path
 synchronize_rcu
 // called before replacing old value
 set_bit(RemoveSynchronized)
                                        rcu_read_lock()
                                        rdev = conf->mirrors[].rdev
 pers->hot_remove_disk
  conf->mirrors[].rdev = NULL;
  if (!test_bit(RemoveSynchronized))
   synchronize_rcu
   /*
    * won't be called, and won't wait
    * for concurrent readers to be done.
    */
                                        // access rdev after remove_and_add_spares()
                                        rcu_read_unlock()

Fortunately, there is separate RCU protection that prevents such an rdev
from being freed:

md_kick_rdev_from_array                 // other path
                                        rcu_read_lock()
                                        rdev = conf->mirrors[].rdev
list_del_rcu(&rdev->same_set)

                                        rcu_read_unlock()
                                        /*
                                         * rdev can be removed from conf, but
                                         * rdev won't be freed.
                                         */
synchronize_rcu()
free rdev

Hence remove this useless flag and prepare to remove the RCU protection for
accessing rdev from 'conf'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com


# d6e035aa 08-Nov-2023 Junxiao Bi <junxiao.bi@oracle.com>

md: bypass block throttle for superblock update

Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
introduced a hang and will be reverted in the next patch. The issue that
commit tried to fix arises because the md superblock write is throttled by
wbt; to fix it, let the superblock write bypass block layer throttling.

Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Cc: stable@vger.kernel.org # v5.19+
Suggested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231108182216.73611-1-junxiao.bi@oracle.com


# db29d79b 24-Nov-2023 Yu Kuai <yukuai3@huawei.com>

dm-raid: delay flushing event_work() after reconfig_mutex is released

After commit db5e653d7c9f ("md: delay choosing sync action to
md_start_sync()"), md_start_sync() will hold 'reconfig_mutex'. However,
in order to make sure event_work is done, __md_stop() flushes the
workqueue with reconfig_mutex held, hence if sync_work is still
pending, a deadlock will be triggered.

Fortunately, earlier patches that fix stopping sync_thread already make sure
all sync_work is done, hence such a deadlock is not possible
anymore. However, to avoid confusing people with this
implicit dependency, delay flushing event_work to dm-raid, where
'reconfig_mutex' is not held, and add some comments to emphasize that
the workqueue can't be flushed with 'reconfig_mutex' held.

Fixes: db5e653d7c9f ("md: delay choosing sync action to md_start_sync()")
Depends-on: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# b3911334 06-Dec-2023 Yu Kuai <yukuai3@huawei.com>

md: split MD_RECOVERY_NEEDED out of mddev_resume

New mddev_resume() calls were added to synchronize IO with array
reconfiguration; however, adding one in md_start_sync() introduced a
performance regression:

1) someone sets MD_RECOVERY_NEEDED first;
2) daemon thread grabs reconfig_mutex, then clears MD_RECOVERY_NEEDED and
   queues a new sync work;
3) daemon thread releases reconfig_mutex;
4) in md_start_sync
   a) check that there are spares that can be added/removed, then suspend
      the array;
   b) remove_and_add_spares may not be called, or may be called without
      actually adding/removing spares;
   c) resume the array, then set MD_RECOVERY_NEEDED again!

Steps 2 - 4 then loop, so mddev_suspend() is called very often and, as a
consequence, normal IO becomes quite slow.

Fix this problem by not setting MD_RECOVERY_NEEDED again in md_start_sync(),
which breaks the loop.

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Suggested-by: Song Liu <song@kernel.org>
Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231207020724.2797445-1-yukuai1@huaweicloud.com


# f52f5c71 05-Dec-2023 Yu Kuai <yukuai3@huawei.com>

md: fix stopping sync thread

Currently the sync thread is stopped from multiple contexts:
- idle_sync_thread
- frozen_sync_thread
- __md_stop_writes
- md_set_readonly
- do_md_stop

And there are some problems:
1) sync_work is flushed while reconfig_mutex is grabbed, this can
deadlock because the work function will grab reconfig_mutex as well.
2) md_reap_sync_thread() can't be called directly while md_do_sync() is
not finished yet, for example, commit 130443d60b1b ("md: refactor
idle/frozen_sync_thread() to fix deadlock").
3) If MD_RECOVERY_RUNNING is not set, there is no need to stop
sync_thread at all because sync_thread must not be registered.

Factor out a helper stop_sync_thread(), so that the above contexts behave
the same. Fix 1) by flushing sync_work after reconfig_mutex is released,
before waiting for sync_thread to be done; fix 2) by letting the daemon thread
unregister sync_thread; fix 3) by always checking MD_RECOVERY_RUNNING
first.

Fixes: db5e653d7c9f ("md: delay choosing sync action to md_start_sync()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231205094215.1824240-4-yukuai1@huaweicloud.com


# c9f7cb5b 05-Dec-2023 Yu Kuai <yukuai3@huawei.com>

md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()

If md_set_readonly() fails, the array can still be read-write while
'MD_RECOVERY_FROZEN' remains set, which leaves the array in an
abnormal state in which sync or recovery can't continue.
Hence make sure the flag is cleared after md_set_readonly() returns.

Fixes: 88724bfa68be ("md: wait for pending superblock updates before switching to read-only")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231205094215.1824240-3-yukuai1@huaweicloud.com


# f2d87a75 05-Dec-2023 Yu Kuai <yukuai3@huawei.com>

md: fix missing flush of sync_work

Commit ac619781967b ("md: use separate work_struct for md_start_sync()")
uses a new sync_work to replace del_work; however, stop_sync_thread() and
__md_stop_writes() were trying to wait for sync_thread to be done, hence
they should switch to sync_work as well.

Note that md_start_sync() from sync_work will grab 'reconfig_mutex',
hence other contexts can't hold the same lock while flushing the work; this
will be fixed in later patches.

Fixes: ac619781967b ("md: use separate work_struct for md_start_sync()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231205094215.1824240-2-yukuai1@huaweicloud.com


# 45b47895 17-Nov-2023 Song Liu <song@kernel.org>

md: fix bi_status reporting in md_end_clone_io

md_end_clone_io() may overwrite error status in orig_bio->bi_status with
BLK_STS_OK. This could happen when orig_bio has BIO_CHAIN (split by
md_submit_bio => bio_split_to_limits, for example). As a result, upper
layer may miss error reported from md (or the device) and consider the
failed IO was successful.

Fix this by only updating orig_bio->bi_status when the current bio reports
an error and orig_bio is BLK_STS_OK. This is the same behavior as
__bio_chain_endio().
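
The rule described above can be sketched in plain userspace C. This is only an
illustration, not the kernel code: the type blk_status_sketch_t, the two status
constants and the helper propagate_status() are hypothetical stand-ins for
blk_status_t, BLK_STS_OK / an error status, and the logic in md_end_clone_io().

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's blk_status_t values. */
typedef unsigned char blk_status_sketch_t;
#define STS_SKETCH_OK    0
#define STS_SKETCH_IOERR 10

/*
 * Propagate the completion status of a cloned bio to the original bio.
 * An error is recorded only if the original has not failed already, and
 * a successful clone never overwrites an earlier error with "OK".
 */
static void propagate_status(blk_status_sketch_t *orig,
			     blk_status_sketch_t clone)
{
	if (clone != STS_SKETCH_OK && *orig == STS_SKETCH_OK)
		*orig = clone;
}
```

With this rule, a split request whose first half failed keeps its error even
when the second half later completes successfully.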

Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5")
Cc: stable@vger.kernel.org # v5.14+
Reported-by: Bhanu Victor DiCara <00bvd0+linux@gmail.com>
Closes: https://lore.kernel.org/regressions/5727380.DvuYhMxLoT@bvd0/
Signed-off-by: Song Liu <song@kernel.org>
Tested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>


# 78b7b13f 16-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: cleanup pers->prepare_suspend()

pers->prepare_suspend() is not used anymore and can be removed.

This reverts the following three commits:

- commit 431e61257d63 ("md: export md_is_rdwr() and is_md_suspended()")
- commit 3e00777d5157 ("md: add a new api prepare_suspend() in
md_personality")
- commit 868bba54a3bc ("md/raid5: fix a deadlock in the case that reshape
is interrupted")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231016100240.540474-1-yukuai1@huaweicloud.com


# dd6291c5 02-Oct-2023 Joel Granados <j.granados@samsung.com>

raid: Remove now superfluous sentinel element from ctl_table array

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel from raid_table
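
To illustrate what a sentinel costs, here is a minimal userspace sketch. The
struct entry_sketch type, both tables and the helper names are hypothetical;
the real ctl_table structure and the sysctl registration API are not
reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much smaller stand-in for struct ctl_table. */
struct entry_sketch {
	const char *name;
	int value;
};

#define ARRAY_SIZE_SKETCH(a) (sizeof(a) / sizeof((a)[0]))

/* Old style: the table ends with an empty "sentinel" element. */
static const struct entry_sketch with_sentinel[] = {
	{ "speed_limit_min", 1000 },
	{ "speed_limit_max", 200000 },
	{ NULL, 0 },
};

/* New style: no sentinel, the size is known from the array itself. */
static const struct entry_sketch without_sentinel[] = {
	{ "speed_limit_min", 1000 },
	{ "speed_limit_max", 200000 },
};

/* Walk a sentinel-terminated table and count its real entries. */
static size_t count_until_sentinel(const struct entry_sketch *t)
{
	size_t n = 0;

	while (t[n].name)
		n++;
	return n;
}
```

Both tables describe the same two entries; the sentinel version is larger by
exactly one element, which is the per-table saving the commit refers to.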

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# 2b16a525 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: rename __mddev_suspend/resume() back to mddev_suspend/resume()

Now that the old apis are removed, __mddev_suspend/resume() can be
renamed to their original names.

This is done by:

sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch]
sed -i "s/__mddev_resume/mddev_resume/g" *.[ch]

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com


# 4717c028 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: remove old apis to suspend the array

Now that mddev_suspend() and mddev_resume() are not used anywhere, remove
them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-19-yukuai1@huaweicloud.com


# bc08041b 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: suspend array in md_start_sync() if array need reconfiguration

Suspend the array so that io can't run concurrently with array
reconfiguration; it's safe to suspend the array directly because normal io
doesn't rely on md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-18-yukuai1@huaweicloud.com


# b4128c00 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: cleanup mddev_create/destroy_serial_pool()

Now that all callers except the one stopping the array already suspend
the array, there is no need to suspend it again; hence remove the second
parameter.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com


# 58226942 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: use new apis to suspend array before mddev_create/destroy_serial_pool

mddev_create/destroy_serial_pool() will be called from several places
where mddev_suspend() will be called later.

Prepare to remove the mddev_suspend() from
mddev_create/destroy_serial_pool().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-14-yukuai1@huaweicloud.com


# 1b0a2d95 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: use new apis to suspend array for ioctls involving array reconfiguration

'reconfig_mutex' is grabbed before these ioctls; suspend the array
before taking the lock, so that io can't run concurrently with array
reconfiguration through ioctls.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-13-yukuai1@huaweicloud.com


# cfa078c8 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: use new apis to suspend array for adding/removing rdev from state_store()

A user can write 'remove' and 're-add' to trigger array reconfiguration
through sysfs; suspend the array in this case so that io can't run
concurrently with array reconfiguration.

And now that all the callers of add_bound_rdev() already suspend the
array, remove mddev_suspend/resume() from add_bound_rdev() as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-12-yukuai1@huaweicloud.com


# 205669f3 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: use new apis to suspend array for sysfs apis

Convert the following sysfs apis to use the new apis:
- level_store
- suspend_lo_store
- suspend_hi_store
- serialize_policy_store

These are not hot paths, so performance is not a concern.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-11-yukuai1@huaweicloud.com


# 714d2015 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: add new helpers to suspend/resume array

Advantages of the new apis:
- reconfig_mutex is not required;
- the weird logic in which suspending the array holds 'reconfig_mutex' for
md_check_recovery() to update the superblock is not needed;
- the special handling, 'pers->prepare_suspend', for raid456 is not
needed;
- they are safe to call at any time once mddev is allocated, and they are
designed to be used from slow paths where the array configuration is changed;
- the new helpers are designed to be called before mddev_lock(), hence
they can be interrupted by the user as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com


# 2e82248b 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()

Prepare to cleanup pers->prepare_suspend(), which is used to fix a
deadlock in raid456 by returning error for io that is waiting for
reshape to make progress in mddev_suspend().

This change will allow reshape to make progress while waiting for io to
be done in mddev_suspend() in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-4-yukuai1@huaweicloud.com


# 617787f1 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'

Protect 'suspend_lo' and 'suspend_hi' with READ_ONCE/WRITE_ONCE to prevent
reading abnormal values.
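
As a rough userspace illustration of the idea, the sketch below approximates
the two macros with volatile accesses. Everything here is an assumption made
for the example: the _SKETCH macros rely on the GNU C __typeof__ extension and
are not the kernel definitions, and struct range_sketch and
overlaps_suspended_range() are hypothetical. Volatile accesses stop the
compiler from re-reading or merging the loads, but by themselves do not make
a 64-bit access atomic on a 32-bit machine.

```c
#include <assert.h>

/* Userspace approximations of READ_ONCE()/WRITE_ONCE() (GNU C only). */
#define READ_ONCE_SKETCH(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical holder for the two range bounds. */
struct range_sketch {
	unsigned long long suspend_lo;
	unsigned long long suspend_hi;
};

/*
 * Return 1 if [start, end) overlaps the suspended range. Each bound is
 * loaded exactly once, so the comparisons below always use the snapshot
 * taken here and not a value re-read halfway through.
 */
static int overlaps_suspended_range(struct range_sketch *r,
				    unsigned long long start,
				    unsigned long long end)
{
	unsigned long long lo = READ_ONCE_SKETCH(r->suspend_lo);
	unsigned long long hi = READ_ONCE_SKETCH(r->suspend_hi);

	return start < hi && end > lo;
}
```

A writer would update the bounds with WRITE_ONCE_SKETCH(r->suspend_lo, value)
so that the store is likewise performed as a single access.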

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-2-yukuai1@huaweicloud.com


# 09f894af 28-Sep-2023 Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

md: do not require mddev_lock() for all options in array_state_store()

We don't need to lock the device to reject an unsupported request
in array_state_store(). No functional changes intended.

There are differences between ioctl and sysfs handling during stopping.
With this change, it will be easier to add additional steps which need
to be completed before mddev_lock() is taken.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230928125517.12356-1-mariusz.tkaczyk@linux.intel.com


# cf1b6d44 27-Sep-2023 Yu Kuai <yukuai3@huawei.com>

md: simplify md_seq_ops

Before this patch, the implementation is hacky and hard to understand:

1) md_seq_start sets pos to 1;
2) md_seq_show finds pos is 1, then prints Personalities;
3) md_seq_next finds pos is 1, then updates pos to the first mddev;
4) md_seq_show finds pos is not 1 or 2, shows the mddev;
5) md_seq_next finds pos is not 1 or 2, updates pos to the next mddev;
6) loop 4-5 until the last mddev, then md_seq_next updates pos to 2;
7) md_seq_show finds pos is 2, then prints unused devices;
8) md_seq_next finds pos is 2, stops;

This patch removes the magic values and uses seq_list_start/next/stop()
directly, and moves printing "Personalities" to md_seq_start() and
"unused devices" to md_seq_stop():

1) md_seq_start prints Personalities, and then sets pos to the first mddev;
2) md_seq_show shows the mddev;
3) md_seq_next updates pos to the next mddev;
4) loop 2-3 until the last mddev;
5) md_seq_stop prints unused devices;
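
The simplified flow can be modelled in userspace as follows. This is a sketch
under stated assumptions, not the kernel's seq_file machinery: struct
out_sketch, emit() and the four sketch_* callbacks are hypothetical, the
"devices" are plain strings, and the sketch ignores the fact that real
seq_file iterations can be stopped and restarted when the output buffer fills.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical output buffer standing in for struct seq_file. */
struct out_sketch {
	char buf[256];
	size_t len;
};

/* Append a string if it still fits, keeping the buffer NUL-terminated. */
static void emit(struct out_sketch *o, const char *s)
{
	size_t n = strlen(s);

	if (o->len + n < sizeof(o->buf)) {
		memcpy(o->buf + o->len, s, n + 1);
		o->len += n;
	}
}

/* start: print the header, then position on the first element. */
static const char **sketch_start(struct out_sketch *o, const char **arr,
				 size_t n)
{
	emit(o, "Personalities :\n");
	return n ? arr : NULL;
}

/* show: print only the current element, no magic positions. */
static void sketch_show(struct out_sketch *o, const char **pos)
{
	emit(o, *pos);
	emit(o, "\n");
}

/* next: advance to the following element, or NULL at the end. */
static const char **sketch_next(const char **arr, size_t n, const char **pos)
{
	return (size_t)(pos + 1 - arr) < n ? pos + 1 : NULL;
}

/* stop: print the trailer. */
static void sketch_stop(struct out_sketch *o)
{
	emit(o, "unused devices: <none>\n");
}

/* Driver loop that plays the role of the seq_file core. */
static void dump_all(struct out_sketch *o, const char **arr, size_t n)
{
	const char **pos;

	for (pos = sketch_start(o, arr, n); pos; pos = sketch_next(arr, n, pos))
		sketch_show(o, pos);
	sketch_stop(o);
}
```

Because the header and trailer live in start and stop, the show callback no
longer needs to distinguish special position values from real entries.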

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com


# 3d8d3287 27-Sep-2023 Yu Kuai <yukuai3@huawei.com>

md: factor out a helper from mddev_put()

There are no functional changes, prepare to simplify md_seq_ops in next
patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230927061241.1552837-2-yukuai1@huaweicloud.com


# ceb04163 25-Sep-2023 Justin Stitt <justinstitt@google.com>

md: replace deprecated strncpy with memcpy

`strncpy` is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

There are three such strncpy uses that this patch addresses:

The respective destination buffers are:
1) mddev->clevel
2) clevel
3) mddev->metadata_type

We expect mddev->clevel to be NUL-terminated due to its use with format
strings:
| ret = sprintf(page, "%s\n", mddev->clevel);

Furthermore, we can see that mddev->clevel is not expected to be
NUL-padded as `md_clean()` merely sets its first byte to NUL -- not
the entire buffer:
| static void md_clean(struct mddev *mddev)
| {
|     mddev->array_sectors = 0;
|     mddev->external_size = 0;
|     ...
|     mddev->level = LEVEL_NONE;
|     mddev->clevel[0] = 0;
|     ...

A suitable replacement for this instance is `memcpy` as we know the
number of bytes to copy and perform manual NUL-termination at a
specified offset. This really decays to just a byte copy from one buffer
to another. `strscpy` is also a reasonable replacement, but using
`slen` as the length argument would result in truncation of the last
byte unless something like `slen + 1` was provided, which isn't the most
idiomatic strscpy usage.

For the next case, the destination buffer `clevel` is expected to be
NUL-terminated based on its usage within kstrtol() which expects
NUL-terminated strings. Note that, in context, this code removes a
trailing newline which is seemingly not required as kstrtol() can handle
trailing newlines implicitly. However, there exists further usage of
clevel (or buf) that would also like to have the newline removed. All in
all, with similar reasoning to the first case, let's just use memcpy as
this is just a byte copy and NUL-termination is handled manually.

The third and final case concerning `mddev->metadata_type` is more or
less the same as the other two. We expect that it be NUL-terminated
based on its usage with seq_printf:
| seq_printf(seq, " super external:%s",
| mddev->metadata_type);
... and we can surmise that NUL-padding isn't required either due to how
it is handled in md_clean():
| static void md_clean(struct mddev *mddev)
| {
|     ...
|     mddev->metadata_type[0] = 0;
|     ...

So really, all these instances have precisely calculated lengths and
purposeful NUL-termination so we can just use memcpy to remove ambiguity
surrounding strncpy.
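
The pattern described above (copy a known number of bytes, then terminate by
hand) can be sketched in userspace C. The helper copy_level_sketch() and its
clamping behavior are assumptions made for illustration; the kernel code
handles over-long input in its own way, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most dst_size - 1 bytes of buf into dst with memcpy(), drop a
 * single trailing newline if present, and NUL-terminate manually.
 * Returns the length of the resulting string (0 if dst_size is 0).
 */
static size_t copy_level_sketch(char *dst, size_t dst_size,
				const char *buf, size_t len)
{
	size_t slen = len;

	if (dst_size == 0)
		return 0;
	if (slen >= dst_size)
		slen = dst_size - 1;	/* leave room for the terminator */

	memcpy(dst, buf, slen);		/* plain byte copy, no padding */
	if (slen > 0 && dst[slen - 1] == '\n')
		slen--;			/* strip the trailing newline */
	dst[slen] = '\0';		/* explicit NUL-termination */
	return slen;
}
```

Unlike strncpy(), this never NUL-pads the rest of the destination, and the
terminator is always written at a known offset.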

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230925-strncpy-drivers-md-md-c-v1-1-2b0093b89c2b@google.com


# 54d21eb6 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: don't check 'mddev->pers' and 'pers->quiesce' from suspend_lo_store()

Now that mddev_suspend() doesn't rely on 'mddev->pers' to be set, it's
safe to remove such checking.

This will also allow the array to be suspended even before the array
is run.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-8-yukuai1@huaweicloud.com


# a2a9f168 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: don't check 'mddev->pers' from suspend_hi_store()

Now that mddev_suspend() doesn't rely on 'mddev->pers' to be set, it's
safe to remove such checking.

This will also allow the array to be suspended even before the array
is run.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-7-yukuai1@huaweicloud.com


# b721e788 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: don't rely on 'mddev->pers' to be set in mddev_suspend()

'active_io' used to be initialized while the array is running, and
'mddev->pers' is set while the array is running as well. Hence caller
must hold 'reconfig_mutex' and guarantee 'mddev->pers' is set before
calling mddev_suspend().

Now that 'active_io' is initialized when mddev is allocated, such
restriction doesn't exist anymore. In the meantime, follow up patches
will refactor mddev_suspend(), hence add checking for 'mddev->pers' to
prevent null-ptr-deref.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-4-yukuai1@huaweicloud.com


# b8494823 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: initialize 'writes_pending' while allocating mddev

Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleting mddev, instead of in pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' from being
initialized multiple times. This usage is safe but a little weird.

On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in the common layer, for example:

array_state_store
 set_in_sync
  if (!mddev->in_sync) -> in_sync is used for all levels
   // access writes_pending

There might be some implicit dependency that I don't recognize which makes
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.

By the way, it makes sense to initialize 'writes_pending' in the common layer
because there are already three levels that use it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com


# d58eff83 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: initialize 'active_io' while allocating mddev

'active_io' is used by mddev_suspend() and it's initialized in
md_run(); this requires that 'reconfig_mutex' be held and
"mddev->pers" be set before calling mddev_suspend().

Initialize 'active_io' early so that mddev_suspend() is safe to call
once mddev is allocated; this will help refactor
mddev_suspend() in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com


# 81e2ce1b 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: delay remove_and_add_spares() for read only array to md_start_sync()

Before this patch, for a read-only array:

md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, then it
calls remove_and_add_spares() directly to try to remove and add rdevs
from the array.

After this patch:

1) md_check_recovery() checks that 'MD_RECOVERY_NEEDED' is set, that the
worker 'sync_work' is not pending, and that there are rdevs that can be added
or removed, then it queues new work md_start_sync();
2) md_start_sync() calls remove_and_add_spares() and exits;

This change makes sure that array reconfiguration is independent of the
daemon, and it'll be much easier to synchronize it with io, considering
that io may rely on the daemon thread to be done.

Also fix a problem that 'pers->spare_active' is called after
remove_and_add_spares(), which is the wrong order, because spares must
be activated first, and then remove_and_add_spares() can add spares to the
array, like what the read-write case does:

1) daemon sets 'MD_RECOVERY_RUNNING', registers a new sync thread to do
recovery;
2) recovery is done, md_do_sync() sets 'MD_RECOVERY_DONE' before returning;
3) daemon calls 'pers->spare_active', and clears 'MD_RECOVERY_RUNNING';
4) in the next round of the daemon, call remove_and_add_spares() to add
spares to the array.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com


# a0ae7e4e 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: factor out a helper rdev_addable() from remove_and_add_spares()

There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-7-yukuai1@huaweicloud.com


# b172a070 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: factor out a helper rdev_is_spare() from remove_and_add_spares()

There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-6-yukuai1@huaweicloud.com


# 3389d57f 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: factor out a helper rdev_removeable() from remove_and_add_spares()

There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-5-yukuai1@huaweicloud.com


# db5e653d 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: delay choosing sync action to md_start_sync()

Before this patch, for a read-write array:

1) md_check_recovery() finds that something needs to be done, and it
tries to grab 'reconfig_mutex'. The cases in which md_check_recovery()
needs to do something:
- array is not suspended;
- super_block needs to be updated;
- 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set;
- unusual case related to safemode;

2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
md_check_recovery() tries to choose a sync action, and then queues a
work md_start_sync().

3) md_start_sync() registers sync_thread;

After this patch,

1) is the same;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
queue a work md_start_sync() directly;
3) md_start_sync() tries to choose a sync action, and then registers
sync_thread();

Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2)
and 3) and md_do_sync() are always run serially and can never run
concurrently, so this change should not introduce any behavior change for now.

Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING'
without protection in the error path, which might affect the logic in
md_check_recovery().

The advantage of this change is that array reconfiguration is
independent of the daemon now, and it'll be much easier to synchronize it
with io, considering that io may rely on the daemon thread to be done.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com


# 897c62a1 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: factor out a helper to choose sync action from md_check_recovery()

There are no functional changes. On the one hand this makes the code cleaner;
on the other hand it prevents the following checkpatch error in the next
patch, which delays choosing the sync action to md_start_sync().

ERROR: do not use assignment in if condition
+ } else if ((spares = remove_and_add_spares(mddev, NULL))) {

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com


# ac619781 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: use separate work_struct for md_start_sync()

It's a little weird to borrow 'del_work' for md_start_sync(), declare
a new work_struct 'sync_work' for md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com


# 9f0f5a30 27-Sep-2023 Jan Kara <jack@suse.cz>

md: Convert to bdev_open_by_dev()

Convert md to use bdev_open_by_dev() and pass the handle around. We also
don't need the 'Holder' flag anymore so remove it.

CC: linux-raid@vger.kernel.org
CC: Song Liu <song@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-11-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>


# c8870379 14-Sep-2023 Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

md: Put the right device in md_seq_next

If there are multiple arrays in the system and one mddevice is marked
with MD_DELETED and md_seq_next() is called in the middle of removal,
then it _get()s the proper device but may _put() the deleted one. As a result,
the active counter may never be zeroed for the mddevice and it cannot
be removed.

Put the device which has been _get() by the previous md_seq_next() call.

Cc: stable@vger.kernel.org
Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed")
Reported-by: AceLan Kao <acelan@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com


# 99892147 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: fix warning for holder mismatch from export_rdev()

Commit a1d767191096 ("md: use mddev->external to select holder in
export_rdev()") fixed the problem that 'claim_rdev' is used for
blkdev_get_by_dev() while 'rdev' is used for blkdev_put().

However, if mddev->external is changed from 0 to 1, then 'rdev' is used
for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). This
problem can be reproduced reliably by the following:

New file: mdadm/tests/23rdev-lifetime

devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2

clean_up_test() {
    kill -9 $pid
    echo clear > /sys/block/md0/md/array_state
}

trap 'clean_up_test' EXIT

add_by_sysfs() {
    while true; do
        echo $devt > /sys/block/md0/md/new_dev
    done
}

remove_by_sysfs(){
    while true; do
        echo remove > /sys/block/md0/md/dev-${devname}/state
    done
}

echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"

add_by_sysfs &
pid="$pid $!"

remove_by_sysfs &
pid="$pid $!"

sleep $runtime
exit 0

Test cmd:

./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime

Test result:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330
Modules linked in: multipath md_mod loop
CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50
RIP: 0010:blkdev_put+0x27c/0x330
Call Trace:
<TASK>
export_rdev.isra.23+0x50/0xa0 [md_mod]
mddev_unlock+0x19d/0x300 [md_mod]
rdev_attr_store+0xec/0x190 [md_mod]
sysfs_kf_write+0x52/0x70
kernfs_fop_write_iter+0x19a/0x2a0
vfs_write+0x3b5/0x770
ksys_write+0x74/0x150
__x64_sys_write+0x22/0x30
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix the problem by recording if 'rdev' is used as holder.
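
As a rough illustration only (a userspace sketch, not the kernel code), the
idea of recording the holder choice at open time can be modeled as below.
The toy_* names and the 'holder_is_rdev' field are hypothetical, and the
sketch assumes that the holder is chosen from mddev->external when the
device is opened.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the kernel structures. */
struct toy_mddev {
	bool external;
};

struct toy_rdev {
	bool holder_is_rdev;	/* recorded once, when the device is opened */
};

/* Stands in for the shared 'claim_rdev' holder. */
static struct toy_rdev toy_claim_rdev;

/* Pick the holder for the open and remember which one was used. */
static const void *toy_get_holder(struct toy_rdev *rdev,
				  const struct toy_mddev *mddev)
{
	rdev->holder_is_rdev = !mddev->external;
	return rdev->holder_is_rdev ? (const void *)rdev
				    : (const void *)&toy_claim_rdev;
}

/* Pick the holder for the release from the record, not from mddev. */
static const void *toy_put_holder(const struct toy_rdev *rdev)
{
	return rdev->holder_is_rdev ? (const void *)rdev
				    : (const void *)&toy_claim_rdev;
}
```

Because the release path consults only the recorded flag, a later change of
'external' cannot make the two holders disagree in this model.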

Fixes: a1d767191096 ("md: use mddev->external to select holder in export_rdev()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825025532.1523008-3-yukuai1@huaweicloud.com


# 7deac114 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: don't dereference mddev after export_rdev()

Except for initial reference, mddev->kobject is referenced by
rdev->kobject, and if the last rdev is freed, there is no guarantee that
mddev is still valid. Hence mddev should not be used anymore after
export_rdev().

This problem can be triggered, at a very low rate, by the following mdadm
test:

New file: mdadm/tests/23rdev-lifetime

devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2

clean_up_test() {
kill -9 $pid
echo clear > /sys/block/md0/md/array_state
}

trap 'clean_up_test' EXIT

add_by_sysfs() {
while true; do
echo $devt > /sys/block/md0/md/new_dev
done
}

remove_by_sysfs(){
while true; do
echo remove > /sys/block/md0/md/dev-${devname}/state
done
}

echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"

add_by_sysfs &
pid="$pid $!"

remove_by_sysfs &
pid="$pid $!"

sleep $runtime
exit 0

Test cmd:

./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime

Test result:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bcb: 0000 [#4] PREEMPT SMP
CPU: 0 PID: 1292 Comm: test Tainted: G D W 6.5.0-rc2-00121-g01e55c376936 #562
RIP: 0010:md_wakeup_thread+0x9e/0x320 [md_mod]
Call Trace:
<TASK>
mddev_unlock+0x1b6/0x310 [md_mod]
rdev_attr_store+0xec/0x190 [md_mod]
sysfs_kf_write+0x52/0x70
kernfs_fop_write_iter+0x19a/0x2a0
vfs_write+0x3b5/0x770
ksys_write+0x74/0x150
__x64_sys_write+0x22/0x30
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix this problem by not dereferencing mddev after export_rdev().

Fixes: 3ce94ce5d05a ("md: fix duplicate filename for rdev")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825025532.1523008-2-yukuai1@huaweicloud.com


# 7eb8ff02 03-Aug-2023 Li Lingfeng <lilingfeng3@huawei.com>

md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread

Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
action_store"") removed the scenario of calling md_unregister_thread()
without holding mddev->reconfig_mutex, so add a lock-held check before
accessing mddev->sync_thread by passing mddev to md_unregister_thread().

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>


# e24ed043 27-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: restore 'noio_flag' for the last mddev_resume()

memalloc_noio_save() is called for the first mddev_suspend(), and
repeated mddev_suspend() calls only increase 'suspended'. However,
memalloc_noio_restore() is also called for the first mddev_resume(),
which means that memory reclaim is enabled again before the last
mddev_resume() is called, while the array is still suspended.

Fix this problem by restoring 'noio_flag' in the last mddev_resume().
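
The intended nesting can be sketched in userspace C as follows. This is a
simplified model under stated assumptions, not the md code: the toy_* names
are hypothetical, and a boolean stands in for the scope entered by
memalloc_noio_save() and left by memalloc_noio_restore().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the suspend nesting; not the kernel structure. */
struct toy_mddev {
	int suspended;		/* nesting depth of suspend calls */
	bool noio_active;	/* models the noio scope being in effect */
};

static void toy_suspend(struct toy_mddev *m)
{
	/* Only the first suspend enters the noio scope. */
	if (m->suspended++ == 0)
		m->noio_active = true;
}

static void toy_resume(struct toy_mddev *m)
{
	assert(m->suspended > 0);
	/* Only the last resume leaves the noio scope. */
	if (--m->suspended == 0)
		m->noio_active = false;
}
```

With two nested suspends, the scope in this model stays active after the
first resume and ends only with the second one.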

Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>


# b39f35eb 27-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: don't quiesce in mddev_suspend()

Some levels don't implement "pers->quiesce"; for example,
raid0_quiesce() is empty. Now that all levels hold 'active_io' until io
is done, waiting for 'active_io' to be 0 is enough to make sure all
normal io is done, and percpu_ref_kill() for 'active_io' makes sure no
new normal io can be dispatched. There is no need to call
"pers->quiesce" from mddev_suspend() anymore.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>


# c687297b 21-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: also clone new io if io accounting is disabled

Currently, 'active_io' is grabbed before make_request() is called, and
it's dropped immediately after make_request() returns. Hence 'active_io'
actually means io is dispatching, not io is inflight.

For raid0 and raid456, where io accounting is enabled, 'active_io' is
also grabbed when the bio is cloned for io accounting, and this
'active_io' is not dropped until io is done.

Always clone a new bio so that 'active_io' means that io is inflight;
raid1 and raid10 will switch to this method in later patches.

Now that bio will be cloned even if io accounting is disabled, also
rename related structure from '*_acct_*' to '*_clone_*'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com


# c567c86b 21-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: move initialization and destruction of 'io_acct_set' to md.c

'io_acct_set' is only used for raid0 and raid456, prepare to use it for
raid1 and raid10, so that io accounting from different levels can be
consistent.

By the way, follow up patches will also use this io clone mechanism to
make sure 'active_io' represents in flight io, not io that is dispatching,
so that mddev_suspend will wait for io to be done as designed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com


# 0ae1c9d3 15-Jun-2023 Christoph Hellwig <hch@lst.de>

md: deprecate bitmap file support

The support for bitmaps on files is a very bad idea abusing various kernel
APIs, and fundamentally requires the file to not be on the actual array
without a way to check that this is actually the case. Add a deprecation
warning to see if we might be able to eventually drop it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-12-hch@lst.de


# a34d4ef8 15-Jun-2023 Christoph Hellwig <hch@lst.de>

md: make bitmap file support optional

The support for write intent bitmaps in external files in md is a hot
mess that abuses ->bmap to map file offsets into physical device
objects, and also abuses buffer_heads in a creative way.

Make this code optional so that MD can be built into future kernels
without buffer_head support, and so that we can eventually deprecate it.

Note this does not affect the internal bitmap support, which has none of
the problems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de


# f71209b1 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: enhance checking in md_check_recovery()

For md_check_recovery():

1) if 'MD_RECOVERY_RUNNING' is not set, register new sync_thread.
2) if 'MD_RECOVERY_RUNNING' is set:
a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for
md_do_sync() to be done.
b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code
expects that sync_thread is not NULL, otherwise new sync_thread will
be registered, which will corrupt the array.

Make sure md_check_recovery() won't register new sync_thread if
'MD_RECOVERY_RUNNING' is still set, and add a new WARN_ON_ONCE() for
the above corruption.
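
The cases above can be summarized as a small decision function. This is
only a sketch of the described logic, not the code in md_check_recovery():
the toy_* names are hypothetical, the flags are passed in as plain booleans,
and TOY_WARN stands for the path where the kernel would hit the new
WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>

enum toy_action {
	TOY_START_SYNC,	/* register a new sync_thread */
	TOY_WAIT,	/* md_do_sync() is still running; do nothing */
	TOY_REAP,	/* unregister the finished sync_thread */
	TOY_WARN,	/* inconsistent state; never start a new thread */
};

static enum toy_action toy_check_recovery(bool running, bool done,
					   bool have_sync_thread)
{
	if (!running)
		return TOY_START_SYNC;
	if (!done)
		return TOY_WAIT;
	if (!have_sync_thread)
		return TOY_WARN;
	return TOY_REAP;
}
```

The point of the last check is that a missing sync_thread while the
running flag is still set no longer falls through to starting a new thread.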

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-7-yukuai1@huaweicloud.com


# 753260ed 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: wake up 'resync_wait' at last in md_reap_sync_thread()

In action_store(), md_reap_sync_thread() has just been replaced with
wait_event(resync_wait, ...). Wake up 'resync_wait' last in
md_reap_sync_thread(), so that action_store() still waits for everything
in md_reap_sync_thread() to be done.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-6-yukuai1@huaweicloud.com


# 130443d6 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: refactor idle/frozen_sync_thread() to fix deadlock

Our test found the following deadlock in raid10:

1) Issue a normal write, and such write failed:

raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry

// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

2) Trigger a recovery:

raid10_sync_request
raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

action_store
mddev_lock
md_unregister_thread
kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists. In order to fix the problem, we must
break one of them. Dependencies 1 and 2 can't be broken because they are
fundamental to the design. Breaking dependency 4 may be possible if it
can be guaranteed that no io can be inflight; however, this requires a
new mechanism which seems complex. Dependency 3 is a good choice, because
idle/frozen only requires the sync thread to finish, which can be done
asynchronously with what is already implemented, and 'reconfig_mutex' is
not needed anymore.

This patch switches 'idle' and 'frozen' to wait for the sync thread to be
done asynchronously, and also adds a sequence counter to record how many
times the sync thread is done, so that 'idle' won't keep waiting on a
newly started sync thread.
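
The role of the sequence counter can be shown with the wake-up condition
alone. The function below is a hypothetical userspace sketch, not the
kernel's wait condition: it assumes the waiter samples the counter before
sleeping and that the counter is incremented each time a sync thread
finishes.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true when an 'idle' waiter may stop waiting: either the sync
 * thread it was waiting for has finished (the counter moved on), or no
 * sync thread is running at all.
 */
static bool toy_idle_may_return(int seq_at_start, int seq_now, bool running)
{
	return seq_now != seq_at_start || !running;
}
```

If a new sync thread has already started by the time the waiter is woken,
'running' is true again, but the changed counter still lets the waiter
return instead of waiting for the new thread.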

Note that raid456 has a similar deadlock([1]), and it's verified[2] that
this deadlock can be fixed by this patch as well.

[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com


# 6f56f0c4 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: add a mutex to synchronize idle and frozen in action_store()

Currently, for idle and frozen, action_store() will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop the sync thread. However, this will
cause a deadlock (explained in the next patch). In order to fix the
problem, the following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() do.

Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems. For example,
'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started, which
might starve the in-progress 'frozen'. A mutex is added to synchronize
idle and frozen in action_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com


# 64e5e09a 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: refactor action_store() for 'idle' and 'frozen'

Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there
are no functional changes except that MD_RECOVERY_RUNNING is checked
again after 'reconfig_mutex' is held.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-3-yukuai1@huaweicloud.com


# a865b96c 29-May-2023 Yu Kuai <yukuai3@huawei.com>

Revert "md: unlock mddev before reap sync_thread in action_store"

This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.

Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18

This is because work is requeued while it's still inside workqueue:

t1:                             t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
                                md_check_recovery
                                 mddev_try_lock
                                 /*
                                  * once MD_RECOVERY_DONE is set, new sync_thread
                                  * can start.
                                  */
                                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                                 INIT_WORK(&mddev->del_work, md_start_sync)
                                 queue_work(md_misc_wq, &mddev->del_work)
                                  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
                                  // set pending bit
                                  insert_work
                                   list_add_tail
                                 mddev_unlock
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
// MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
INIT_WORK(&mddev->del_work, md_start_sync)
work->data = 0
// work pending bit is cleared
queue_work(md_misc_wq, &mddev->del_work)
insert_work
list_add_tail
// list is corrupted

Revert the above commit to fix the problem. The deadlock that the
reverted commit tried to fix will be fixed in the following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com


# 7d5fff89 08-Jul-2023 Yu Kuai <yukuai3@huawei.com>

dm raid: protect md_stop() with 'reconfig_mutex'

__md_stop_writes() and __md_stop() will modify many fields that are
protected by 'reconfig_mutex', and all the callers will grab
'reconfig_mutex' except for md_stop().

Also, update md_stop() to make certain 'reconfig_mutex' is held using
lockdep_assert_held().

Fixes: 9d09e663d550 ("dm: raid456 basic support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 4934b640 21-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: fix 'delete_mutex' deadlock

Commit 3ce94ce5d05a ("md: fix duplicate filename for rdev") introduced a
new lock 'delete_mutex', which triggers a new deadlock:

t1: remove rdev                 t2: sysfs writer

rdev_attr_store                 rdev_attr_store
 mddev_lock
 state_store
 md_kick_rdev_from_array
  lock delete_mutex
  list_add mddev->deleting
  unlock delete_mutex
 mddev_unlock
                                 mddev_lock
                                 ...
  lock delete_mutex
  kobject_del
  // wait for sysfs writers to be done
                                 mddev_unlock
                                 lock delete_mutex
                                 // wait for delete_mutex, deadlock

'delete_mutex' is used to protect the list 'mddev->deleting', turns out
that this list can be protected by 'reconfig_mutex' directly, and this
lock can be removed.

Fix this problem by removing the lock, and use 'reconfig_mutex' to
protect the list. mddev_unlock() will move this list to a local list to
be handled after 'reconfig_mutex' is dropped.

Fixes: 3ce94ce5d05a ("md: fix duplicate filename for rdev")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com


# a1d76719 16-Jun-2023 Song Liu <song@kernel.org>

md: use mddev->external to select holder in export_rdev()

mdadm test "10ddf-create-fail-rebuild" triggers warnings like the following

[ 215.526357] ------------[ cut here ]------------
[ 215.527243] WARNING: CPU: 18 PID: 1264 at block/bdev.c:617 blkdev_put+0x269/0x350
[ 215.528334] Modules linked in:
[ 215.528806] CPU: 18 PID: 1264 Comm: mdmon Not tainted 6.4.0-rc2+ #768
[ 215.529863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
[ 215.531464] RIP: 0010:blkdev_put+0x269/0x350
[ 215.532167] Code: ff ff 49 8d 7d 10 e8 56 bf b8 ff 4d 8b 65 10 49 8d bc
24 58 05 00 00 e8 05 be b8 ff 41 83 ac 24 58 05 00 00 01 e9 44 ff ff ff
<0f> 0b e9 52 fe ff ff 0f 0b e9 6b fe ff ff1
[ 215.534780] RSP: 0018:ffffc900040bfbf0 EFLAGS: 00010283
[ 215.535635] RAX: ffff888174001000 RBX: ffff88810b1c3b00 RCX: ffffffff819a4061
[ 215.536645] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88810b1c3ba0
[ 215.537657] RBP: ffff88810dbde800 R08: fffffbfff0fca983 R09: fffffbfff0fca983
[ 215.538674] R10: ffffc900040bfbf0 R11: fffffbfff0fca982 R12: ffff88810b1c3b38
[ 215.539687] R13: ffff88810b1c3b10 R14: ffff88810dbdecb8 R15: ffff88810b1c3b00
[ 215.540833] FS: 00007f2aabdff700(0000) GS:ffff888dfb400000(0000) knlGS:0000000000000000
[ 215.541961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 215.542775] CR2: 00007fa19a85d934 CR3: 000000010c076006 CR4: 0000000000370ee0
[ 215.543814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 215.544840] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 215.545885] Call Trace:
[ 215.546257] <TASK>
[ 215.546608] export_rdev.isra.63+0x71/0xe0
[ 215.547338] mddev_unlock+0x1b1/0x2d0
[ 215.547898] array_state_store+0x28d/0x450
[ 215.548519] md_attr_store+0xd7/0x150
[ 215.549059] ? __pfx_sysfs_kf_write+0x10/0x10
[ 215.549702] kernfs_fop_write_iter+0x1b9/0x260
[ 215.550351] vfs_write+0x491/0x760
[ 215.550863] ? __pfx_vfs_write+0x10/0x10
[ 215.551445] ? __fget_files+0x156/0x230
[ 215.552053] ksys_write+0xc0/0x160
[ 215.552570] ? __pfx_ksys_write+0x10/0x10
[ 215.553141] ? ktime_get_coarse_real_ts64+0xec/0x100
[ 215.553878] do_syscall_64+0x3a/0x90
[ 215.554403] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 215.555125] RIP: 0033:0x7f2aade11847
[ 215.555696] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec
10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05
<48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 448
[ 215.558398] RSP: 002b:00007f2aabdfeba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 215.559516] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f2aade11847
[ 215.560515] RDX: 0000000000000005 RSI: 0000000000438b8b RDI: 0000000000000010
[ 215.561512] RBP: 0000000000438b8b R08: 0000000000000000 R09: 00007f2aaecf0060
[ 215.562511] R10: 000000000e3ba40b R11: 0000000000000293 R12: 0000000000000005
[ 215.563647] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000c70750
[ 215.564693] </TASK>
[ 215.565029] irq event stamp: 15979
[ 215.565584] hardirqs last enabled at (15991): [<ffffffff811a7432>] __up_console_sem+0x52/0x60
[ 215.566806] hardirqs last disabled at (16000): [<ffffffff811a7417>] __up_console_sem+0x37/0x60
[ 215.568022] softirqs last enabled at (15716): [<ffffffff8277a2db>] __do_softirq+0x3eb/0x531
[ 215.569239] softirqs last disabled at (15711): [<ffffffff810d8f45>] irq_exit_rcu+0x115/0x160
[ 215.570434] ---[ end trace 0000000000000000 ]---

This means export_rdev() calls blkdev_put with a different holder than the
one used by blkdev_get_by_dev(). This is because mddev->major_version == -2
is not a good check for external metadata. Fix this by using
mddev->external instead.

Also, do not clear mddev->external in md_clean(), as the flag might be used
later in export_rdev().

Fixes: 2736e8eeb0cc ("block: use the holder as indication for exclusive opens")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230617052405.305871-1-song@kernel.org


# a022325a 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md/md-bitmap: add a new helper to unplug bitmap asynchronously

If bitmap is enabled, the bitmap must be updated before submitting write
io. This is why the unplug callback must move this io to
'conf->pending_io_list' if 'current->bio_list' is not empty, which
degrades performance.

A new helper md_bitmap_unplug_async() is introduced to submit bitmap io
in a kworker, so that submitting bitmap io in raid10_unplug() doesn't
require that 'current->bio_list' is empty.

This patch prepares to limit the number of plugged bios.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com


# 4d8a5754 27-May-2023 Li Nan <linan122@huawei.com>

md/raid10: clean up md_add_new_disk()

Commit 1a855a060665 ("md: fix bug with re-adding of partially recovered
device.") only adds a device which is set to In_sync. But it prevents
devices without metadata from being added when they should be.

Commit bf572541ab44 ("md: fix regression with re-adding devices to arrays
with no metadata") fixes the above issue by setting a device without
metadata to In_sync when adding a new disk.

However, after commit f466722ca614 ("md: Change handling of save_raid_disk
and metadata update during recovery.") removed the changes of the first
patch, setting In_sync for a device without metadata is meaningless,
because the flag will be cleared soon and will not be used during this
period. Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com


# 44693154 22-May-2023 Yu Kuai <yukuai3@huawei.com>

md: protect md_thread with rcu

Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:

1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.

Fix this problem by protecting all md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com


# e5e9b9cb 22-May-2023 Yu Kuai <yukuai3@huawei.com>

md: factor out a helper to wake up md_thread directly

md_wakeup_thread() can't wake up md_thread->tsk if md_thread->run is
still in progress, and in some cases md_thread->tsk needs to be woken up
directly, as in md_set_readonly() and do_md_stop().

Commit 9dfbdafda3b3 ("md: unlock mddev before reap sync_thread in
action_store") introduced a new scenario where unregistering sync_thread
is not protected by 'reconfig_mutex'. This can cause a
null-ptr-dereference in theory:

t1: md_set_readonly             t2: action_store
                                md_unregister_thread
                                // 'reconfig_mutex' is not held
// 'reconfig_mutex' is held by caller
if (mddev->sync_thread)
                                 thread = *threadp
                                 *threadp = NULL
 wake_up_process(mddev->sync_thread->tsk)
 // null-ptr-dereference

Fix this problem by factoring out a helper to wake up md_thread directly,
so that 'sync_thread' won't be accessed multiple times from the reader
side. This helper also prepare to protect md_thread with rcu.

Note that later patches will fix the problem that unregistering
sync_thread from action_store() is not protected by 'reconfig_mutex'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-2-yukuai1@huaweicloud.com


# 3ce94ce5 22-May-2023 Yu Kuai <yukuai3@huawei.com>

md: fix duplicate filename for rdev

Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev, however, this
introduces a window that rdev can be added again while the deletion is
not done yet, and sysfs will complain about duplicate filename.

Follow-up patches tried to fix this problem by flushing the workqueue;
however, flush_rdev_wq() is just dead code. The sequence in
md_kick_rdev_from_array() is:

1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);

So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass, in the meantime, if work is queued, then rdev can never be
found in the list.

flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
this approach is not good:
- the workqueue is global, this synchronization for all raid disks is
not necessary.
- flush_workqueue can't be called under 'reconfig_mutex', there is still
a small window between flush_workqueue() and mddev_lock() that other
contexts can queue new work, hence the problem is not solved completely.

sysfs already has apis to support an attribute deleting itself from its
own writer. These apis, specifically sysfs_break/unbreak_active_protection(),
are used to support deleting rdev synchronously. Therefore, the above
commit can be reverted, and the sysfs duplicate filename can be avoided.

A new mdadm regression test is proposed as well([1]).

[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/

Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com


# f8b20a40 22-May-2023 Li Nan <linan122@huawei.com>

md/raid10: fix wrong setting of max_corr_read_errors

There is no input check when writing to md/max_read_errors, so overflow
might occur. Add a check of the input number.

Fixes: 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable read errors.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com


# 6beb489b 22-May-2023 Li Nan <linan122@huawei.com>

md/raid10: fix overflow of md/safe_mode_delay

There is no input check when writing to md/safe_mode_delay in
safe_delay_store(). msec might also overflow when HZ < 1000 in
safe_delay_show(). Fix it by checking for overflow in safe_delay_store()
and using an unsigned long conversion in safe_delay_show().
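
The two kinds of overflow can be illustrated with a standalone sketch. It
is not the code from md.c: the toy_* functions and the 'hz' parameter are
hypothetical, 'hz' is assumed to be non-zero, and fixed-width 64-bit
arithmetic is used here in place of the kernel's unsigned long conversion.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a delay in milliseconds to ticks, rejecting values that do not
 * fit in an unsigned int. Returns 0 on success, -1 if 'msec' is out of
 * range.
 */
static int toy_delay_store(uint64_t msec, unsigned int hz, uint64_t *ticks)
{
	if (msec > UINT_MAX)
		return -1;
	/* Both factors fit in 32 bits, so the 64-bit product cannot wrap. */
	*ticks = msec * hz / 1000;
	return 0;
}

/* Convert ticks back to milliseconds using a wide intermediate type. */
static uint64_t toy_delay_show(unsigned int ticks, unsigned int hz)
{
	/* Widen before multiplying; 'ticks * 1000' could wrap in 32 bits. */
	return (uint64_t)ticks * 1000 / hz;
}
```

In the show direction, the multiplication by 1000 is where a 32-bit
intermediate would wrap when the tick rate is below 1000, which is why the
value is widened first.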

Fixes: 72e02075a33f ("md: factor out parsing of fixed-point numbers")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com


# 868bba54 11-May-2023 Yu Kuai <yukuai3@huawei.com>

md/raid5: fix a deadlock in the case that reshape is interrupted

If reshape is in progress and io across reshape_position is issued, such
io will wait for reshape to make progress(see details in the case that
make_stripe_request() return STRIPE_SCHEDULE_AND_RETRY).

It has been reported several times that if the system reboots while
growing raid5 to raid6, array assembly will hang indefinitely([1, 2]). This
is because the following deadlock is triggered:

1) a normal io is waiting for reshape to progress, this io can be from
system-udevd or mdadm.
2) while assemble, mdadm tries to suspend the array, hence
'reconfig_mutex' is held and mddev_suspend() must wait for normal io
to be done.
3) daemon thread can't start reshape because 'reconfig_mutex' can't be
held.

1) and 3) are unbreakable because they're fundamental to the design. In
order to break 2), the following are possible solutions that I can think
of:

a) Let mddev_suspend() fail is not a good option, because this will
break many scenarios since mddev_suspend() doesn't fail before.
b) Fail the io that is waiting for reshape to make progress from
mddev_suspend().
c) Return false for the io that is waiting for reshape to make
progress from raid5_make_request(), and these io will wait for
suspend to be done in md_handle_request(), where 'active_io' is
not grabbed.

c) sounds better than b), however, b) is used because it's easy and
straightforward, and it's verified that mdadm can assemble in this case.
On the other hand, c) breaks the logic that mddev_suspend() will wait
for submitted io to be completely handled.

Fix the problem by checking reshape in mddev_suspend(), if reshape can't
make progress and there are still some io waiting for reshape, fail
those io.

[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/

Reported-by: Jove <jovetoo@gmail.com>
Reported-by: David Gilmour <dgilmour76@gmail.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com


# 3e00777d 11-May-2023 Yu Kuai <yukuai3@huawei.com>

md: add a new api prepare_suspend() in md_personality

There are no functional changes, the new api will be used later to do
special handling for raid456 in mddev_suspend().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-5-yukuai1@huaweicloud.com


# 431e6125 11-May-2023 Yu Kuai <yukuai3@huawei.com>

md: export md_is_rdwr() and is_md_suspended()

The two apis will be used later to fix a deadlock in raid456, there are
no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-4-yukuai1@huaweicloud.com


# 873f50ec 11-May-2023 Yu Kuai <yukuai3@huawei.com>

md: fix data corruption for raid456 when reshape restart while grow up

Currently, if reshape is interrupted, echo "reshape" to sync_action will
restart reshape from scratch, for example:

echo frozen > sync_action
echo reshape > sync_action

This will corrupt data before reshape_position if the array is growing.
Fix the problem by continuing reshape from reshape_position.

Reported-by: Peter Neuwirth <reddunur@online.de>
Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-3-yukuai1@huaweicloud.com


# 05bdb996 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: replace fmode_t with a block-specific type for block open flags

The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2736e8ee 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: use the holder as indication for exclusive opens

The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ae220766 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: remove the unused mode argument to ->release

The mode argument to the ->release block_device_operation is never used,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d32e2bf8 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: pass a gendisk to ->open

->open is only called on the whole device. Make that explicit by
passing a gendisk instead of the block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 444aa2c5 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: pass a gendisk on bdev_check_media_change

bdev_check_media_change should only ever be called for the whole device.
Pass a gendisk to make that explicit and rename the function to
disk_check_media_change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0718afd4 01-Jun-2023 Christoph Hellwig <hch@lst.de>

block: introduce holder ops

Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims. It will be used to
allow the block layer to call back into the user of the block device for
things like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3c383235 31-May-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

md: use __bio_add_page to add single page

The md-raid superblock writing code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/ca196f5e650e318106dbb4496eb6cbac4bc800bd.1685532726.git.johannes.thumshirn@wdc.com

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9adcf9d3 02-Mar-2023 Luis Chamberlain <mcgrof@kernel.org>

md: simplify sysctl registration

register_sysctl_table() is a deprecated compatibility wrapper.
register_sysctl() can do the directory creation for you so just use
that.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Song Liu <song@kernel.org>


# 6efddf1e 10-Mar-2023 Yu Kuai <yukuai3@huawei.com>

md: fix soft lockup in status_resync

status_resync() will calculate 'curr_resync - recovery_active' to show
user a progress bar like following:

[============>........] resync = 61.4%

'curr_resync' and 'recovery_active' are updated in md_do_sync(), and
status_resync() can read them concurrently, hence it's possible that
'curr_resync - recovery_active' can overflow to a huge number. In this
case status_resync() will be stuck in the loop printing a large number
of '=', which ends in a soft lockup.

Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case,
this way resync in progress will be reported to user.
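
As a rough userspace illustration of the wrap-around described above (not
the kernel code): sector_t is modelled as uint64_t, and resync_done() and
bar_cells() are hypothetical helpers. The sketch clamps to 0 instead of
using MD_RESYNC_ACTIVE, which is a simplification of the actual fix.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's sector_t. */
typedef uint64_t sector_t;

/*
 * Sectors known to be in sync. A racy snapshot can make recovery_active
 * larger than curr_resync; an unsigned subtraction would then wrap to a
 * huge value, so that case is clamped to 0 here.
 */
static sector_t resync_done(sector_t curr_resync, sector_t recovery_active)
{
	if (recovery_active > curr_resync)
		return 0;
	return curr_resync - recovery_active;
}

/* Number of '=' cells to print for a bar of `width` cells, never more than width. */
static unsigned int bar_cells(sector_t done, sector_t total, unsigned int width)
{
	assert(total > 0);
	if (done >= total)
		return width;
	return (unsigned int)(done * width / total);
}
```

With the guard in place the progress loop is bounded by the bar width, no
matter what the two counters looked like when they were sampled.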

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com


# c31fea2f 06-Mar-2023 Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

md: add error_handlers for raid0 and linear

After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10")
MD_BROKEN must be set if array is failed because state_store() checks it.
If it is set then -EBUSY is returned to userspace.

For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
mdadm is unable to trigger clean-up actions. It is a regression.

This patch adds appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for this device.

Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com


# 4d72a9de 13-Feb-2023 Thomas Weißschuh <linux@weissschuh.net>

md: make kobj_type structures constant

Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214-kobj_type-md-v1-1-d6853f707f11@weissschuh.net


# 433279be 28-Mar-2023 Yu Kuai <yukuai3@huawei.com>

md: fix regression for null-ptr-deference in __md_stop()

Commit 3e453522593d ("md: Free resources in __md_stop") tried to fix a
null-ptr-dereference for 'active_io' by moving percpu_ref_exit() to
__md_stop(). However, the commit also moved the exit of 'writes_pending'
to __md_stop(), which breaks the mdadm tests:

BUG: kernel NULL pointer dereference, address: 0000000000000038
Oops: 0000 [#1] PREEMPT SMP
CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37
RIP: 0010:free_percpu+0x465/0x670
Call Trace:
<TASK>
__percpu_ref_exit+0x48/0x70
percpu_ref_exit+0x1a/0x90
__md_stop+0xe9/0x170
do_md_stop+0x1e1/0x7b0
md_ioctl+0x90c/0x1aa0
blkdev_ioctl+0x19b/0x400
vfs_ioctl+0x20/0x50
__x64_sys_ioctl+0xba/0xe0
do_syscall_64+0x6c/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0xcd

And the problem can be reproduced 100% by the following test:

mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force
echo inactive > /sys/block/md0/md/array_state
echo read-auto > /sys/block/md0/md/array_state
echo inactive > /sys/block/md0/md/array_state

Root cause:

// start raid
raid1_run
 mddev_init_writes_pending
  percpu_ref_init

// inactive raid
array_state_store
 do_md_stop
  __md_stop
   percpu_ref_exit

// start raid again
array_state_store
 do_md_run
  raid1_run
   mddev_init_writes_pending
    if (mddev->writes_pending.percpu_count_ptr)
    // won't reinit

// inactive raid again
...
percpu_ref_exit
-> null-ptr-dereference

Before the commit, 'writes_pending' is exited when mddev is freed, and
it's safe to restart raid because mddev_init_writes_pending() already makes
sure that 'writes_pending' will only be initialized once.

Fix the problem by moving 'writes_pending' back. It's a little hard to find
the relationship between allocating and freeing the memory, however, the
code change is much smaller and we have lived with this for a long time already.
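
The root cause above can be modelled in a few lines of userspace C. This is
a deliberately simplified sketch: struct ref, ref_init_once(), ref_exit()
and ref_usable() are hypothetical names, and the "inited" flag only stands
in for the stale guard shown in the trace, not for the real percpu_ref
internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a counter that is initialized at most once. */
struct ref {
	bool inited;	/* guard checked by the init-once helper */
	int *data;	/* backing storage released by ref_exit() */
};

/* Initialize only if the guard is clear; returns 0 on success, -1 on OOM. */
static int ref_init_once(struct ref *r)
{
	assert(r != NULL);
	if (r->inited)
		return 0;	/* "won't reinit" */
	r->data = calloc(1, sizeof(*r->data));
	if (!r->data)
		return -1;
	r->inited = true;
	return 0;
}

/* Release the storage but leave the guard set, as in the modelled bug. */
static void ref_exit(struct ref *r)
{
	free(r->data);
	r->data = NULL;
}

/* True while the backing storage is still present. */
static bool ref_usable(const struct ref *r)
{
	return r->data != NULL;
}
```

Running init, exit, init in this model leaves the counter without storage,
which corresponds to the second percpu_ref_exit() hitting a NULL pointer.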

Fixes: 3e453522593d ("md: Free resources in __md_stop")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com


# 3bc57292 05-Mar-2023 NeilBrown <neilb@suse.de>

md: avoid signed overflow in slot_store()

slot_store() uses kstrtouint() to get a slot number, but stores the
result in an "int" variable (by casting a pointer).
This can result in a negative slot number if the unsigned int value is
very large.

A negative number means that the slot is empty, but setting a negative
slot number this way will not remove the device from the array. I don't
think this is a serious problem, but it could cause confusion and it is
best to fix it.
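
A minimal userspace sketch of the range check, assuming strtoul() in place
of kstrtouint(). parse_slot() is a hypothetical helper and the -EINVAL
return value is an assumption of this sketch, not necessarily what
slot_store() returns.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse an unsigned decimal value and store it in an int slot.
 * Returns 0 on success, -EINVAL for malformed input or for values that
 * would turn negative when stored in an int.
 */
static int parse_slot(const char *buf, int *slot)
{
	char *end;
	unsigned long val;

	assert(buf != NULL && slot != NULL);
	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	if (val > INT_MAX)
		return -EINVAL;	/* too large for a non-negative int */
	*slot = (int)val;
	return 0;
}
```

Checking the range before the conversion keeps a very large input from
being mistaken for the "empty slot" negative value.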

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>


# 3e453522 21-Feb-2023 Xiao Ni <xni@redhat.com>

md: Free resources in __md_stop

If md_run() fails after ->active_io is initialized, then percpu_ref_exit
is called in error path. However, later md_free_disk will call
percpu_ref_exit again which leads to a panic because of null pointer
dereference. It can also trigger this bug when resources are initialized
but are freed in error path, then will be freed again in md_free_disk.

BUG: kernel NULL pointer dereference, address: 0000000000000038
Oops: 0000 [#1] PREEMPT SMP
Workqueue: md_misc mddev_delayed_delete
RIP: 0010:free_percpu+0x110/0x630
Call Trace:
<TASK>
__percpu_ref_exit+0x44/0x70
percpu_ref_exit+0x16/0x90
md_free_disk+0x2f/0x80
disk_release+0x101/0x180
device_release+0x84/0x110
kobject_put+0x12a/0x380
kobject_put+0x160/0x380
mddev_delayed_delete+0x19/0x30
process_one_work+0x269/0x680
worker_thread+0x266/0x640
kthread+0x151/0x1b0
ret_from_fork+0x1f/0x30

For creating raid device, md raid calls do_md_run->md_run, dm raid calls
md_run. We alloc those memory in md_run. For stopping raid device, md raid
calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can
free those memory resources in __md_stop.

Fixes: 72adae23a72c ("md: Change active_io to percpu")
Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# 76fed014 02-Feb-2023 Xiao Ni <xni@redhat.com>

md: account io_acct_set usage with active_io

io_acct_set was enabled for raid0/raid5 io accounting. bios that contain
md_io_acct are allocated in the i/o path. There isn't a good method to
monitor if these bios are all finished and freed. In the takeover process,
io_acct_set (which is used for bios with md_io_acct) need to be freed.
However, if some bios finish after io_acct_set is freed, it may trigger
the following panic:

[ 6973.767999] RIP: 0010:mempool_free+0x52/0x80
[ 6973.786098] Call Trace:
[ 6973.786549] md_end_io_acct+0x31/0x40
[ 6973.787227] blk_update_request+0x224/0x380
[ 6973.787994] blk_mq_end_request+0x1a/0x130
[ 6973.788739] blk_complete_reqs+0x35/0x50
[ 6973.789456] __do_softirq+0xd7/0x2c8
[ 6973.790114] ? sort_range+0x20/0x20
[ 6973.790763] run_ksoftirqd+0x2a/0x40
[ 6973.791400] smpboot_thread_fn+0xb5/0x150
[ 6973.792114] kthread+0x10b/0x130
[ 6973.792724] ? set_kthread_struct+0x50/0x50
[ 6973.793491] ret_from_fork+0x1f/0x40

Fix this by increasing and decreasing active_io for each bio with
md_io_acct so that mddev_suspend() will wait until all bios from
io_acct_set finish before freeing io_acct_set.

Reported-by: Fine Fan <ffan@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# ed821cf8 01-Feb-2023 Hou Tao <houtao1@huawei.com>

md: use MD_RESYNC_* whenever possible

Just replace magic numbers by MD_RESYNC_* enumerations.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>


# 07dbb135 20-Jan-2023 Xiao Ni <xni@redhat.com>

md: Free writes_pending in md_stop

dm raid calls md_stop to stop the raid device. It needs to
free the writes_pending here.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# 72adae23 30-Jan-2023 Xiao Ni <xni@redhat.com>

md: Change active_io to percpu

Now the type of active_io is atomic. It's used to count how many ios are
in the submitting process and it's added and decreased every time. But it
only needs to check if it's zero when suspending the raid. So we can
switch atomic to percpu to improve the performance.

After switching active_io to percpu type, we use the state of active_io
to judge if the raid device is suspended. And we don't need to wake up
->sb_wait in md_handle_request anymore. It's done in the callback function
which is registered when initing active_io. The argument mddev->suspended
is only used to count how many users are trying to set raid to suspend
state.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# d1932913 30-Jan-2023 Xiao Ni <xni@redhat.com>

md: Factor out is_md_suspended helper

This helper function will be used in the next patch. It also makes the
code easier to understand.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# 1d1f25bf 31-Jan-2023 Hou Tao <houtao1@huawei.com>

md: don't update recovery_cp when curr_resync is ACTIVE

Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise
md may skip the resync of the first 3 sectors if the resync procedure is
interrupted before the first calling of ->sync_request() as shown below:

md_do_sync thread                       control thread
  // setup resync
  mddev->recovery_cp = 0
  j = 0
  mddev->curr_resync = MD_RESYNC_ACTIVE

                                        // e.g., set array as idle
                                        set_bit(MD_RECOVERY_INTR, &mddev->recovery)
  // resync loop
  // check INTR before calling sync_request
  !test_bit(MD_RECOVERY_INTR, &mddev->recovery)

  // resync interrupted
  // update recovery_cp from 0 to 3
  // the resync of the first 3 sectors will be skipped
  mddev->recovery_cp = 3

Fixes: eac58d08d493 ("md: Use enum for overloaded magic numbers used by mddev->curr_resync")
Cc: stable@vger.kernel.org # 6.0+
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>


# b0907cad 09-Jan-2023 Adrian Huang <ahuang12@lenovo.com>

md: fix incorrect declaration about claim_rdev in md_import_device

Commit fb541ca4c365 ("md: remove lock_bdev / unlock_bdev") removes
wrappers for blkdev_get/blkdev_put. However, the uninitialized local
static variable of pointer type 'claim_rdev' in md_import_device()
is NULL, which leads to the following warning call trace:

WARNING: CPU: 22 PID: 1037 at block/bdev.c:577 bd_prepare_to_claim+0x131/0x150
CPU: 22 PID: 1037 Comm: mdadm Not tainted 6.2.0-rc3+ #69
..
RIP: 0010:bd_prepare_to_claim+0x131/0x150
..
Call Trace:
<TASK>
? _raw_spin_unlock+0x15/0x30
? iput+0x6a/0x220
blkdev_get_by_dev.part.0+0x4b/0x300
md_import_device+0x126/0x1d0
new_dev_store+0x184/0x240
md_attr_store+0x80/0xf0
kernfs_fop_write_iter+0x128/0x1c0
vfs_write+0x2be/0x3c0
ksys_write+0x5f/0xe0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

It turns out the md device cannot be used:

md: could not open device unknown-block(259,0).
md: md127 stopped.

Fix the issue by declaring the local static variable of struct type
and passing the pointer of the variable to blkdev_get_by_dev().
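
The difference between the two declarations can be shown in a small
userspace sketch. claim(), import_buggy(), import_fixed() and struct
holder_tag are hypothetical names; claim() only models "a NULL holder is
rejected" and is not the block layer API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical type whose address serves as the holder cookie. */
struct holder_tag {
	int unused;
};

/* Model of an exclusive claim: a NULL holder is refused. */
static int claim(void *holder)
{
	return holder ? 0 : -EINVAL;
}

/* A static pointer is zero-initialized, so the holder passed is NULL. */
static int import_buggy(void)
{
	static struct holder_tag *claim_tag;

	return claim(claim_tag);
}

/* A static object has a stable, non-NULL address to pass as holder. */
static int import_fixed(void)
{
	static struct holder_tag claim_tag;

	assert(&claim_tag != NULL);
	return claim(&claim_tag);
}
```

Only the address matters in the fixed variant; the object itself is never
read or written.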

Fixes: fb541ca4c365 ("md: remove lock_bdev / unlock_bdev")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 613b1488 04-Jan-2023 Jens Axboe <axboe@kernel.dk>

block: handle bio_split_to_limits() NULL return

This can't happen right now, but in preparation for allowing
bio_split_to_limits() returning NULL if it ended the bio, check for it
in all the callers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b5c1acf0 29-Nov-2022 Christoph Hellwig <hch@lst.de>

md: fold unbind_rdev_from_array into md_kick_rdev_from_array

unbind_rdev_from_array is only called from md_kick_rdev_from_array, so
merge it into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# d57d9d69 29-Nov-2022 Christoph Hellwig <hch@lst.de>

md: mark md_kick_rdev_from_array static

md_kick_rdev_from_array is only used in md.c, so unexport it and mark
the symbol static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# fb541ca4 29-Nov-2022 Christoph Hellwig <hch@lst.de>

md: remove lock_bdev / unlock_bdev

These wrappers for blkdev_get / blkdev_put just horribly confuse the
code with their odd naming. Remove them and improve the error unwinding
in md_import_device with the now folded code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 341097ee 04-Nov-2022 Mikulas Patocka <mpatocka@redhat.com>

md: fix a crash in mempool_free

There's a crash in mempool_free when running the lvm test
shell/lvchange-rebuild-raid.sh.

The reason for the crash is this:
* super_written calls atomic_dec_and_test(&mddev->pending_writes) and
wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
and bio_put(bio).
* so, the process that waited on sb_wait and that is woken up is racing
with bio_put(bio).
* if the process wins the race, it calls bioset_exit before bio_put(bio)
is executed.
* bio_put(bio) attempts to free a bio into a destroyed bio set - causing
a crash in mempool_free.

We fix this bug by moving bio_put before atomic_dec_and_test.

We also move rdev_dec_pending before atomic_dec_and_test as suggested by
Neil Brown.

The function md_end_flush has a similar bug - we must call bio_put before
we decrement the number of in-progress bios.
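
The ordering rule can be sketched in userspace C11 as follows. struct pool,
struct ctx, pool_free() and complete_one() are hypothetical names; the
sketch only shows "release first, then drop the count" and leaves out the
waiter and the real bio set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pool that must still be alive when objects are returned. */
struct pool {
	bool alive;
	int freed;
};

/* Hypothetical per-array context with a count of in-flight requests. */
struct ctx {
	atomic_int pending;
	struct pool *pool;
};

/* Analogue of bio_put(): only valid while the pool exists. */
static void pool_free(struct pool *p)
{
	assert(p->alive);
	p->freed++;
}

/*
 * Completion handler: return the object to the pool first, then drop the
 * count. Returns true when this was the last in-flight request, i.e. the
 * point at which a waiter may be woken and may tear the pool down.
 */
static bool complete_one(struct ctx *c)
{
	pool_free(c->pool);
	return atomic_fetch_sub(&c->pending, 1) == 1;
}
```

Because the pool is touched before the decrement, a waiter that reacts to
the count reaching zero can no longer destroy the pool underneath the
handler.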

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 11557f0067 P4D 11557f0067 PUD 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: kdelayd flush_expired_bios [dm_delay]
RIP: 0010:mempool_free+0x47/0x80
Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
Call Trace:
<TASK>
clone_endio+0xf4/0x1c0 [dm_mod]
clone_endio+0xf4/0x1c0 [dm_mod]
__submit_bio+0x76/0x120
submit_bio_noacct_nocheck+0xb6/0x2a0
flush_expired_bios+0x28/0x2f [dm_delay]
process_one_work+0x1b4/0x300
worker_thread+0x45/0x3e0
? rescuer_thread+0x380/0x380
kthread+0xc2/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>


# f97a5528 19-Sep-2022 Ye Bin <yebin10@huawei.com>

md: introduce md_ro_state

Introduce md_ro_state for mddev->ro, so it is easy to understand.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>


# 2f6d261e 19-Sep-2022 Ye Bin <yebin10@huawei.com>

md: factor out __md_set_array_info()

Factor out __md_set_array_info(). No functional change.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>


# 568ec936 27-Sep-2022 Christoph Hellwig <hch@lst.de>

block: replace blk_queue_nowait with bdev_nowait

Replace blk_queue_nowait with a bdev_nowait helper that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3bfc3bcd 08-Sep-2022 Logan Gunthorpe <logang@deltatee.com>

md: Remove extra mddev_get() in md_seq_start()

A regression is seen where mddev devices stay permanently after they
are stopped due to an elevated reference count.

This was tracked down to an extra mddev_get() in md_seq_start().

It only happened rarely because most of the time the md_seq_start()
is called with a zero offset. The path with an extra mddev_get() only
happens when it starts with a non-zero offset.

The commit noted below changed an mddev_get() to check its success
but inadvertently left the original call in. Remove the extra call.

Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 0dd84b31 17-Aug-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

md: call __md_stop_writes in md_stop

From the link [1], we can see raid1d was running even after the path
raid_dtr -> md_stop -> __md_stop.

Let's stop writes first in the destructor to align with normal md-raid and
fix the KASAN issue.

[1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a

Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 1d258758 17-Aug-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

Revert "md-raid: destroy the bitmap after destroying the thread"

This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. Although it
fixed the KASAN issue for dm-raid, it obviously breaks clustered raid, as
noticed by Neil. Revert it and fix the KASAN issue in the next commit.

[1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470

Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 5e8daf90 11-Aug-2022 David Sloan <david.sloan@eideticom.com>

md: Flush workqueue md_rdev_misc_wq in md_alloc()

A race condition still exists when removing and re-creating md devices
in test cases. However, it is only seen on some setups.

The race condition was tracked down to a reference still being held
to the kobject by the rdev in the md_rdev_misc_wq which will be released
in rdev_delayed_delete().

md_alloc() waits for previous deletions by waiting on the md_misc_wq,
but the md_rdev_misc_wq may still be holding a reference to a recently
removed device.

To fix this, also flush the md_rdev_misc_wq in md_alloc().

Signed-off-by: David Sloan <david.sloan@eideticom.com>
[logang@deltatee.com: rewrote commit message]
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>


# 5a97806f 26-Jul-2022 Christoph Hellwig <hch@lst.de>

block: change the blk_queue_split calling convention

The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers are passed after the
bio to avoid constant reshuffling of the argument passing registers.

Also give it and the helpers used to implement it more descriptive names.
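
The two calling conventions can be contrasted in a small userspace sketch.
struct req, split_by_ref() and split_ret() are hypothetical stand-ins, with
a request reduced to a length; the sketch covers only the double pointer
versus return value aspect, not the argument ordering.

```c
#include <assert.h>

/* Hypothetical, much-simplified stand-in for a bio: just a length. */
struct req {
	unsigned int len;
};

/*
 * Double-indirect convention: the caller passes a pointer to its pointer
 * and the helper may redirect it to the split-off front part.
 */
static void split_by_ref(struct req **rp, struct req *spare, unsigned int max)
{
	struct req *r = *rp;

	assert(max > 0);
	if (r->len <= max)
		return;		/* nothing to split, *rp unchanged */
	spare->len = max;	/* front part goes into the spare */
	r->len -= max;		/* remainder stays in the original */
	*rp = spare;
}

/* Return-value convention: hand back the original or the split request. */
static struct req *split_ret(struct req *r, struct req *spare, unsigned int max)
{
	assert(max > 0);
	if (r->len <= max)
		return r;
	spare->len = max;
	r->len -= max;
	return spare;
}
```

Both helpers do the same work; the second lets the caller keep the request
pointer in a register instead of forcing it into memory.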

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e151db8e 24-Jul-2022 Mikulas Patocka <mpatocka@redhat.com>

md-raid: destroy the bitmap after destroying the thread

When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with
kasan, we got failure in write_page.

The reason for the failure is that md_bitmap_destroy is called before
destroying the thread and the thread may be waiting in the function
write_page for the bio to complete. When the thread finishes waiting, it
executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which
triggers the kasan warning.

Note that the commit 48df498daf62 that caused this bug claims that it is
needed for md-cluster; you should check md-cluster and possibly find
another bugfix for it.

BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod]
Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539

CPU: 10 PID: 5539 Comm: mdX_raid1 Not tainted 5.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0x45/0x57a
? __lock_text_start+0x18/0x18
? write_page+0x18d/0x680 [md_mod]
kasan_report+0xa8/0xe0
? write_page+0x18d/0x680 [md_mod]
kasan_check_range+0x13f/0x180
write_page+0x18d/0x680 [md_mod]
? super_sync+0x4d5/0x560 [dm_raid]
? md_bitmap_file_kick+0xa0/0xa0 [md_mod]
? rs_set_dev_and_array_sectors+0x2e0/0x2e0 [dm_raid]
? mutex_trylock+0x120/0x120
? preempt_count_add+0x6b/0xc0
? preempt_count_sub+0xf/0xc0
md_update_sb+0x707/0xe40 [md_mod]
md_reap_sync_thread+0x1b2/0x4a0 [md_mod]
md_check_recovery+0x533/0x960 [md_mod]
raid1d+0xc8/0x2a20 [raid1]
? var_wake_function+0xe0/0xe0
? psi_group_change+0x411/0x500
? preempt_count_sub+0xf/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? raid1_end_read_request+0x2a0/0x2a0 [raid1]
? preempt_count_sub+0xf/0xc0
? _raw_spin_unlock_irqrestore+0x19/0x40
? del_timer_sync+0xa9/0x100
? try_to_del_timer_sync+0xc0/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? __list_del_entry_valid+0x68/0xa0
? finish_wait+0xa3/0x100
md_thread+0x161/0x260 [md_mod]
? unregister_md_personality+0xa0/0xa0 [md_mod]
? _raw_spin_lock_irqsave+0x78/0xc0
? prepare_to_wait_event+0x2c0/0x2c0
? unregister_md_personality+0xa0/0xa0 [md_mod]
kthread+0x148/0x180
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

Allocated by task 5522:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x80/0xa0
md_bitmap_create+0xa8/0xe80 [md_mod]
md_run+0x777/0x1300 [md_mod]
raid_ctr+0x249c/0x4a30 [dm_raid]
dm_table_add_target+0x2b0/0x620 [dm_mod]
table_load+0x1c8/0x400 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 5680:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x40
kasan_set_free_info+0x20/0x40
__kasan_slab_free+0xf7/0x140
kfree+0x80/0x240
md_bitmap_free+0x1c3/0x280 [md_mod]
__md_stop+0x21/0x120 [md_mod]
md_stop+0x9/0x40 [md_mod]
raid_dtr+0x1b/0x40 [dm_raid]
dm_table_destroy+0x98/0x1e0 [dm_mod]
__dm_destroy+0x199/0x360 [dm_mod]
dev_remove+0x10c/0x160 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 34cb92c0 23-Jul-2022 Christoph Hellwig <hch@lst.de>

md: return the allocated devices from md_alloc

Two callers of md_alloc want to use the newly allocated device, so
return it instead of letting them find it cumbersomely after the
allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a1108768 23-Jul-2022 Christoph Hellwig <hch@lst.de>

md: open code md_probe in autorun_devices

autorun_devices should not be limited to the controls for the legacy
probe on open, so just call md_alloc directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c0250d16 21-Jul-2022 Yang Li <yang.lee@linux.alibaba.com>

md: remove unneeded semicolon

Eliminate the following coccicheck warning:
./drivers/md/md.c:8208:2-3: Unneeded semicolon

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2198c51a 20-Jul-2022 Stephen Rothwell <sfr@canb.auug.org.au>

md: fix build failure for !MODULE

After merging the block tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/md/md.c:717:22: error: 'mddev_find' defined but not used [-Werror=unused-function]
717 | static struct mddev *mddev_find(dev_t unit)
| ^~~~~~~~~~
cc1: all warnings being treated as errors

Caused by commit

4500d5c17910 ("md: simplify md_open")

Make mddev_find() available only for non-modular builds.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220721131132.070be166@canb.auug.org.au
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5b26804b 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: simplify md_open

Now that devices are on the all_mddevs list until the gendisk is freed,
there can't be any duplicates. Remove the global list lookup and just
grab a reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 12a6caf2 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: only delete entries from all_mddevs when the disk is freed

This ensures device names don't get prematurely reused. Instead add a
deleted flag to skip already deleted devices in mddev_get and other
places that only want to see live mddevs.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 16648bac 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: stop using for_each_mddev in md_exit

Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock. Also delete the now unused for_each_mddev
macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f2651434 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: stop using for_each_mddev in md_notify_reboot

Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b0e706a1 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: stop using for_each_mddev in md_do_sync

Just do a plain list_for_each that only grabs a mddev reference in
the case where the thread sleeps and restarts the list iteration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2652a1bd 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: factor out the rdev overlaps check from rdev_size_store

This splits the code into nicely readable chunks and also avoids
the refcount inc/dec manipulations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 33b614e3 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: rename md_free to md_kobj_release

The md_free name is rather misleading, so pick a better one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e8c59ac4 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: implement ->free_disk

Ensure that all private data is only freed once all accesses are done.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c57094a6 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: fix error handling in md_alloc

Error handling in md_alloc is a mess. Untangle it to just free the mddev
directly before add_disk is called and thus the gendisk is globally
visible. After that clear the hold flag and let the mddev_put take care
of cleaning up the mddev through the usual mechanisms.

Fixes: 5e55e2f5fc95 ("[PATCH] md: convert compile time warnings into runtime warnings")
Fixes: 9be68dd7ac0e ("md: add error handling support for add_disk()")
Fixes: 7ad1069166c0 ("md: properly unwind when failing to add the kobject in md_alloc")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ca39f750 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: fix mddev->kobj lifetime

Once a kobject is initialized, the containing object should not be
directly freed. So delay initialization until it is added. Also
remove the kobject_del call as the last put will remove the kobject as
well. The explicit delete isn't needed here, and dropping it will
simplify further fixes.

With this md_free now does not need to check that ->gendisk is non-NULL
as it is always set by the time that kobject_init is called on
mddev->kobj.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9dfbdafd 20-Jun-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

md: unlock mddev before reap sync_thread in action_store

Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
with reconfig_mutex held") is specific to the action_store path, the other
callers that reap sync_thread did not need to be changed.

Pull md_unregister_thread from md_reap_sync_thread, then fix the previous
bug as follows:

1. Unlock mddev before md_reap_sync_thread in action_store.
2. Save reshape_position before unlocking, then restore it to ensure the
position is not changed accidentally by others.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 05ce7fb9 31-May-2022 Chris Webb <chris@arachsys.com>

md: Explicitly create command-line configured devices

Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.

Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.

Signed-off-by: Chris Webb <chris@arachsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9973f0fa 08-Jun-2022 Logan Gunthorpe <logang@deltatee.com>

md: Notify sysfs sync_completed in md_reap_sync_thread()

The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)

When the background mdadm process hangs, it is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".

To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b368856a 08-Jun-2022 Logan Gunthorpe <logang@deltatee.com>

md: Ensure resync is reported after it starts

The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:

mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
File exists

This is because the background mdadm process, which is responsible for
cleaning up this backup file, gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file to see whether a
reshape is in progress and, if it is not, waits for an event on the file
or times out after 5 seconds. On faster machines, the reshape may complete
before the 5-second timeout expires, and thus the background mdadm process
loops waiting for a reshape to start that has already occurred.

mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.

The reason mdstat does not report that the reshape has started is an issue
in status_resync(). recovery_active is subtracted from curr_resync, which
results in a value of zero for the first chunk of reshaped data, and
the resulting read reports no reshape in progress.

To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
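
The clamp described above can be sketched as a small userspace function. MD_RESYNC_ACTIVE is named in the commit text; the other enum names and the numeric values are assumptions here (chosen so that 3 or greater means an active resync), and reported_resync() is a hypothetical helper rather than the actual status_resync() code.

```c
#include <assert.h>
#include <stdint.h>

/* Special values of curr_resync. Values are an assumption for this
 * sketch: anything >= MD_RESYNC_ACTIVE counts as a running resync. */
enum {
	MD_RESYNC_NONE    = 0,
	MD_RESYNC_YIELDED = 1,
	MD_RESYNC_DELAYED = 2,
	MD_RESYNC_ACTIVE  = 3,
};

/* Hypothetical helper: the progress value to report to userspace. */
uint64_t reported_resync(uint64_t curr_resync, uint64_t recovery_active)
{
	uint64_t resync;

	if (curr_resync < MD_RESYNC_ACTIVE)
		return curr_resync;     /* special value, pass through as-is */

	/* Subtract in-flight sectors without wrapping below zero. */
	resync = curr_resync > recovery_active ?
		 curr_resync - recovery_active : 0;

	/* The difference landed on a special value: keep reporting "active". */
	if (resync < MD_RESYNC_ACTIVE)
		resync = MD_RESYNC_ACTIVE;
	return resync;
}
```

With this clamp, a reshape whose first chunk is still in flight is reported as active instead of being mistaken for "no resync".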

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# eac58d08 08-Jun-2022 Logan Gunthorpe <logang@deltatee.com>

md: Use enum for overloaded magic numbers used by mddev->curr_resync

Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.

The only functional change is in a couple of places that used the wrong
comparison operator, implying that 3 is another special value. They are all
fixed to imply that 3 or greater is an active resync.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ce4c73f 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

md/core: Combine two sync_page_io() arguments

Improve uniformity in the kernel of handling of request operation and
flags by passing these as a single argument.

Cc: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 900d156b 12-Jul-2022 Christoph Hellwig <hch@lst.de>

block: remove bdevname

Replace the remaining calls of bdevname with snprintf using the %pg
format specifier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8b9ab626 19-Jun-2022 Christoph Hellwig <hch@lst.de>

block: remove blk_cleanup_disk

blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d0a18034 06-Jun-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

Revert "md: don't unregister sync_thread with reconfig_mutex held"

The 07reshape5intr test is broken because of the path below.

md_reap_sync_thread
-> mddev_unlock
-> md_unregister_thread(&mddev->sync_thread)

And md_check_recovery is triggered by,

mddev_unlock -> md_wakeup_thread(mddev->thread)

then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
reshape_position can't be updated accordingly.

Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 42b805af 12-May-2022 Xiao Ni <xni@redhat.com>

md: fix double free of io_acct_set bioset

io_acct_set is now allocated and freed in the personality. Remove the code
that frees io_acct_set in md_free and md_stop.

Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality)
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# 913cce5a 12-May-2022 Christoph Hellwig <hch@lst.de>

md: remove most calls to bdevname

Use the %pg format specifier to save on stack consumption and code size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 1e267742 29-Apr-2022 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: protect md_unregister_thread from reentrancy

Generally, md_unregister_thread is called with reconfig_mutex held, but
raid_message in dm-raid doesn't hold reconfig_mutex to unregister the thread,
so md_unregister_thread can in theory be called simultaneously from two call
sites.

After the previous commit, which removes the protection of reconfig_mutex
for md_unregister_thread completely, the potential issue could be worse
than before.

Take pers_lock at the beginning of the function to make it safe against
reentrancy.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 8b48ec23 12-Feb-2021 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: don't unregister sync_thread with reconfig_mutex held

Unregistering sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure the array.

Holding it could also cause a deadlock for raid5 as follows:

1. Process A tried to reap the sync thread with reconfig_mutex held after
echo idle to sync_action.
2. The raid5 sync thread was blocked because there were too many active
stripes.
3. SB_CHANGE_PENDING was set (because write IO came from the upper layer),
so the number of active stripes couldn't decrease.
4. SB_CHANGE_PENDING couldn't be cleared since md_check_recovery was not
able to take reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <song@kernel.org>


# 9151ad5d 21-Apr-2022 David Sloan <david.sloan@eideticom.com>

md: Replace role magic numbers with defined constants

There are several instances where magic numbers are used in md.c instead
of the defined constants in md_p.h. This patch set improves code
readability by replacing all occurrences of 0xffff, 0xfffe, and 0xfffd when
relating to md roles with their equivalent defined constant.
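
As an illustration of the readability gain, here is a minimal sketch that classifies a role value using named constants. The three numeric values come from the commit message; the macro names are given as recalled from md_p.h and should be checked against the header, and enum role_kind and classify_role() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the commit message. Macro names are an assumption
 * and should be verified against md_p.h. */
#define MD_DISK_ROLE_SPARE   0xffff
#define MD_DISK_ROLE_FAULTY  0xfffe
#define MD_DISK_ROLE_JOURNAL 0xfffd

/* Hypothetical classification used only in this sketch. */
enum role_kind { ROLE_SLOT, ROLE_SPARE, ROLE_FAULTY, ROLE_JOURNAL };

enum role_kind classify_role(uint16_t role)
{
	switch (role) {
	case MD_DISK_ROLE_SPARE:   return ROLE_SPARE;
	case MD_DISK_ROLE_FAULTY:  return ROLE_FAULTY;
	case MD_DISK_ROLE_JOURNAL: return ROLE_JOURNAL;
	default:                   return ROLE_SLOT;  /* treated as a slot number */
	}
}
```

Compared with comparing against a bare 0xfffe, the case labels state what each special value means.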

Signed-off-by: David Sloan <david.sloan@eideticom.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>


# 92d9aac9 31-Mar-2022 Heming Zhao <heming.zhao@suse.com>

md: replace deprecated strlcpy & remove duplicated line

This commit covers two topics:

1> replace deprecated strlcpy

Change strlcpy to strscpy, because strlcpy is marked as deprecated in
Documentation/process/deprecated.rst.

2> remove duplicated strlcpy line

In md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy() calls.
The history:

- commit cf921cc19cf7 ("Add node recovery callbacks") introduced the first
usage of strlcpy().

- commit b97e92574c0b ("Use separate bitmaps for each nodes in the cluster")
introduced the second strlcpy(). At that point the two strlcpy() calls were
identical, and either one could be removed safely.

- commit d3b178adb3a3 ("md: Skip cluster setup for dm-raid") added dm-raid
special handling, with the "nodes" value as the key of that patch. From
that patch on, the strlcpy() introduced by b97e92574c0bf became
necessary.

- commit 3c462c880b52 ("md: Increment version for clustered bitmaps") used
the clustered major version to restrict handling to clustered environments.
That patch can be seen as a cleanup of the clustered code logic.

So the strlcpy() from cf921cc19cf7 became useless after d3b178adb3a3a, and
can be removed safely.
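
For the strlcpy to strscpy replacement in the first topic, the behavioural difference that matters is how truncation is reported. The sketch below is a userspace approximation of strscpy()-style semantics under that assumption; bounded_copy() is a hypothetical name, not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace approximation of strscpy()-style semantics:
 * returns the number of characters copied (excluding the NUL), or
 * -E2BIG if size is 0 or the source had to be truncated. */
long bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -E2BIG;

	/* Look at no more than size bytes of src. */
	for (len = 0; len < size && src[len] != '\0'; len++)
		;

	if (len == size) {              /* does not fit: truncate and say so */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}
	memcpy(dst, src, len + 1);      /* includes the terminating NUL */
	return (long)len;
}
```

Unlike strlcpy(), which returns the full source length and therefore walks the whole source string, this style never reads beyond the destination size and signals truncation with an error value.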

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>


# 64c54d92 08-Apr-2022 Xiaomeng Tong <xiam0nd.tong@gmail.com>

md: fix an incorrect NULL check in md_reload_sb

The bug is here:
if (!rdev || rdev->desc_nr != nr) {

The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each_rcu(), so it is incorrect to assume that the
iterator value will be NULL if the list is empty or no element is
found (in fact, it will be a bogus pointer to an invalid struct
object containing the HEAD). Such a pointer can bypass the check
and lead to an invalid memory access.

To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.
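
The iterator/result split can be shown with a minimal userspace list. This is a simplified sketch: the list helpers are cut-down stand-ins for the kernel's list API, and find_rdev() is a hypothetical function, not the actual md_reload_sb() code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive circular list, loosely modelled on the kernel's
 * struct list_head; everything here is a simplified stand-in. */
struct list_head { struct list_head *next, *prev; };

struct rdev { int desc_nr; struct list_head same_set; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

void list_init(struct list_head *h) { h->next = h->prev = h; }

void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Return the entry with the given desc_nr, or NULL if none matches. */
struct rdev *find_rdev(struct list_head *head, int nr)
{
	struct rdev *rdev = NULL;       /* only ever set on a real match */
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct rdev *iter = container_of(pos, struct rdev, same_set);

		if (iter->desc_nr == nr) {
			rdev = iter;
			break;
		}
	}
	return rdev;                    /* NULL is now a reliable "not found" */
}
```

Because 'rdev' is written only inside the match branch, a later NULL test means what it appears to mean, even for an empty list.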

Cc: stable@vger.kernel.org
Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>


# fc873834 08-Apr-2022 Xiaomeng Tong <xiam0nd.tong@gmail.com>

md: fix an incorrect NULL check in does_sb_need_changing

The bug is here:
if (!rdev)

The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element is found.
Such a pointer bypasses the NULL check and leads to an invalid memory
access.

To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.

Cc: stable@vger.kernel.org
Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update")
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Song Liu <song@kernel.org>


# 9631abdb 22-Mar-2022 Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

md: Set MD_BROKEN for RAID1 and RAID10

There is no direct mechanism to determine raid failure outside the
personality. It is done by checking rdev->flags after executing
md_error(). If the "faulty" flag is not set, -EBUSY is returned to
userspace. -EBUSY means that the array will fail after drive removal.

Mdadm has a special routine to handle array failure, which is executed
if -EBUSY is returned by md.

There are at least two known reasons not to consider this mechanism
correct:
1. A drive can be removed even if the array will fail as a result[1].
2. -EBUSY seems to be the wrong status. The array is not busy, but the
removal process cannot proceed safely.

The -EBUSY expectation cannot be removed without breaking compatibility
with userspace. This patch resolves the first issue by adding support
for the MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added
in the next commit.

The idea is to set MD_BROKEN once we are sure that the raid is in a failed
state. This is done in each error_handler(). md_error() then checks the
MD_BROKEN flag; if it is set, -EBUSY is returned to userspace.

As in the previous commit, this makes "mdadm --set-faulty" able to
fail the array. The previously proposed workaround is valid if the optional
functionality[1] is disabled.

[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
RAID1/RAID10.")

Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>


# 44abff2c 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD

Secure erase is a very different operation from discard in that it is
a data integrity operation vs hint. Fully split the limits and helper
infrastructure to make the separation more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs2]
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 10f0d2a5 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: add a bdev_nonrot helper

Add a helper to check the nonrot flag based on the block_device instead
of having to poke into the block layer internal request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d959f6e 03-Mar-2022 Eric Dumazet <edumazet@google.com>

md: use msleep() in md_notify_reboot()

Calling mdelay(1000) from process context, even while a reboot
is in progress, does not make sense.

Using msleep() allows other threads to make progress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>


# abfc426d 02-Feb-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device to bio_clone_fast

Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 49add496 24-Jan-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device and opf to bio_init

Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 609be106 24-Jan-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device and opf to bio_alloc_bioset

Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assignment. NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0f9650bd 02-Feb-2022 Song Liu <song@kernel.org>

md: fix NULL pointer deref with nowait but no mddev->queue

Leon reported NULL pointer deref with nowait support:

[ 15.123761] device-mapper: raid: Loading target version 1.15.1
[ 15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 15.124192] device-mapper: raid: Choosing default region size of 4MiB
[ 15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060
[ 15.129530] #PF: supervisor write access in kernel mode
[ 15.129533] #PF: error_code(0x0002) - not-present page
[ 15.129535] PGD 0 P4D 0
[ 15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e
[ 15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021
[ 15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20
[ 15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \
00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \
31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00
[ 15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202
[ 15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000
[ 15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d
[ 15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000
[ 15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070
[ 15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001
[ 15.129572] FS: 00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000
[ 15.129575] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0
[ 15.129580] Call Trace:
[ 15.129582] <TASK>
[ 15.129584] md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531]
[ 15.129597] raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63]
[ 15.129604] ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[ 15.129615] dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[ 15.129625] table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[ 15.129635] ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[ 15.129644] ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[ 15.129655] dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[ 15.129663] __x64_sys_ioctl+0x8e/0xd0
[ 15.129667] do_syscall_64+0x5c/0x90
[ 15.129672] ? syscall_exit_to_user_mode+0x23/0x50
[ 15.129675] ? do_syscall_64+0x69/0x90
[ 15.129677] ? do_syscall_64+0x69/0x90
[ 15.129679] ? syscall_exit_to_user_mode+0x23/0x50
[ 15.129682] ? do_syscall_64+0x69/0x90
[ 15.129684] ? do_syscall_64+0x69/0x90
[ 15.129686] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 15.129689] RIP: 0033:0x7fa96ecd559b
[ 15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \
c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \
ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48
[ 15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b
[ 15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003
[ 15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27
[ 15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610
[ 15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530
[ 15.129709] </TASK>

This is caused by a missing mddev->queue check when setting
QUEUE_FLAG_NOWAIT. Fix this by moving the QUEUE_FLAG_NOWAIT logic under the
mddev->queue check.

Fixes: f51d46d0e7cb ("md: add support for REQ_NOWAIT")
Reported-by: Leon Möller <jkhsjdhjs@totally.rip>
Tested-by: Leon Möller <jkhsjdhjs@totally.rip>
Cc: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Song Liu <song@kernel.org>


# 1745e857 06-Jan-2022 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

md: use default_groups in kobj_type

There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the md rdev sysfs code to use default_groups field which
has been the preferred way since commit aa30f47cf666 ("kobject: Add
support for default attribute groups to kobj_type") so that we can soon
get rid of the obsolete default_attrs field.

Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Song Liu <song@kernel.org>


# 0c031fd3 10-Dec-2021 Xiao Ni <xni@redhat.com>

md: Move alloc/free acct bioset in to personality

bioset acct is only needed for raid0 and raid5. Therefore, md_run only
allocates it for raid0 and raid5. However, this does not cover
personality takeover, which may cause uninitialized bioset. For example,
the following repro steps:

mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1
mdadm --wait /dev/md0
mkfs.xfs /dev/md0
mdadm /dev/md0 --grow -l5
mount /dev/md0 /mnt

causes panic like:

[ 225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 225.934903] #PF: supervisor instruction fetch in kernel mode
[ 225.935639] #PF: error_code(0x0010) - not-present page
[ 225.936361] PGD 0 P4D 0
[ 225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
[ 225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706
[ 225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014
[ 225.939922] RIP: 0010:0x0
[ 225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246
[ 225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39
[ 225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800
[ 225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01
[ 225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58
[ 225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040
[ 225.946709] FS: 00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000
[ 225.947814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0
[ 225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 225.951414] Call Trace:
[ 225.951787] <TASK>
[ 225.952120] mempool_alloc+0xe5/0x250
[ 225.952625] ? mempool_resize+0x370/0x370
[ 225.953187] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 225.953862] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 225.954464] ? sched_clock_cpu+0x15/0x120
[ 225.955019] ? find_held_lock+0xac/0xd0
[ 225.955564] bio_alloc_bioset+0x1ed/0x2a0
[ 225.956080] ? lock_downgrade+0x3a0/0x3a0
[ 225.956644] ? bvec_alloc+0xc0/0xc0
[ 225.957135] bio_clone_fast+0x19/0x80
[ 225.957651] raid5_make_request+0x1370/0x1b70
[ 225.958286] ? sched_clock_cpu+0x15/0x120
[ 225.958797] ? __lock_acquire+0x8b2/0x3510
[ 225.959339] ? raid5_get_active_stripe+0xce0/0xce0
[ 225.959986] ? lock_is_held_type+0xd8/0x130
[ 225.960528] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 225.961135] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 225.961703] ? sched_clock_cpu+0x15/0x120
[ 225.962232] ? lock_release+0x27a/0x6c0
[ 225.962746] ? do_wait_intr_irq+0x130/0x130
[ 225.963302] ? lock_downgrade+0x3a0/0x3a0
[ 225.963815] ? lock_release+0x6c0/0x6c0
[ 225.964348] md_handle_request+0x342/0x530
[ 225.964888] ? set_in_sync+0x170/0x170
[ 225.965397] ? blk_queue_split+0x133/0x150
[ 225.965988] ? __blk_queue_split+0x8b0/0x8b0
[ 225.966524] ? submit_bio_checks+0x3b2/0x9d0
[ 225.967069] md_submit_bio+0x127/0x1c0
[...]

Fix this by moving alloc/free of acct bioset to pers->run and pers->free.

While we are on this, properly handle md_integrity_register() error in
raid0_run().

Fixes: daee2024715d (md: check level before create and exit io_acct_set)
Cc: stable@vger.kernel.org
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# dd3dc5f4 25-Dec-2021 Randy Dunlap <rdunlap@infradead.org>

md: fix spelling of "its"

Use the possessive "its" instead of the contraction "it's"
in printed messages.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>


# f51d46d0 21-Dec-2021 Vishal Verma <vverma@digitalocean.com>

md: add support for REQ_NOWAIT

commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support
for checking whether a given bdev supports handling of REQ_NOWAIT or not.
Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable
it for linear target") added support for REQ_NOWAIT for dm. This uses
a similar approach to incorporate REQ_NOWAIT for md based bios.

This patch was tested using t/io_uring tool within FIO. A nvme drive
was partitioned into 2 partitions and a simple raid 0 configuration
/dev/md0 was created.

md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
937423872 blocks super 1.2 512k chunks

Before patch:

$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100

Running top while the above runs:

$ ps -eL | grep $(pidof io_uring)

38396 38396 pts/2 00:00:00 io_uring
38396 38397 pts/2 00:00:15 io_uring
38396 38398 pts/2 00:00:13 iou-wrk-38397

We can see the iou-wrk-38397 io worker thread, which gets created
when io_uring sees that the underlying device (/dev/md0 in this case)
doesn't support nowait.

After patch:

$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100

Running top while the above runs:

$ ps -eL | grep $(pidof io_uring)

38341 38341 pts/2 00:10:22 io_uring
38341 38342 pts/2 00:10:37 io_uring

After applying this patch, we don't see any io worker thread
being created, which indicates that io_uring saw that the
underlying device does support nowait. This is the same behaviour
observed on a dm device, which also supports nowait.

All raid personalities other than raid0 would need changes to the
pieces that involve the make_request fn in order to handle
REQ_NOWAIT correctly.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Song Liu <song@kernel.org>


# 1ebe2e5f 22-Nov-2021 Christoph Hellwig <hch@lst.de>

block: remove GENHD_FL_EXT_DEVT

All modern drivers can support extra partitions using the extended
dev_t. In fact except for the ioctl method drivers never even see
partitions in normal operation.

So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicitly disallow them using
GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 07641b5f 15-Nov-2021 zhangyue <zhangyue1@kylinos.cn>

md: fix double free of mddev->private in autorun_array()

In drivers/md/md.c, a double free may occur when the function
autorun_array() is called.

In function autorun_array(), when the function do_md_run() returns an
error, the function do_md_stop() will be called.

The function do_md_run() called function md_run(), but in function
md_run(), the pointer mddev->private may be freed.

The function do_md_stop() calls the function __md_stop(), and in
function __md_stop() the pointer mddev->private is freed again
without a NULL check.

As a result the pointer mddev->private can be freed twice, so it
needs to be checked for NULL before it is freed.
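
The guard described above can be illustrated outside the kernel. This
is a minimal userspace sketch, not the md code: struct fake_mddev,
fake_run_fail() and fake_stop() are hypothetical stand-ins, and
free_calls exists only to make the behaviour observable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mddev; only the field of interest. */
struct fake_mddev {
	void *private;
};

static int free_calls;	/* counts real frees, for illustration only */

/* Free the personality data once and clear the pointer. */
static void fake_free_private(struct fake_mddev *mddev)
{
	if (!mddev->private)
		return;		/* already released on an earlier path */
	free(mddev->private);
	mddev->private = NULL;
	free_calls++;
}

/* Models a run path that fails after allocating and unwinds itself. */
static int fake_run_fail(struct fake_mddev *mddev)
{
	mddev->private = malloc(16);
	if (!mddev->private)
		return -1;
	fake_free_private(mddev);	/* error unwinding */
	return -1;
}

/* Models the stop path that the caller invokes after a failed run. */
static void fake_stop(struct fake_mddev *mddev)
{
	fake_free_private(mddev);	/* safe: the pointer is NULL by now */
}
```

Calling fake_stop() after fake_run_fail() is harmless here because the
first free clears the pointer and the second path tests it.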

Signed-off-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 55df1ce0 16-Nov-2021 Markus Hochholdinger <markus@hochholdinger.net>

md: fix update super 1.0 on rdev size change

The superblock of version 1.0 doesn't get moved to the new position on a
device size change. This leaves the rdev without a superblock at a known
position, so the raid can't be re-assembled.

The line was removed by mistake and is re-added by this patch.

Fixes: d9c0fa509eaf ("md: fix max sectors calculation for super 1.0")
Cc: stable@vger.kernel.org
Signed-off-by: Markus Hochholdinger <markus@hochholdinger.net>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 8b9e2291 13-Oct-2021 Xiao Ni <xni@redhat.com>

md: update superblock after changing rdev flags in state_store

When the in memory flag is changed, we need to persist the change in the
rdev superblock flags. This is needed for "writemostly" and "failfast".

Reviewed-by: Li Feng <fengli@smartx.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 54679486 04-Oct-2021 Guoqing Jiang <guoqing.jiang@linux.dev>

md: remove unused argument from md_new_event

Actually, mddev is not used by md_new_event.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7ad10691 01-Sep-2021 Christoph Hellwig <hch@lst.de>

md: properly unwind when failing to add the kobject in md_alloc

Add proper error handling to delete the gendisk when failing to add
the md kobject and clean up the error unwinding in general.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 94f3cd7d 01-Sep-2021 Christoph Hellwig <hch@lst.de>

md: extend disks_mutex coverage

disks_mutex is intended to serialize md_alloc. Extend it to also cover
the kobject_uevent call and getting the sysfs dirent to help reduce
error handling complexity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 51238e7f 01-Sep-2021 Christoph Hellwig <hch@lst.de>

md: add the bitmap group to the default groups for the md kobject

Replace the deprecated default_attrs with the default_groups mechanism,
and add the always visible bitmap group to the groups created at
kobject_add time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9be68dd7 01-Sep-2021 Luis Chamberlain <mcgrof@kernel.org>

md: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

We just do the unwinding of what was not done before, and are
sure to unlock prior to bailing.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0fe80347 17-Oct-2021 Christoph Hellwig <hch@lst.de>

md: use bdev_nr_sectors instead of open coding it

Use the proper helper to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3e08773c 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

- the cookie construction can be made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fe45e630 20-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move integrity handling out of <linux/blkdev.h>

Split the integrity/metadata handling definitions out into a new header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b81e0c23 20-Sep-2021 Christoph Hellwig <hch@lst.de>

block: drop unused includes in <linux/genhd.h>

Drop various includes not actually used in genhd.h itself, and
move the remaining includes closer together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7df835a3 01-Sep-2021 Christoph Hellwig <hch@lst.de>

md: fix a lock order reversal in md_alloc

Commit b0140891a8cea3 ("md: Fix race when creating a new md device.")
not only moved assigning mddev->gendisk before calling add_disk, which
fixes the races described in the commit log, but also added a
mddev->open_mutex critical section over add_disk and creation of the
md kobj. Adding a kobject after add_disk is racy vs deleting the gendisk
right after adding it, but md already prevents against that by holding
a mddev->active reference.

On the other hand taking this lock added a lock order reversal with what
is now disk->open_mutex (used to be bdev->bd_mutex when the commit was
added) for partition devices, which need that lock for the internal open
for the partition scan, and a recent commit also takes it for
non-partitioned devices, leading to further lockdep splatter.

Fixes: b0140891a8ce ("md: Fix race when creating a new md device.")
Fixes: d62633873590 ("block: support delayed holder registration")
Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# de3ea66e 03-Jun-2021 Guoqing Jiang <jgq516@gmail.com>

md: add comments in md_integrity_register

Given it is not obvious for the error handling, let's try to add some
comments here to make it clear.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>


# daee2024 03-Jun-2021 Guoqing Jiang <jgq516@gmail.com>

md: check level before create and exit io_acct_set

The bio_set (io_acct_set) is used by personalities to clone bio and
trace the timestamp of bio. Some personalities such as raid1/10 don't
need the bio_set, so add check to not create it unconditionally.

Also update the comment for md_account_bio to make it more clear.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>


# c32dc040 28-May-2021 Rikard Falkeborn <rikard.falkeborn@gmail.com>

md: Constify attribute_group structs

The attribute_group structs are never modified, they're only passed to
sysfs_create_group() and sysfs_remove_group(). Make them const to allow
the compiler to put them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>


# 10764815 25-May-2021 Guoqing Jiang <jgq516@gmail.com>

md: add io accounting for raid0 and raid5

We introduce a new bioset (io_acct_set) for raid0 and raid5 since they
don't have their own clone infrastructure for io accounting. The bioset
is added to mddev instead of to the raid0 and raid5 layers, because this
way we can put common functions in md.h and reuse them in raid0 and raid5.

Also struct md_io_acct is added accordingly which includes io start_time,
the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct
to get related io status.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>


# ad3fc798 25-May-2021 Guoqing Jiang <jgq516@gmail.com>

md: revert io stats accounting

The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause
a double fault problem per the report [1], and it is also not correct to
change ->bi_end_io if md doesn't own it, so let's revert it.

The io stats accounting will be reimplemented in later commits.

[1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t

Fixes: 41d2d848e5c0 ("md: improve io stats accounting")
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>


# 0f1d2e06 20-May-2021 Christoph Hellwig <hch@lst.de>

md: convert to blk_alloc_disk/blk_cleanup_disk

Convert the md driver to use the blk_alloc_disk and blk_cleanup_disk
helpers to simplify gendisk and request_queue allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f7c7a2f9 08-Apr-2021 Heming Zhao <heming.zhao@suse.com>

md-cluster: fix use-after-free issue when removing rdev

md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.
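
Why the "safe" iterator matters can be shown with a small userspace
sketch. This is not the kernel list API: struct fake_rdev, the
fake_rdev_for_each_safe() macro and fake_kick_faulty() are hypothetical
stand-ins that only mimic the idea of saving the successor before the
loop body may free the current entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct md_rdev on a singly linked list. */
struct fake_rdev {
	int faulty;
	struct fake_rdev *next;
};

/*
 * Safe iteration: remember the successor before the body runs, so the
 * body may unlink and free the current node.
 */
#define fake_rdev_for_each_safe(pos, tmp, head) \
	for ((pos) = (head); (pos) && ((tmp) = (pos)->next, 1); (pos) = (tmp))

/* Push a new node at the front of the list; returns the new head. */
static struct fake_rdev *fake_rdev_add(struct fake_rdev *head, int faulty)
{
	struct fake_rdev *rdev = malloc(sizeof(*rdev));

	if (!rdev)
		return head;
	rdev->faulty = faulty;
	rdev->next = head;
	return rdev;
}

/* Remove and free every faulty node; returns how many were removed. */
static int fake_kick_faulty(struct fake_rdev **head)
{
	struct fake_rdev *rdev, *tmp, **link = head;
	int kicked = 0;

	fake_rdev_for_each_safe(rdev, tmp, *head) {
		if (rdev->faulty) {
			*link = tmp;	/* unlink before freeing */
			free(rdev);
			kicked++;
		} else {
			link = &rdev->next;
		}
	}
	return kicked;
}
```

A plain iterator would read rdev->next after free(rdev), which is the
use-after-free pattern this commit describes.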

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
echo ==== $i `date` ====;

mdadm -Ss && ssh ${node2} "mdadm -Ss"
wipefs -a /dev/sda /dev/sdb

mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
/dev/sdb --assume-clean
ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
mdadm --wait /dev/md0
ssh ${node2} "mdadm --wait /dev/md0"

mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS: 0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
raid1d+0x5c/0xd40 [raid1]
? finish_task_switch+0x75/0x2a0
? lock_timer_base+0x67/0x80
? try_to_del_timer_sync+0x4d/0x80
? del_timer_sync+0x41/0x50
? schedule_timeout+0x254/0x2d0
? md_start_sync+0xe0/0xe0 [md_mod]
? md_thread+0x127/0x160 [md_mod]
md_thread+0x127/0x160 [md_mod]
? wait_woken+0x80/0x80
kthread+0x10d/0x130
? kthread_park+0xa0/0xa0
ret_from_fork+0x1f/0x40
```

Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: Gang He <ghe@suse.com>
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>


# 0d809b38 12-Apr-2021 Christoph Hellwig <hch@lst.de>

md: do not return existing mddevs from mddev_find_or_alloc

Instead of returning an existing mddev, just for it to be discarded
later directly return -EEXIST. Rename the function to mddev_alloc now
that it doesn't find an existing mddev.
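
The split between a pure lookup and an allocation that fails with
-EEXIST can be sketched in userspace as follows. All names here
(fake_mddev_find(), fake_mddev_alloc(), the fixed-size table) are
hypothetical, and the locking that the real code needs around the
check and insert is only noted in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define FAKE_MAX_UNITS 8

/* Hypothetical stand-in for struct mddev. */
struct fake_mddev {
	int unit;
};

/* Hypothetical stand-in for the all_mddevs list, indexed by unit. */
static struct fake_mddev *fake_all_mddevs[FAKE_MAX_UNITS];

/* Lookup only: never creates a new object. */
static struct fake_mddev *fake_mddev_find(int unit)
{
	if (unit < 0 || unit >= FAKE_MAX_UNITS)
		return NULL;
	return fake_all_mddevs[unit];
}

/* Allocate a new object; fail with -EEXIST if the unit is taken. */
static int fake_mddev_alloc(int unit, struct fake_mddev **out)
{
	struct fake_mddev *new;

	if (unit < 0 || unit >= FAKE_MAX_UNITS)
		return -EINVAL;
	new = malloc(sizeof(*new));	/* speculative allocation */
	if (!new)
		return -ENOMEM;
	/* real code would hold a lock across this check and insert */
	if (fake_all_mddevs[unit]) {
		free(new);
		return -EEXIST;
	}
	new->unit = unit;
	fake_all_mddevs[unit] = new;
	*out = new;
	return 0;
}
```

The caller never receives an object it did not create, so there is
nothing to discard on the "already exists" path.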

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# d144fe6f 12-Apr-2021 Christoph Hellwig <hch@lst.de>

md: refactor mddev_find_or_alloc

Allocate the new mddev first speculatively, which greatly simplifies
the code flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 85c8c3c1 12-Apr-2021 Christoph Hellwig <hch@lst.de>

md: factor out a mddev_alloc_unit helper from mddev_find

Split out a self contained helper to find a free minor for the md
"unit" number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 65aa97c4 03-Apr-2021 Christoph Hellwig <hch@lst.de>

md: split mddev_find

Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find_or_alloc
that deals with finding or allocating a mddev.

This turns out to fix this bug reported by Zhao Heming.

----------------------------- snip ------------------------------
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With the current code logic, it is very easy to trigger a
soft lockup in a non-preempt env.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

Triggers about once in 10 runs of the script below:

```
1 node1="15sp3-mdcluster1"
2 node2="15sp3-mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..100}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
```

I use an mdcluster env to trigger the soft lockup, but it isn't an
mdcluster-specific bug. Stopping an md array in an mdcluster env does more
work than for a non-cluster array, which leaves enough time/gap to allow
the kernel to run md_open.

*** stack ***

```
ID: 2831 TASK: ffff8dd7223b5040 CPU: 0 COMMAND: "mdadm"
#0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
#1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
#2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
#3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
#4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
#5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
#6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
#7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
#8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
#9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c

or:
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" runs, the stop action wakes
up "mdadm --monitor". The "--monitor" daemon immediately reads
/proc/mdstat. At this point the mddev still exists in the kernel, so
/proc/mdstat still shows the md device, which makes "mdadm --monitor"
open /dev/md0.

The earlier "mdadm -Ss" is a removing action, while the "mdadm --monitor"
open triggers md_open, which is a creating action. The two race with
each other.

```
<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
```

In a non-preempt kernel, <thread 2> occupies the current CPU, so the
mddev_delayed_delete work queued by <thread 1> cannot be scheduled.

A preempt kernel can hit the same race, but there the kernel does not
let one thread run on a CPU indefinitely. After <thread 2> has run for
some time, the later "mdadm -A" (see line 13 of the script above) calls
md_alloc to allocate a new gendisk for the mddev. The md_open check
"if (mddev->gendisk != bdev->bd_disk)" then no longer fires, md_open
returns 0 to the caller, and the soft lockup ends.
------------------------------ snip ------------------------------

Cc: stable@vger.kernel.org
Fixes: d3374825ce57 ("md: make devices disappear when they are no longer needed.")
Reported-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 8b57251f 03-Apr-2021 Christoph Hellwig <hch@lst.de>

md: factor out a mddev_find_locked helper from mddev_find

Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".

Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 6a4db2a6 02-Apr-2021 Zhao Heming <heming.zhao@suse.com>

md: md_open returns -EBUSY when entering racing area

commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With the current code logic, it is very easy to trigger a
soft lockup in a non-preempt env.

This patch changes the md_open return value from -ERESTARTSYS to -EBUSY,
which breaks the infinite retry when md_open enters the racing area.
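
The effect of the return code on a retrying caller can be modelled in
userspace. This is only a sketch under assumptions: fake_md_open() and
fake_blkdev_get() are hypothetical, FAKE_ERESTARTSYS mirrors the
kernel-internal value 512, and max_tries exists purely so that the
example terminates.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal restart code; userspace <errno.h> does not define it. */
#define FAKE_ERESTARTSYS 512

/* Hypothetical open handler that always loses the race and returns -err. */
static int fake_md_open(int err)
{
	return -err;
}

/*
 * Hypothetical caller that restarts the open while it sees the restart
 * code. max_tries only exists so that the sketch terminates.
 */
static int fake_blkdev_get(int err, int max_tries, int *tries)
{
	int ret;

	*tries = 0;
	do {
		ret = fake_md_open(err);
		(*tries)++;
	} while (ret == -FAKE_ERESTARTSYS && *tries < max_tries);
	return ret;
}
```

With the restart code the loop spins until the artificial cap is hit;
with EBUSY it gives up after the first attempt and reports the error.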

This patch only partly fixes the soft lockup issue. The full fix needs
mddev_find to be split into two functions: mddev_find &
mddev_find_or_alloc. md_open should then call the new mddev_find (which
only does the lookup).

For more detail, please refer to Christoph's "split mddev_find" patch
in later commits.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

The script below triggers the problem almost every time:

```
1 node1="mdcluster1"
2 node2="mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..10}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
```

I use an mdcluster env to trigger the soft lockup, but it isn't an
mdcluster-specific bug. Stopping an md array in an mdcluster env does more
work than for a non-cluster array, which leaves enough time/gap to allow
the kernel to run md_open.

*** stack ***

```
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" runs, the stop action wakes
up "mdadm --monitor". The "--monitor" daemon immediately reads
/proc/mdstat. At this point the mddev still exists in the kernel, so
/proc/mdstat still shows the md device, which makes "mdadm --monitor"
open /dev/md0.

The earlier "mdadm -Ss" is a removing action, while the "mdadm --monitor"
open triggers md_open, which is a creating action. The two race with
each other.

```
<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
```

In a non-preempt kernel, <thread 2> occupies the current CPU, so the
mddev_delayed_delete work queued by <thread 1> cannot be scheduled.

A preempt kernel can hit the same race, but there the kernel does not
let one thread run on a CPU indefinitely. After <thread 2> has run for
some time, the later "mdadm -A" (see line 13 of the script above) calls
md_alloc to allocate a new gendisk for the mddev. The md_open check
"if (mddev->gendisk != bdev->bd_disk)" then no longer fires, md_open
returns 0 to the caller, and the soft lockup ends.

Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>


# 7abfabaf 17-Mar-2021 Jan Glauber <jglauber@digitalocean.com>

md: Fix missing unused status line of /proc/mdstat

Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.

So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
like: unused devices: <none>

Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# cf78408f 04-Feb-2021 Xiao Ni <xni@redhat.com>

md: add md_submit_discard_bio() for submitting discard bio

Move this logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# a42e0d70 01-Feb-2021 Christoph Hellwig <hch@lst.de>

md: use rdev_read_only in restart_array

Make the read-only check in restart_array identical to the other two
read-only checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d7a47838 01-Feb-2021 Christoph Hellwig <hch@lst.de>

md: check for NULL ->meta_bdev before calling bdev_read_only

->meta_bdev is optional and not set for most arrays. Add a
rdev_read_only helper that calls bdev_read_only for both devices
in a safe way.

Fixes: 6f0d9689b670 ("block: remove the NULL bdev check in bdev_read_only")
Reported-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6a596569 26-Jan-2021 Christoph Hellwig <hch@lst.de>

md: remove md_bio_alloc_sync

md_bio_alloc_sync is never called with a NULL mddev, and ->sync_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 32637385 26-Jan-2021 Christoph Hellwig <hch@lst.de>

md: simplify sync_page_io

Use an on-stack bio and biovec for the single page synchronous I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a78f18da 26-Jan-2021 Christoph Hellwig <hch@lst.de>

md: remove bio_alloc_mddev

bio_alloc_mddev is never called with a NULL mddev, and ->bio_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 99dfc43e 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: use ->bi_bdev for bio based I/O accounting

Rework the I/O accounting for bio based drivers to use ->bi_bdev. This
means all drivers can now simply use bio_start_io_acct to start
accounting, and it will take partitions into account automatically. To
end I/O account either bio_end_io_acct can be used if the driver never
remaps I/O to a different device, or bio_end_io_acct_remapped if the
driver did remap the I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 309dca30 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# dc5d17a3 09-Dec-2020 Xiao Ni <xni@redhat.com>

md: Set prev_flush_start and flush_bio in an atomic way

One customer reported a crash caused by a flush request. A warning is
triggered before the crash.

/* new request after previous flush is completed */
if (ktime_after(req_start, mddev->prev_flush_start)) {
WARN_ON(mddev->flush_bio);
mddev->flush_bio = bio;
bio = NULL;
}

The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
flush_bio in md_flush_request. But there is no lock protection in
md_submit_flush_data. It can set flush_bio to NULL first because of
compiler reordering write instructions.

For example, flush bio1 sets flush_bio to NULL first in
md_submit_flush_data. An interrupt, or vmware causing an extended stall,
happens between updating flush_bio and prev_flush_start. Because flush_bio
is NULL, flush bio2 can get the lock and be submitted to the underlying
disks. Then flush bio1 updates prev_flush_start after the interrupt or
extended stall.

Then flush bio3 enters md_flush_request. Its start time req_start is
later than prev_flush_start, and flush_bio is not NULL (flush bio2 hasn't
finished), so the WARN_ON triggers. Then INIT_WORK is called again.
INIT_WORK() re-initializes the list pointers in the work_struct, which
can result in a corrupted work list and the work_struct being queued a
second time. With the work list corrupted, invalid work items can end up
being used, causing a crash in process_one_work.

We need to make sure that only one flush bio is handled at a time.
So take the spin lock in md_submit_flush_data so that prev_flush_start
and flush_bio are updated atomically.
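
The idea of updating both fields inside one critical section can be
sketched in userspace. This is a simplified model, not the md code: it
uses a pthread mutex instead of a spin lock, struct fake_flush_state
and both functions are hypothetical, and the waiting that the kernel
does is reduced to a -1 return value.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the two mddev fields that change together. */
struct fake_flush_state {
	pthread_mutex_t lock;
	const void *flush_bio;		/* flush in flight, or NULL */
	long prev_flush_start;		/* start time of last finished flush */
};

/* Completion side: publish both fields inside one critical section. */
static void fake_flush_done(struct fake_flush_state *s, long start)
{
	pthread_mutex_lock(&s->lock);
	s->prev_flush_start = start;
	s->flush_bio = NULL;
	pthread_mutex_unlock(&s->lock);
}

/*
 * Request side: returns 1 if the caller became the in-flight flush,
 * 0 if a later-started flush already covers it, -1 if it must wait.
 */
static int fake_flush_request(struct fake_flush_state *s, const void *bio,
			      long req_start)
{
	int ret;

	pthread_mutex_lock(&s->lock);
	if (s->flush_bio && req_start >= s->prev_flush_start) {
		ret = -1;		/* a flush is in flight: wait */
	} else if (req_start > s->prev_flush_start) {
		s->flush_bio = bio;	/* new request after previous flush */
		ret = 1;
	} else {
		ret = 0;		/* already covered */
	}
	pthread_mutex_unlock(&s->lock);
	return ret;
}
```

Because the completion side changes both fields under the same lock
that the request side takes, a requester can never observe flush_bio
cleared while prev_flush_start is still stale.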

Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 57a0f3a8 09-Dec-2020 Song Liu <songliubraving@fb.com>

Revert "md: add md_submit_discard_bio() for submitting discard bio"

This reverts commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0.

Matthew Ruffell reported data corruption in raid10 due to the changes
in discard handling [1]. Revert these changes before we find a proper fix.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 1c02fca6 03-Dec-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue argument to the block_bio_remap tracepoint

The request_queue can trivially be derived from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8446fe92 24-Nov-2020 Christoph Hellwig <hch@lst.de>

block: switch partition lookup to use struct block_device

Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cb8432d6 26-Nov-2020 Christoph Hellwig <hch@lst.de>

block: allocate struct hd_struct as part of struct bdev_inode

Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.

Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct. The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8d65269f 17-Nov-2020 Christoph Hellwig <hch@lst.de>

block: add a bdev_kobj helper

Add a little helper to find the kobject for a struct block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bca5b065 19-Nov-2020 Zhao Heming <heming.zhao@suse.com>

md/cluster: fix deadlock when node is doing resync job

md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can send messages
exclusively. While sending a message, a node can concurrently receive
messages from another node. When a node runs a resync job, grabbing
token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------        --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                            b.
                            md_do_sync
                             resync_info_update
                               send RESYNCING
                                + set MD_CLUSTER_SEND_LOCK
                                + wait for holding token_lockres:EX

                            c.
                            mdadm /dev/md0 --remove /dev/sdg
                             + held reconfig_mutex
                             + send REMOVE
                                + wait_event(MD_CLUSTER_SEND_LOCK)

                            d.
                            recv_daemon //METADATA_UPDATED from A
                             process_metadata_update
                              + (mddev_trylock(mddev) ||
                                 MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                //this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED.
This blocks other nodes from sending messages.

b. B runs a sync job, which sends RESYNCING at intervals.
This blocks while waiting to hold the token_lockres:EX lock.

c. B runs "mdadm --remove", which sends REMOVE.
This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B receives the METADATA_UPDATED msg that A sent in step <a>.
This is blocked by step <c>: the mddev lock is already held, so
wait_event cannot take the mddev lock. (btw,
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)

There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg").
In that commit, step c is "update sb". In this patch, step c is
"mdadm --remove".

To fix this issue, we can follow the approach of
metadata_update_start(), which grabs lock_token in the same way.
lock_comm can use the same steps to avoid the deadlock, by moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
This slightly enlarges the window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe and breaks the deadlock.

Repro steps (I triggered it only 3 times in hundreds of test runs):

Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB in size.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

The test script hangs when executing "mdadm --remove".

```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D 0 5329 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? _cond_resched+0x2d/0x40
? schedule+0x4a/0xb0
? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
? wait_woken+0x80/0x80
? process_recvd_msg+0x113/0x1d0 [md_cluster]
? recv_daemon+0x9e/0x120 [md_cluster]
? md_thread+0x94/0x160 [md_mod]
? wait_woken+0x80/0x80
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40

mdadm D 0 5423 1 0x00004004
Call Trace:
__schedule+0x1f6/0x560
? __schedule+0x1fe/0x560
? schedule+0x4a/0xb0
? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
? wait_woken+0x80/0x80
? remove_disk+0x4f/0x90 [md_cluster]
? hot_remove_disk+0xb1/0x1b0 [md_mod]
? md_ioctl+0x50c/0xba0 [md_mod]
? wait_woken+0x80/0x80
? blkdev_ioctl+0xa2/0x2a0
? block_ioctl+0x39/0x40
? ksys_ioctl+0x82/0xc0
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x5f/0x150
? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync D 0 5425 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? schedule+0x4a/0xb0
? dlm_lock_sync+0xa1/0xd0 [md_cluster]
? wait_woken+0x80/0x80
? lock_token+0x2d/0x90 [md_cluster]
? resync_info_update+0x95/0x100 [md_cluster]
? raid1_sync_request+0x7d3/0xa40 [raid1]
? md_do_sync.cold+0x737/0xc8f [md_mod]
? md_thread+0x94/0x160 [md_mod]
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
```

Finally, thanks to Xiao for the solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# a8da01f7 19-Nov-2020 Zhao Heming <heming.zhao@suse.com>

md/cluster: block reshape with remote resync job

A reshape request should be blocked while a resync job is ongoing. In a
cluster environment, a node can start a resync job even if the resync
command was not executed on it, e.g., the user executes "mdadm --grow" on
node A, and sometimes node B starts the resync job. However, the current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, the user can execute "mdadm --grow" successfully
on the local node, while the remote node refuses to do the reshape job
because it is doing a resync job. This inconsistent handling causes the
array to enter an unexpected state. If the user does not notice the issue
and continues executing mdadm commands, the array eventually stops working.

Fix this issue by blocking the reshape request. When a node executes
"--grow" and detects an ongoing resync, it should stop and report an error
to the user.

The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
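
The check described above can be modeled in a few lines. This is a hedged user-space sketch in Python, not the kernel change itself: `ReshapeBlocked` and `request_reshape` are hypothetical names, and the two boolean arguments stand in for whatever local and cluster-wide state the real code consults.

```python
class ReshapeBlocked(Exception):
    """Raised when a reshape request must be refused (hypothetical name)."""


def request_reshape(new_raid_disks, local_resync_active, remote_resync_active):
    """Model of the rule: refuse to reshape while ANY node is resyncing.

    The arguments are assumptions made for illustration; the commit only
    says that checking the local recovery status alone is incomplete.
    """
    if local_resync_active or remote_resync_active:
        # Stop early and report the error to the caller instead of letting
        # the local node succeed while a remote node refuses.
        raise ReshapeBlocked("resync in progress, reshape refused")
    return new_raid_disks
```

The point of the sketch is only that the remote state takes part in the decision, so both nodes reach the same answer.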


# a23f2aae 10-Nov-2020 Pankaj Gupta <pankaj.gupta@cloud.ionos.com>

md: use current request time as base for ktime comparisons

The request coalescing logic uses 'prev_flush_start' as the base against
which the current request start time is compared. 'prev_flush_start' is
updated in another context.

This patch changes the ktime comparison base to 'req_start' for
better readability of the code.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 204d1a64 10-Nov-2020 Pankaj Gupta <pankaj.gupta@cloud.ionos.com>

md: add comments in md_flush_request()

Request coalescing logic is dependent on flush time update in other
context. This patch adds comments to understand the code flow better.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 81ba3c24 10-Nov-2020 Pankaj Gupta <pankaj.gupta@cloud.ionos.com>

md: improve variable names in md_flush_request()

This patch improves readability by using better variable names
in flush request coalescing logic.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# c731b84b 21-Oct-2020 Dae R. Jeong <dae.r.jeong@kaist.ac.kr>

md: fix a warning caused by a race between concurrent md_ioctl()s

Syzkaller reports a warning as below.
WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
...
Call Trace:
...
RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
RSP: 0018:ffff888096027950 EFLAGS: 00010293
RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
__blkdev_driver_ioctl block/ioctl.c:304 [inline]
blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
block_ioctl+0xee/0x130 fs/block_dev.c:1930
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is caused by a race between two concurrent md_ioctl()s closing
the array.
CPU1 (md_ioctl())                       CPU2 (md_ioctl())
------                                  ------
set_bit(MD_CLOSING, &mddev->flags);
did_set_md_closing = true;
                                        WARN_ON_ONCE(test_bit(MD_CLOSING,
                                                     &mddev->flags));
if(did_set_md_closing)
    clear_bit(MD_CLOSING, &mddev->flags);

Fix the warning by returning immediately if the MD_CLOSING bit is set
in &mddev->flags which indicates that the array is being closed.

Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr>
Signed-off-by: Song Liu <songliubraving@fb.com>
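
The race above arises because one caller sets MD_CLOSING while another merely asserts that it is clear. A minimal user-space sketch in Python of the "return immediately if the bit is already set" idea follows. It assumes an atomic test-and-set primitive is available; `Flags`, `begin_close` and the bit value are hypothetical names for this model and are not taken from md.c.

```python
import threading

MD_CLOSING = 0  # bit index; the value is an assumption for this model


class Flags:
    """Tiny model of a flags word with an atomic test-and-set."""

    def __init__(self):
        self._bits = 0
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self, bit):
        """Set the bit and report whether it was already set."""
        with self._lock:
            was_set = bool(self._bits & (1 << bit))
            self._bits |= 1 << bit
            return was_set

    def clear(self, bit):
        with self._lock:
            self._bits &= ~(1 << bit)


def begin_close(flags):
    """Return True if this caller owns the close, False if another does."""
    if flags.test_and_set(MD_CLOSING):
        return False  # array is already being closed: bail out right away
    return True
```

Only the caller that gets `True` should later clear the bit, which mirrors the role of `did_set_md_closing` in the diagram.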


# 2c247c51 16-Nov-2020 Christoph Hellwig <hch@lst.de>

md: use set_capacity_and_notify

Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 28144f99 29-Oct-2020 Christoph Hellwig <hch@lst.de>

md: use __register_blkdev to allocate devices on demand

Use the simpler mechanism attached to major_name to allocate a md device
when a currently unregistered minor is accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 118cf084 03-Nov-2020 Christoph Hellwig <hch@lst.de>

md: implement ->set_read_only to hook into BLKROSET processing

Implement the ->set_read_only method instead of parsing the actual
ioctl command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cf0b9b48 07-Oct-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: fix the checking of wrong work queue

It should check md_rdev_misc_wq instead of md_misc_wq.

Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")
Cc: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 4245e52d 02-Sep-2020 Christoph Hellwig <hch@lst.de>

md: don't detour through bd_contains for the gendisk

bd_disk is set on all block devices, including those for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 61a27e1f 02-Sep-2020 Christoph Hellwig <hch@lst.de>

md: compare bd_disk instead of bd_contains

To check for partitions of the same disk bd_contains works as well, but
bd_disk is way more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2628089b 24-Aug-2020 Xiao Ni <xni@redhat.com>

md: add md_submit_discard_bio() for submitting discard bio

Move this logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 00fe60ea 31-Aug-2020 Song Liu <songliubraving@fb.com>

md: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct

This enables proper statistics in /proc/diskstats for md partitions.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 818077d6 08-Sep-2020 Christoph Hellwig <hch@lst.de>

md: use bdev_check_media_change

The md driver does not have a ->revalidate_disk method, so it can just
use bdev_check_media_change without any additional changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 659e56ba 01-Sep-2020 Christoph Hellwig <hch@lst.de>

block: add a new revalidate_disk_size helper

revalidate_disk is a relatively awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e8efa9b8 04-Aug-2020 Junxiao Bi <junxiao.bi@oracle.com>

md: get sysfs entry after redundancy attr group create

"sync_completed" and "degraded" belong to the redundancy attr group,
which did not exist yet when the md device was created.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: e1a86dbbbd6a ("md: fix deadlock causing by sysfs_notify")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# b3db8a21 27-Jul-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: print errno in super_written

It is better to print errno instead of bi_status.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# ec164d07 27-Jul-2020 Sebastian Parschauer <s.parschauer@gmx.de>

md: register new md sysfs file 'uuid' read-only

Report the UUID of the MD array in the following format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

This is useful if you don't want to wait for udev to identify the array.
The fixed format also makes it easy for scripts to monitor.

Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de>
[Guoqing: mention the change in md.rst]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
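
For scripts that consume this value, the 8-4-4-4-12 layout shown above can be produced or checked as follows. This is a hedged Python sketch: `format_md_uuid` is a hypothetical helper, and printing the 16 raw bytes in their stored order is an assumption about byte order, not something the commit message states.

```python
def format_md_uuid(raw):
    """Render 16 raw UUID bytes as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

    Assumes the bytes are printed in the order they are stored.
    """
    if len(raw) != 16:
        raise ValueError("an md array UUID is 16 bytes long")
    h = raw.hex()
    # Group the 32 hex digits as 8-4-4-4-12.
    return "-".join((h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]))
```

A monitoring script could compare the string read from the sysfs file against a value formatted this way.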


# d9c0fa50 30-Jun-2020 Xiao Ni <xni@redhat.com>

md: fix max sectors calculation for super 1.0

To grow the size of a super 1.0 raid array, it is necessary to check the
device's max usable size.

Currently rdev->sectors is used as the max usable size. If a disk is 500GB
and the raid device only uses 100GB of it, rdev->sectors can't tell the
real max usable size. The max usable size should be

dev_size-(superblock_size+bitmap_size+badblock_size).

Also, remove unnecessary sb_start update in super_1_rdev_size_change().

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
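
The formula above is plain arithmetic and can be sanity-checked with a small sketch. The Python helper below is illustrative only: `max_usable_sectors` is a hypothetical name, and all sizes are assumed to be expressed in the same unit (for example 512-byte sectors).

```python
def max_usable_sectors(dev_size, superblock_size, bitmap_size, badblock_size):
    """dev_size - (superblock_size + bitmap_size + badblock_size).

    All arguments must use the same unit; the names follow the formula in
    the commit message, not any field names in md.c.
    """
    reserved = superblock_size + bitmap_size + badblock_size
    if reserved > dev_size:
        raise ValueError("metadata areas do not fit on the device")
    return dev_size - reserved
```

With made-up numbers, a device of 1000 units with 8 + 16 + 8 units of metadata leaves 968 usable units, regardless of how much of the device the array currently uses.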


# edee9dfe 20-Jul-2020 Zhao Heming <heming.zhao@suse.com>

md-cluster: fix rmmod issue when md_cluster convert bitmap to none

update_array_info misses calling module_put when removing cluster bitmap.

steps to reproduce:
```
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
/dev/sdb
mdadm: array /dev/md0 started.
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster 28672 1
dlm 212992 14 md_cluster
configfs 57344 2 dlm
raid1 53248 1
md_mod 176128 2 raid1,md_cluster
node1 # mdadm -G /dev/md0 -b none
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster 28672 1 <== should be zero
dlm 212992 9 md_cluster
configfs 57344 2 dlm
raid1 53248 1
md_mod 176128 2 raid1,md_cluster
node1 # mdadm -G /dev/md0 -b clustered
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster 28672 2 <== increase
dlm 212992 14 md_cluster
configfs 57344 2 dlm
raid1 53248 1
md_mod 176128 2 raid1,md_cluster
node1 # mdadm -G /dev/md0 -b none
node1 # mdadm -G /dev/md0 -b clustered
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster 28672 3 <== increase
dlm 212992 14 md_cluster
configfs 57344 2 dlm
raid1 53248 1
md_mod 176128 2 raid1,md_cluster
```

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
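
The lsmod output above shows a reference count that only ever grows. The following Python sketch models the balanced get/put behaviour that the fix aims for. It is a toy model under stated assumptions: `Module`, `Array` and `set_bitmap` are hypothetical names, and the real reference is taken and dropped by kernel module helpers, not by this code.

```python
class Module:
    """Toy stand-in for a kernel module with a use count."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0

    def get(self):
        self.refcount += 1

    def put(self):
        if self.refcount == 0:
            raise RuntimeError("unbalanced put on " + self.name)
        self.refcount -= 1


class Array:
    """Toy md array that pins md_cluster only while its bitmap is clustered."""

    def __init__(self, md_cluster):
        self.md_cluster = md_cluster
        self.bitmap = "none"

    def set_bitmap(self, kind):
        if kind == "clustered" and self.bitmap != "clustered":
            self.md_cluster.get()
        elif kind != "clustered" and self.bitmap == "clustered":
            # This is the step the commit message says was missing.
            self.md_cluster.put()
        self.bitmap = kind
```

With the put in place, switching between "clustered" and "none" any number of times leaves the count at 1 or 0 instead of climbing.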


# 7c9d5c54 20-Jul-2020 Zhao Heming <heming.zhao@suse.com>

md-cluster: fix safemode_delay value when converting to clustered bitmap

When an array is converted to a clustered bitmap, safe_mode_delay is not
cleared, and vice versa: /sys/block/mdX/md/safe_mode_delay keeps its
original value after the bitmap type changes. In safe_delay_store(), the
code forbids setting mddev->safemode_delay when the array is clustered. So
in a cluster-md environment, the expected safemode_delay value is 0.

Reproducible steps:
```
node1 # mdadm --zero-superblock /dev/sd{b,c,d}
node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
node1 # cat /sys/block/md0/md/safe_mode_delay
0.204
node1 # mdadm -G /dev/md0 -b none
node1 # mdadm --grow /dev/md0 --bitmap=clustered
node1 # cat /sys/block/md0/md/safe_mode_delay
0.204 <== doesn't change, should ZERO for cluster-md

node1 # mdadm --zero-superblock /dev/sd{b,c,d}
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
node1 # cat /sys/block/md0/md/safe_mode_delay
0.000
node1 # mdadm -G /dev/md0 -b none
node1 # cat /sys/block/md0/md/safe_mode_delay
0.000 <== doesn't change, should default value
```

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 7e0adbfc 07-Jun-2020 Christoph Hellwig <hch@lst.de>

md: rewrite md_setup_drive to avoid ioctls

md_setup_drive knows it works with md devices, so it is rather pointless
to open a file descriptor and issue ioctls. Just call directly into the
relevant low-level md routines after getting a handle to the device using
blkdev_get_by_dev instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>


# d82fa81c 06-Jun-2020 Christoph Hellwig <hch@lst.de>

md: replace the RAID_AUTORUN ioctl with a direct function call

Instead of using a special RAID_AUTORUN ioctl that only exists for
non-modular builds and is only called from the early init code, just
call the actual function directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e3b8a8d 15-Jul-2020 Damien Le Moal <damien.lemoal@wdc.com>

md: Fix compilation warning

Remove the if statement around the calls to sysfs_link_rdev() to avoid
the compilation warnings:

warning: suggest braces around empty body in an ‘if’ statement

when compiling with W=1. For the call to sysfs_create_link() generating
the same warning, use the err variable to store the function result,
avoiding triggering another warning as the function is declared
as 'warn_unused_result'.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# e1a86dbb 14-Jul-2020 Junxiao Bi <junxiao.bi@oracle.com>

md: fix deadlock causing by sysfs_notify

The following deadlock was captured. The first process holds 'kernfs_mutex'
and is hung waiting for I/O. The I/O is staged in 'r1conf.pending_bio_list'
of the raid1 device; this pending bio list would be flushed by the second
process, 'md127_raid1', but that process is hung waiting for 'kernfs_mutex'.
Replacing sysfs_notify() with sysfs_notify_dirent_safe() fixes it. Other
sysfs_notify() calls invoked from the I/O path are removed as well.

PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file"
#0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
#1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
#2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
#3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
#4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
#5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
#6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
#7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
#8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
#9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
#10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
#11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
#12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
#13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
#14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
#15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
#16 [ffffb87c4df37958] new_slab at ffffffff9a251430
#17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
#18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
#19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
#20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
#21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
#22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
#23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
#24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
#25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
#26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
#27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
#28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
#29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
#30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
#31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
#32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
#33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905
RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890
RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c
R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0
R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1"
#0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
#1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
#2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
#3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
#4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
#5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
#6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
#7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
#8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
#9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
#10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
#11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
#12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
#13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
#14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 41d2d848 03-Jul-2020 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: improve io stats accounting

Use generic io accounting functions to manage io stats. There was an
attempt to do this earlier in commit 18c0b223cf99 ("md: use generic io
stats accounting functions to simplify io stat accounting"), but it did
not include a call to generic_end_io_acct() and caused issues with
tracking in-flight IOs, so it was later removed in commit 74672d069b29
("md: fix md io stats accounting broken").

This patch attempts to fix this by using both disk_start_io_acct() and
disk_end_io_acct(). To make it possible, a struct md_io is allocated for
every new md bio, which includes the io start_time. A new mempool is
introduced for this purpose. We override bio->bi_end_io with our own
callback and call disk_start_io_acct() before passing the bio to
md_handle_request(). When it completes, we call disk_end_io_acct() and
the original bi_end_io callback.

This adds correct statistics about in-flight IOs and IO processing time,
interpreted e.g. in iostat as await, svctm, aqu-sz and %util.

It also fixes a situation where too many IOs were reported if a bio was
re-submitted to the mddev, because io accounting is now performed only
on newly arriving bios.

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
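
The wrap-the-completion-callback pattern described above can be sketched outside the kernel. The Python model below is an illustration under stated assumptions: `Bio`, `Stats` and `start_accounting` are hypothetical names, and the injectable `clock` exists only to make the sketch testable. It is not the md implementation.

```python
import time


class Bio:
    """Toy bio: only carries its completion callback."""

    def __init__(self, end_io):
        self.end_io = end_io


class Stats:
    """Toy per-disk counters."""

    def __init__(self):
        self.in_flight = 0
        self.completed = 0
        self.total_ns = 0


def start_accounting(bio, stats, clock=time.monotonic_ns):
    """Record the start time and hook the bio's completion callback."""
    # Per-bio bookkeeping, playing the role of the md_io described above.
    md_io = {"orig_end_io": bio.end_io, "start": clock()}
    stats.in_flight += 1

    def md_end_io(b):
        # Finish accounting first, then hand over to the original callback.
        stats.in_flight -= 1
        stats.completed += 1
        stats.total_ns += clock() - md_io["start"]
        b.end_io = md_io["orig_end_io"]
        b.end_io(b)

    bio.end_io = md_end_io
```

Because the hook is installed once per newly arriving bio, a bio that is re-submitted internally would not be counted a second time in this model.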


# 9a5a8597 01-Jul-2020 Colin Ian King <colin.king@canonical.com>

md: raid0/linear: fix dereference before null check on pointer mddev

Pointer mddev is being dereferenced with a test_bit call before mddev
is being null checked, this may cause a null pointer dereference. Fix
this by moving the null pointer checks to sanity check mddev before
it is dereferenced.

Addresses-Coverity: ("Dereference before null check")
Fixes: 62f7b1989c02 ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 21cf8661 01-Jul-2020 Christoph Hellwig <hch@lst.de>

writeback: remove bdi->congested_fn

Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a564e23f 08-Jul-2020 Christoph Hellwig <hch@lst.de>

md: switch to ->check_events for media change notifications

md is the last driver using the legacy media_changed method. Switch
it over to the (not so) new ->check_events approach, which also removes
the need for the ->revalidate_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e556f6ba 26-Jun-2020 Christoph Hellwig <hch@lst.de>

block: remove the bd_queue field from struct block_device

Just use bd_disk->queue instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c62b37d9 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: move ->make_request_fn to struct block_device_operations

The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f695ca38 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: remove the request_queue argument from blk_queue_split

The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3f99980c 11-May-2020 Xiongfeng Wang <wangxiongfeng2@huawei.com>

md: add a newline when printing parameter 'start_ro' by sysfs

Add a missing newline when printing module parameter 'start_ro' by
sysfs.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# e4fc5a74 08-May-2020 Christoph Hellwig <hch@lst.de>

md: stop using ->queuedata

Pointer to mddev is already available in private_data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3024ba2d 09-Apr-2020 Coly Li <colyli@suse.de>

md: remove redundant memalloc scope API usage

In mddev_create_serial_pool(), the memalloc scope APIs memalloc_noio_save()
and memalloc_noio_restore() are used when allocating memory by calling
mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in the
raid array suspend context, it is unnecessary to explicitly call them
around mempool_create_kmalloc_pool() any longer.

This patch removes the redundant memalloc scope APIs in
mddev_create_serial_pool().

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 78f57ef9 09-Apr-2020 Coly Li <colyli@suse.de>

md: use memalloc scope APIs in mddev_suspend()/mddev_resume()

In raid5.c:resize_chunk(), scribble_alloc() is called with the GFP_NOIO
flag, which is then passed into kvmalloc_array() inside scribble_alloc().

The problem is that kvmalloc_array() eventually calls kvmalloc_node(),
which does not accept flags that are not GFP_KERNEL compatible, such as
GFP_NOIO, so kmalloc_node() is called instead to allocate physically
contiguous pages. When system memory is under heavy pressure and the
requested size is large, there is a high probability that allocating
contiguous pages will fail.

But simply using the GFP_KERNEL flag to call kvmalloc_array() is also
problematic. In the code path where scribble_alloc() is called, the
raid array is suspended. If kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspended raid array, a deadlock will happen.

What is desired here is to allocate non-physically (a.k.a. virtually)
contiguous pages and avoid memory reclaim I/Os. Michal Hocko suggests
using the memalloc scope APIs to restrict memory reclaim I/O in the
allocating context, specifically calling memalloc_noio_save() when
suspending the raid array and memalloc_noio_restore() when
resuming the raid array.

This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os while the raid array
is suspended. The benefit of adding the memalloc scope APIs in the
unified entry points mddev_suspend()/mddev_resume() is that, no matter
which md raid array type (personality) is used, we are sure the deadlock
caused by recursive memory reclaim I/O won't happen in the suspending
context.

Please notice that the memalloc scope APIs only take effect in the raid
array suspending context. If the memory allocation comes from another,
newly created kthread after the raid array is suspended, the recursive
memory reclaim I/Os won't be restricted. The mddev_suspend()/mddev_resume()
entries are used for the critical section where the raid metadata is being
modified; creating a kthread to allocate memory inside the critical section
is odd and very probably buggy.

Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
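
The scope idea above (set a per-context "no I/O" mark on suspend, restore the previous value on resume) can be illustrated with a user-space model. The Python functions below borrow the kernel names purely for readability; they are a sketch built on a thread-local flag and are not the kernel implementation. `allocation_may_start_io` is a hypothetical helper.

```python
import threading

_ctx = threading.local()  # one flag per thread, like a per-task flag


def memalloc_noio_save():
    """Enter a no-I/O scope and return the previous state for restore."""
    old = getattr(_ctx, "noio", False)
    _ctx.noio = True
    return old


def memalloc_noio_restore(old):
    """Leave the scope by putting back the state that save returned."""
    _ctx.noio = old


def allocation_may_start_io():
    """Would an allocation in this context be allowed to trigger I/O?"""
    return not getattr(_ctx, "noio", False)
```

Two properties of the model match the text: nested save/restore pairs compose because each restore reinstates the saved value, and a thread created inside the scope starts with its own unset flag, so the restriction does not follow it.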


# 3f79cc22 04-Apr-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: remove the extra line for ->hot_add_disk

It is not necessary to add a newline for them since they don't exceed
80 characters, and it is not intuitive to distinguish ->hot_add_disk() from
hot_add_disk() either.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 78b990cf 04-Apr-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: flush md_rdev_misc_wq for HOT_ADD_DISK case

Since rdev->kobj is removed asynchronously, it is possible that
rdev->kobj still exists when we try to add the rdev again after it
has been removed. But the path md_ioctl (HOT_ADD_DISK) -> hot_add_disk
-> bind_rdev_to_array missed it.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# f6766ff6 04-Apr-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: don't flush workqueue unconditionally in md_open

We need to check mddev->del_work before flushing the workqueue, since the
purpose of the flush is to ensure the previous md has disappeared. Otherwise
a similar deadlock is reported if LOCKDEP is enabled, because md_open holds
bdev->bd_mutex before flushing the workqueue.

kernel: [ 154.522645] ======================================================
kernel: [ 154.522647] WARNING: possible circular locking dependency detected
kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
kernel: [ 154.522651] ------------------------------------------------------
kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [ 154.522673]
kernel: [ 154.522673] but task is already holding lock:
kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522691]
kernel: [ 154.522691] which lock already depends on the new lock.
kernel: [ 154.522691]
kernel: [ 154.522694]
kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
kernel: [ 154.522696]
kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [ 154.522704] __mutex_lock+0x87/0x950
kernel: [ 154.522706] __blkdev_get+0x79/0x590
kernel: [ 154.522708] blkdev_get+0x65/0x140
kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40
kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod]
kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod]
kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod]
kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522735] vfs_write+0xad/0x1a0
kernel: [ 154.522737] ksys_write+0xa4/0xe0
kernel: [ 154.522745] do_syscall_64+0x64/0x2b0
kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522749]
kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [ 154.522752] __mutex_lock+0x87/0x950
kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod]
kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522763] vfs_write+0xad/0x1a0
kernel: [ 154.522765] ksys_write+0xa4/0xe0
kernel: [ 154.522767] do_syscall_64+0x64/0x2b0
kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522770]
kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0
kernel: [ 154.522778] kernfs_remove+0x1f/0x30
kernel: [ 154.522780] kobject_del+0x28/0x60
kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [ 154.522786] process_one_work+0x2a7/0x5f0
kernel: [ 154.522788] worker_thread+0x2d/0x3d0
kernel: [ 154.522793] kthread+0x117/0x130
kernel: [ 154.522795] ret_from_fork+0x3a/0x50
kernel: [ 154.522796]
kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [ 154.522800] process_one_work+0x27e/0x5f0
kernel: [ 154.522802] worker_thread+0x2d/0x3d0
kernel: [ 154.522804] kthread+0x117/0x130
kernel: [ 154.522806] ret_from_fork+0x3a/0x50
kernel: [ 154.522807]
kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [ 154.522813] __lock_acquire+0x1392/0x1690
kernel: [ 154.522816] lock_acquire+0xb4/0x1a0
kernel: [ 154.522818] flush_workqueue+0xab/0x4b0
kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522823] __blkdev_get+0xea/0x590
kernel: [ 154.522825] blkdev_get+0x65/0x140
kernel: [ 154.522828] do_dentry_open+0x1d1/0x380
kernel: [ 154.522831] path_openat+0x567/0xcc0
kernel: [ 154.522834] do_filp_open+0x9b/0x110
kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522838] do_sys_open+0x57/0x80
kernel: [ 154.522840] do_syscall_64+0x64/0x2b0
kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522844]
kernel: [ 154.522844] other info that might help us debug this:
kernel: [ 154.522844]
kernel: [ 154.522846] Chain exists of:
kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [ 154.522846]
kernel: [ 154.522850] Possible unsafe locking scenario:
kernel: [ 154.522850]
kernel: [ 154.522852] CPU0 CPU1
kernel: [ 154.522853] ---- ----
kernel: [ 154.522854] lock(&bdev->bd_mutex);
kernel: [ 154.522856] lock(&mddev->reconfig_mutex);
kernel: [ 154.522858] lock(&bdev->bd_mutex);
kernel: [ 154.522860] lock((wq_completion)md_misc);
kernel: [ 154.522861]
kernel: [ 154.522861] *** DEADLOCK ***
kernel: [ 154.522861]
kernel: [ 154.522864] 1 lock held by mdadm/2482:
kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522868]
kernel: [ 154.522868] stack backtrace:
kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [ 154.522878] Call Trace:
kernel: [ 154.522881] dump_stack+0x8f/0xcb
kernel: [ 154.522884] check_noncircular+0x194/0x1b0
kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690
kernel: [ 154.522890] __lock_acquire+0x1392/0x1690
kernel: [ 154.522893] lock_acquire+0xb4/0x1a0
kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522898] flush_workqueue+0xab/0x4b0
kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522910] __blkdev_get+0xea/0x590
kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522914] blkdev_get+0x65/0x140
kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522918] do_dentry_open+0x1d1/0x380
kernel: [ 154.522921] path_openat+0x567/0xcc0
kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690
kernel: [ 154.522926] do_filp_open+0x9b/0x110
kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0
kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630
kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0
kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522944] do_sys_open+0x57/0x80
kernel: [ 154.522946] do_syscall_64+0x64/0x2b0
kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae

md_alloc also flushes the same workqueue, but the situation is different
there: none of the paths that call md_alloc hold bdev->bd_mutex, and
the flush is necessary to avoid a race condition, so leave it as it is.
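
The "flush only when the work is pending" idea can be sketched in
userspace as follows. The model_* types and functions are hypothetical
stand-ins for the work item and workqueue; in the kernel the
corresponding primitives would be work_pending() and flush_workqueue().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for a work item and the workqueue it sits on. */
struct model_work {
	bool pending;
};

struct model_wq {
	int flush_count;	/* how many times the queue was flushed */
};

/* Flushing waits for queued work to finish, so the work is no longer pending. */
static void model_flush_workqueue(struct model_wq *wq, struct model_work *w)
{
	wq->flush_count++;
	w->pending = false;
}

/*
 * Flush only if the delete work is actually queued. Returns true if a
 * flush happened. Skipping the flush in the common case avoids taking
 * the workqueue's lockdep map while another lock is already held.
 */
static bool model_flush_if_pending(struct model_wq *wq, struct model_work *del_work)
{
	if (!del_work->pending)
		return false;
	model_flush_workqueue(wq, del_work);
	return true;
}
```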

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# cc1ffe61 04-Apr-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: add new workqueue for delete rdev

Since the purpose of calling flush_workqueue in new_dev_store is to ensure
md_delayed_delete() has completed, we should check whether rdev->del_work
is pending.

To suppress the lockdep warning, we currently have to check mddev->del_work
even though md_delayed_delete is attached to rdev->del_work, which does not
match the purpose of flushing the workqueue. So a new workqueue is needed to
avoid this awkward situation, and a new function flush_rdev_wq is introduced
to flush the new workqueue after checking whether there is pending work.

Like new_dev_store, the ADD_NEW_DISK ioctl flushes the workqueue for the
same purpose while it holds bdev->bd_mutex, so apply the same change to
the ioctl to avoid a similar lock issue.

And md_delayed_delete actually deletes an rdev, so rename the function
to rdev_delayed_delete.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 21e0958e 04-Apr-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: add checkings before flush md_misc_wq

Coly reported a possible circular locking dependency with LOCKDEP enabled;
the info below is quoted from the detailed report [1].

[ 1607.673903] Chain exists of:
[ 1607.673903] kn->count#256 --> (wq_completion)md_misc -->
(work_completion)(&rdev->del_work)
[ 1607.673903]
[ 1607.827946] Possible unsafe locking scenario:
[ 1607.827946]
[ 1607.898780] CPU0 CPU1
[ 1607.952980] ---- ----
[ 1608.007173] lock((work_completion)(&rdev->del_work));
[ 1608.069690] lock((wq_completion)md_misc);
[ 1608.149887] lock((work_completion)(&rdev->del_work));
[ 1608.242563] lock(kn->count#256);
[ 1608.283238]
[ 1608.283238] *** DEADLOCK ***
[ 1608.283238]
[ 1608.354078] 2 locks held by kworker/5:0/843:
[ 1608.405152] #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
process_one_work+0x42b/0xb30
[ 1608.512399] #1: ffff888a1d3b7e10
((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.632130]

Since the works (rdev->del_work and mddev->del_work) are queued on
md_misc_wq, the workqueue's lockdep_map lock is held while either of them
is running, and both of them then try to take the kernfs lock by calling
kobject_del. If new_dev_store or array_state_store is triggered by a write
to the related sysfs node, the write operation takes the kernfs lock but
then needs the lockdep_map, because both end up calling
flush_workqueue(md_misc_wq), which requires the same lockdep_map lock.

To suppress the lockdep warning, we should flush the workqueue only when the
related work is pending. Several works are attached to md_misc_wq, so
we need to decide which work should be checked:

1. for __md_stop_writes, the purpose of flushing the workqueue is to ensure
the sync thread is started if it was starting, so check whether
mddev->del_work is pending, since md_start_sync is attached to
mddev->del_work.

2. __md_stop flushes md_misc_wq to ensure event_work is done, so checking
event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't
need the kernfs lock.

3. both new_dev_store (holds the kernfs lock) and the ADD_NEW_DISK ioctl
(holds bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has
completed; this case will be handled in the next patch.

4. md_open flushes the workqueue to ensure the previous md has disappeared,
but it holds bdev->bd_mutex and then tries to flush the workqueue, so it is
better to check mddev->del_work as well to avoid a potential lock issue;
this will be done in another patch.

[1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2

Cc: Coly Li <colyli@suse.de>
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3d745ea5 27-Mar-2020 Christoph Hellwig <hch@lst.de>

block: simplify queue allocation

Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper. Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6a564ff 25-Mar-2020 Christoph Hellwig <hch@lst.de>

block: move the part_stat* helpers from genhd.h to a new header

These macros are just used by a few files. Move them out of genhd.h,
which is included everywhere into a new standalone header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 74cc979c 24-Mar-2020 Christoph Hellwig <hch@lst.de>

block: cleanup how md_autodetect_dev is called

Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code. Then use IS_BUILTIN to call it instead of the
ifdef magic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea3edd4d 24-Mar-2020 Christoph Hellwig <hch@lst.de>

block: remove __bdevname

There is no good reason for __bdevname to exist. Just open code
printing the string in the callers. For three of them the format
string can be trivially merged into existing printk statements,
and in init/do_mounts.c we can at least do the scnprintf once at
the start of the function, and unconditional of CONFIG_BLOCK to
make the output for tiny configfs a little more helpful.

Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6b40bec3 11-Feb-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: check arrays is suspended in mddev_detach before call quiesce operations

Don't call quiesce(1) and quiesce(0) if the array is already suspended;
otherwise, in level_store, the array becomes writable right after
mddev_detach in the sequence below, though the intention is to make the
array writable only after resume.

mddev_suspend(mddev);
mddev_detach(mddev);
...
mddev_resume(mddev);

It also causes the call trace below, reported in [1].

[48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
[...]
[48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
[48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
[48005.653984] RIP: 0010:kthread_park+0x77/0x90
[48005.654015] Call Trace:
[48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
[48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
[48005.654073] mddev_detach+0x30/0x70 [md_mod]
[48005.654090] level_store+0x202/0x670 [md_mod]
[48005.654099] ? security_capable+0x40/0x60
[48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
[48005.654123] kernfs_fop_write+0xce/0x1b0
[48005.654132] vfs_write+0xb6/0x1a0
[48005.654138] ksys_write+0x67/0xe0
[48005.654146] do_syscall_64+0x4e/0x140
[48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[48005.654161] RIP: 0033:0x7fa0c8737497

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 97a32539 03-Feb-2020 Alexey Dobriyan <adobriyan@gmail.com>

proc: convert everything to "struct proc_ops"

The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

llseek => proc_lseek
unlocked_ioctl => proc_ioctl

xxx => proc_xxx

delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 025471f9 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md/raid1: use bucket based mechanism for IO serialization

Since raid1 already uses a bucket based mechanism to reduce
the conflict between write IO and resync IO, it is possible to
improve the performance of IO serialization by using the
same mechanism.

To align with the barrier bucket mechanism, we create arrays
(each with BARRIER_BUCKETS_NR elements) for the spinlock, rb
tree and waitqueue. Then we can reduce lock contention with
multiple spinlocks, boost search performance with multiple rb
trees and also reduce the thundering herd problem with multiple
waitqueues.
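
As an illustration of the bucket idea, a sector can be mapped to one of
a fixed number of buckets, each owning its own lock, tree and
waitqueue. The sketch below is a userspace model only: the constants
and the model_sector_to_idx() name are assumptions, and a plain modulo
stands in for whatever hash function raid1 really uses.

```c
#include <stdint.h>

/* Illustrative constants, not necessarily the values raid1 uses. */
#define MODEL_UNIT_SECTOR_BITS	17	/* 2^17 sectors share one bucket unit */
#define MODEL_BUCKETS_NR	1024u	/* number of lock/tree/waitqueue slots */

/*
 * Map a sector to a bucket index. All sectors inside the same unit land
 * in the same bucket, so overlapping writes within a unit meet on the
 * same lock and tree, while unrelated units spread across buckets.
 */
static unsigned int model_sector_to_idx(uint64_t sector)
{
	return (unsigned int)((sector >> MODEL_UNIT_SECTOR_BITS) % MODEL_BUCKETS_NR);
}
```

A caller would then index per-bucket arrays with the returned value
instead of contending on one global lock.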

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 69b00b5b 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: introduce a new struct for IO serialization

IO serialization can degrade performance considerably. To
reduce the degradation, an rb interval tree is added in raid1
to speed up the collision check.

So an rb root is needed in md_rdev; abstract all the
serialization related members into a new struct
(serial_in_rdev) and embed it into md_rdev.

Of course, we need to free the struct when it is no longer
needed, so rdev/rdevs_uninit_serial are added accordingly.
They should be called when destroying the memory pool or when
memory allocation fails.

We also need to consider calling mddev_destroy_serial_pool
when serialize_policy/write-behind is disabled, when the bitmap
is destroyed, or in __md_stop_writes.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# de31ee94 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: reorgnize mddev_create/destroy_serial_pool

So far, IO serialization is used in two scenarios:

1. raid1 with write-behind mode enabled, where the array contains an rdev
that is a multi-queue device and is flagged writemostly.
2. IO serialization is enabled or disabled by changing serialize_policy.

So introduce rdev_need_serial to check the first scenario. For 1, IO
serialization is enabled automatically, while 2 is controlled manually.

It is possible that both scenarios are true, so when creating the serial
pool, rdev/rdevs_init_serial should be separate from the check of whether
the pool already exists. When destroying the pool, we need to check whether
it is still needed by other rdevs due to the first scenario.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3938f5fb 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: add serialize_policy sysfs node for raid1

With the new sysfs node, we can control whether a raid1 array
uses IO serialization. mddev_create_serial_pool and
mddev_destroy_serial_pool are called in serialize_policy_store
to enable or disable the serialization.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 11d3a9f6 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: prepare for enable raid1 io serialization

1. The related resources (spin_lock, list and waitqueue) are also needed to
address the raid1 reorder overlap issue. In this case, rdev is set to
NULL for mddev_create/destroy_serial_pool, which implies that all rdevs
need to handle these resources.

And also add "is_suspend" to mddev_destroy_serial_pool since it will
be called while suspended, which also gives create and destroy pool
the same arguments.

2. Introduce rdevs_init_serial, which is called if raid1 io serialization
is enabled, since all rdevs need to initialize the related resources.

3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
be called between suspend and resume.

No need to export mddev_create_serial_pool since it is only called in
the md-mod module.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3e173ab5 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: fix a typo s/creat/create

It actually means create here, so fix the typo.

Reported-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 404659cf 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: rename wb stuffs

Previously, wb_info_pool and wb_list were introduced
to address a potential data inconsistency issue for
write-behind devices.

Now rename them to serial related names, since the same
mechanism will be used to address the reorder overlap write
issue for raid1.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3b7436cc 10-Dec-2019 Yufen Yu <yuyufen@huawei.com>

md: make sure desc_nr less than MD_SB_DISKS

For super_90_load, we need to make sure 'desc_nr' is less
than MD_SB_DISKS, avoiding invalid memory access of 'sb->disks'.
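
The kind of bound check meant here can be sketched in userspace as
below. MODEL_SB_DISKS and the model_* names are hypothetical; the array
length is only illustrative and stands in for MD_SB_DISKS.

```c
#include <stdbool.h>

/* Illustrative array length standing in for MD_SB_DISKS. */
#define MODEL_SB_DISKS 27

/* Hypothetical cut-down superblock with a fixed-size per-disk table. */
struct model_sb {
	int disks[MODEL_SB_DISKS];
};

/* A descriptor number is only usable as an index if it is in range. */
static bool model_desc_nr_valid(int desc_nr)
{
	return desc_nr >= 0 && desc_nr < MODEL_SB_DISKS;
}

/* Read a slot, refusing out-of-range indexes instead of reading past the table. */
static bool model_read_disk(const struct model_sb *sb, int desc_nr, int *out)
{
	if (!model_desc_nr_valid(desc_nr))
		return false;
	*out = sb->disks[desc_nr];
	return true;
}
```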

Fixes: 228fc7d76db6 ("md: avoid invalid memory access for array sb->dev_roles")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 228fc7d7 30-Oct-2019 Yufen Yu <yuyufen@huawei.com>

md: avoid invalid memory access for array sb->dev_roles

We need to guarantee that 'desc_nr' is valid before accessing the
sb->dev_roles array.

In addition, we should avoid having .load_super always return '0'
when level is LEVEL_MULTIPATH, which is not expected.

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses")
Fixes: 6a5cb53aaa4e ("md: no longer compare spare disk superblock events in super_load")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 6a5cb53a 16-Oct-2019 Yufen Yu <yuyufen@huawei.com>

md: no longer compare spare disk superblock events in super_load

We have a test case as follows:

mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
--assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force

mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda

echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1

echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force

When we re-add /dev/sda to the array, it starts recovery.
After the other two disks in md1 are taken offline, the recovery
is interrupted and superblock update info cannot be written
to the offline disks, while the spare disk (/dev/sda) can continue
to update its superblock info.

After stopping the array and assembling it again, we found that the
array fails to run, with the following kernel messages:

[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.

Since both sdb and sdc have a value of 'sb->events' smaller than
that in sda, they are kicked from the array. However, the only
remaining disk, sda, was in 'spare' state before the stop and cannot
be added to the conf->mirrors[] array. In the end, the raid array
fails to assemble and run.

In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().

To fix the problem, we do not compare superblock events when it is
a spare disk, the same as validate_super does.
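
The selection rule can be sketched as a userspace model: when choosing
the freshest device, spares are skipped so that their event counts
never decide freshness. The struct and function names are hypothetical,
and returning -1 when only spares exist is a simplification of this
sketch, not something the patch specifies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down view of a member device. */
struct model_rdev {
	const char *name;
	uint64_t events;	/* superblock event counter */
	bool spare;		/* true if the device holds no array role yet */
};

/*
 * Return the index of the non-spare device with the highest event
 * count, or -1 if every device is a spare.
 */
static int model_pick_freshest(const struct model_rdev *devs, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (devs[i].spare)
			continue;	/* a spare must not be chosen as fresh */
		if (best < 0 || devs[i].events > devs[(size_t)best].events)
			best = (int)i;
	}
	return best;
}
```

With the test case above, sda carries the highest event count but is a
spare, so sdb or sdc is selected and the array can still be assembled.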

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 775d7831 16-Sep-2019 David Jeffery <djeffery@redhat.com>

md: improve handling of bio with REQ_PREFLUSH in md_flush_request()

If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate whether the original make_request
call should continue to handle the I/O, instead of assuming the flush logic
will push it to completion.

Convert md_flush_request to return a bool and to no longer call the raid
driver's make_request function. If the return is true, then the md flush
logic has completed or will complete the bio, and the md make_request call
is done. If false, then the md make_request function needs to keep
processing it like a normal bio. Let the original call to md_handle_request
handle any need to retry sending the bio to the raid driver's make_request
function should it be needed.

Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.
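
The calling convention can be modelled in userspace as follows. This is
a deliberately simplified sketch: the model_* names are hypothetical,
and the flush path only distinguishes an empty flush from a flush that
carries data.

```c
#include <stdbool.h>

/* Hypothetical cut-down bio. */
struct model_bio {
	bool preflush;		/* request asks for a cache flush first */
	unsigned int size;	/* payload size in bytes */
	bool handled_by_flush;	/* completed by the flush logic */
	bool submitted;		/* passed on for normal processing */
};

/*
 * Returns true if the flush logic has completed (or will complete) the
 * bio; false if the caller must keep processing it as a normal bio.
 */
static bool model_flush_request(struct model_bio *bio)
{
	if (bio->size == 0) {
		bio->handled_by_flush = true;	/* pure flush: nothing left to do */
		return true;
	}
	bio->preflush = false;	/* flush part satisfied, data still to be written */
	return false;
}

/* The caller owns the decision to continue, so the bio is never lost. */
static void model_make_request(struct model_bio *bio)
{
	if (bio->preflush && model_flush_request(bio))
		return;
	bio->submitted = true;
}
```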

Fixes: 2bc13b83e629 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 33f2c35a 09-Sep-2019 NeilBrown <neilb@suse.de>

md: add feature flag MD_FEATURE_RAID0_LAYOUT

Due to a bug introduced in Linux 3.14 we cannot determine the
correct layout for a multi-zone RAID0 array - there are two
possibilities.

It is possible to tell the kernel which to choose using a module
parameter, but this can be clumsy to use. It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.
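
The decision described above can be sketched in userspace. The bit
value chosen for MODEL_FEATURE_RAID0_LAYOUT and the struct layout are
assumptions made for illustration only; the sketch covers RAID0 arrays
and nothing else.

```c
#include <stdint.h>

/* Hypothetical bit position for the feature flag. */
#define MODEL_FEATURE_RAID0_LAYOUT (1u << 12)

/* Hypothetical cut-down superblock. */
struct model_sb1 {
	uint32_t feature_map;
	int32_t layout;
};

/*
 * Layout to use for a RAID0 array: the recorded one if the flag says it
 * is meaningful, otherwise -1 so that the module parameter must decide.
 */
static int32_t model_raid0_layout(const struct model_sb1 *sb)
{
	if (sb->feature_map & MODEL_FEATURE_RAID0_LAYOUT)
		return sb->layout;
	return -1;
}
```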

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 62f7b198 03-Sep-2019 Guilherme G. Piccoli <gpiccoli@canonical.com>

md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone

Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.

In other words, no -EIO is returned and writes (except direct ones) appear
normal, meaning the user might think the written data is correctly stored in
the array, but instead garbage was written, given that raid0 does striping
(and so requires all its members to be working in order not to corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it will cause file corruption,
since the portion of the writes going to the missing devices is not
actually written.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written to 'array_state': it only shows that
one or more members of the array are missing and otherwise acts like
'clean', so it wouldn't make sense to write it.

With this patch, the filesystem reacts much faster to a missing array
member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
to the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested with ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.

Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 9d4b45d6 19-Aug-2019 NeilBrown <neilb@suse.com>

md: don't report active array_state until after revalidate_disk() completes.

Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready. As soon as
it appears to be ready, fsck can be run. If it finds the size to be
zero, it will fail.
So add a new flag to provide an interlock between do_md_run() and
array_state_show(). This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.
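
The interlock can be modelled in userspace as below. All model_* names
are hypothetical, the run is split into two steps only so that the
intermediate state is observable, and the real array_state_show()
reports more states than this sketch.

```c
#include <stdbool.h>

enum model_state { MODEL_INACTIVE, MODEL_ACTIVE };

/* Hypothetical cut-down array object. */
struct model_array {
	bool has_pers;	/* personality attached, i.e. ->pers != NULL */
	bool not_ready;	/* stands in for the MD_NOT_READY flag */
};

/* Report "active" only when a personality exists and the run has finished. */
static enum model_state model_array_state_show(const struct model_array *a)
{
	if (a->has_pers && !a->not_ready)
		return MODEL_ACTIVE;
	return MODEL_INACTIVE;
}

/* First half of the run: personality is attached, size not yet revalidated. */
static void model_run_begin(struct model_array *a)
{
	a->not_ready = true;
	a->has_pers = true;
}

/* Second half: the size has been revalidated, so readers may see "active". */
static void model_run_finish(struct model_array *a)
{
	a->not_ready = false;
}
```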

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared. This ensures the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY is cleared twice. This is deliberate to cover
both success and error paths with minimal noise.

Fixes: b7b17c9b67e5 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 480523fe 19-Aug-2019 NeilBrown <neilb@suse.com>

md: only call set_in_sync() when it is expected to succeed.

Since commit 4ad23a976413 ("MD: use per-cpu counter for
writes_pending"), set_in_sync() is substantially more expensive: it
can wait for a full RCU grace period which can be 10s of milliseconds.

So we should only call it when the cost is justified.

md_check_recovery() currently calls set_in_sync() every time it finds
anything to do (on non-external active arrays). For an array
performing resync or recovery, this will be quite often.
Each call will introduce a delay to the md thread, which can noticeably
affect IO submission latency.

In md_check_recovery() we only need to call set_in_sync() if
'safemode' was non-zero at entry, meaning that there has been no
recent IO. So we save this "safemode was nonzero" state, and only
call set_in_sync() if it was non-zero.

This measurably reduces mean and maximum IO submission latency during
resync/recovery.
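
The "sample safemode at entry" pattern can be sketched in userspace as
follows. The model_* names are hypothetical and the recovery work
itself is elided; only the ordering of the sample, the reset and the
conditional call is shown.

```c
#include <stdbool.h>

/* Hypothetical cut-down array object. */
struct model_md {
	int safemode;		/* non-zero: no recent IO was seen */
	int set_in_sync_calls;	/* counts the expensive calls */
};

/* Stand-in for the expensive set_in_sync(). */
static void model_set_in_sync(struct model_md *m)
{
	m->set_in_sync_calls++;
}

static void model_check_recovery(struct model_md *m)
{
	/* Remember the state at entry, before it is reset below. */
	bool try_set_sync = m->safemode != 0;

	m->safemode = 0;

	/* ... other recovery work would happen here ... */

	if (try_set_sync)
		model_set_in_sync(m);	/* only pay the cost when it can succeed */
}
```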

Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 0d8ed0e9 24-Jul-2019 Guoqing Jiang <jgq516@gmail.com>

md: don't call spare_active in md_reap_sync_thread if all member devices can't work

When a disk is added to an array, md_reap_sync_thread is responsible
for activating the spare and setting the In_sync flag for the new
member in spare_active().

But suppose raid1 has one member disk A, and disk B is added to the
array. If we then offline A before all the data is synchronized from A
to B, B obviously doesn't have the latest data as A does, but B is
still marked with the In_sync flag.

So let's not call spare_active under that condition; otherwise B is
still shown with 'U' state, which is not correct.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 062f5b2a 24-Jul-2019 Guoqing Jiang <jgq516@gmail.com>

md: don't set In_sync if array is frozen

When a disk is added to array, the following path is called in mdadm.

Manage_subdevs -> sysfs_freeze_array
-> Manage_add
-> sysfs_set_str(&info, NULL, "sync_action","idle")

Then, on the kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set the In_sync flag.

In_sync means "device is in_sync with rest of array", but the newly
added disk still needs the resync thread to synchronize its data, and
md_reap_sync_thread will finally call spare_active to set In_sync for the
newly added disk. So don't set In_sync if the array is frozen.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 9a567843 24-Jul-2019 Guoqing Jiang <jgq516@gmail.com>

md: allow last device to be forcibly removed from RAID1/RAID10.

When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed. This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
In general this maximises access to data.

However, the current implementation also stops an admin from removing
the last device by direct action. This is rarely useful, but in many
cases is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add a 'fail_last_dev' member to 'struct mddev', so that we can bypass
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by changing a sysfs node.
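
The per-array override can be pictured with a small userspace sketch.
This is not the kernel code: struct array_state, working_disks and
may_fail_device() are hypothetical names chosen for illustration; only
fail_last_dev comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant bits of struct mddev. */
struct array_state {
	int working_disks;   /* devices still usable (assumed field) */
	bool fail_last_dev;  /* per-array override named in the commit */
};

/* Returns true if the device reporting an error may be marked faulty. */
static bool may_fail_device(const struct array_state *a)
{
	bool is_last = (a->working_disks == 1);

	/* Default behaviour: keep the last device unless the override is set. */
	if (is_last && !a->fail_last_dev)
		return false;
	return true;
}
```

With the override cleared the last device is protected as before; with
it set, the 'last disk' check is bypassed for that array only.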

Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# cf891607 23-Jul-2019 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

md: Convert to use int_pow()

Instead of a linear approach to calculating a power of 10, use the
generic int_pow(), which does it better.
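
As a rough illustration of the difference, the sketch below contrasts a
linear loop with exponentiation by squaring. Both function names are
hypothetical and the code is a userspace approximation, not the kernel's
int_pow() source.

```c
#include <assert.h>
#include <stdint.h>

/* Linear approach: one multiplication per unit of the exponent. */
static uint64_t pow10_linear(unsigned int exp)
{
	uint64_t result = 1;

	while (exp--)
		result *= 10;
	return result;
}

/* Exponentiation by squaring: about log2(exp) loop iterations. */
static uint64_t int_pow_sketch(uint64_t base, unsigned int exp)
{
	uint64_t result = 1;

	while (exp) {
		if (exp & 1)
			result *= base;
		exp >>= 1;
		base *= base;
	}
	return result;
}
```

Both return the same value for any exponent whose result fits in 64 bits.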

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# d494549a 14-Jun-2019 Guoqing Jiang <gqjiang@suse.com>

md: add bitmap_abort label in md_run

There are now two places that need to handle a failure by destroying
the bitmap, so move the common code to sit between the bitmap_abort
and abort labels.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 963c555e 14-Jun-2019 Guoqing Jiang <gqjiang@suse.com>

md: introduce mddev_create/destroy_wb_pool for the change of member device

Previously, we called rdev_init_wb to avoid potential data
inconsistency when the array is created.

Now we also need to call the function and create the mempool when a
device is added or is just flagged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
For safety, we mark an implicit GFP_NOIO allocation scope around
creating the mempool, between mddev_suspend and mddev_resume.

Conversely, the mempool should be removed after a member device or
its "writemostly" flag is removed, which is done by calling
mddev_destroy_wb_pool.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3e148a32 19-Jun-2019 Guoqing Jiang <gqjiang@suse.com>

md/raid1: fix potential data inconsistency issue with write behind device

For write-behind mode, we consider a write IO complete once it has
reached all the non-writemostly devices. That works fine for single
queue devices.

But for a multiqueue device, if lots of IOs come from the upper
layer, the write-behind device could issue those IOs to different
queues, and depending on each queue's delay there is no guarantee
that those IOs arrive in order.

To address the issue, we need to check for collisions among write
behind IOs: we can only continue when there is no collision, otherwise
we wait for the previous colliding IO to complete.

WBCollision is introduced for multiqueue devices working in
write-behind mode.
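
The collision check amounts to a range-overlap test against the
write-behind IOs still in flight. The sketch below assumes half-open
sector ranges kept in a plain array; struct wb_range and wb_collides()
are hypothetical names, not the structures used by raid1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one in-flight write-behind IO: sectors [lo, hi). */
struct wb_range {
	uint64_t lo;
	uint64_t hi;
};

/* Returns true if [lo, hi) overlaps any in-flight write-behind range. */
static bool wb_collides(const struct wb_range *pending, size_t n,
			uint64_t lo, uint64_t hi)
{
	for (size_t i = 0; i < n; i++) {
		if (lo < pending[i].hi && pending[i].lo < hi)
			return true;
	}
	return false;
}
```

A new write would proceed only when wb_collides() returns false, and
otherwise wait until the overlapping IO has completed.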

This patch doesn't handle the cases below, which could cause data
inconsistency as well; they will be handled in later
patches.

1. max_write_behind is modified by writing to the backlog node.
2. the array's bitmap is added or removed dynamically.
3. a member disk changes.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 9642fa73 13-Jun-2019 Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>

md: fix for divide error in status_resync

Stopping external metadata arrays during resync/recovery causes
retries, a loop of interrupting and restarting reconstruction, until
it hits a good moment to stop completely. During these retries
curr_mark_cnt can be small, especially on HDDs, so the subtraction
result can be smaller than 0. However, it is cast to uint without
checking. As a result, the status bar in /proc/mdstat looks strange
while stopping (it jumps between 0% and 99%).

The real problem occurs after commit 72deb455b5ec ("block: remove
CONFIG_LBDAF"). The sector_div() macro has been changed so that the
divisor is now cast to uint32. For db = -8 the divisor (db/32+1) becomes 0.

Check whether the db value can really be used, and replace this macro
with the div64_u64() inline.
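
The arithmetic behind the zero divisor can be reproduced in userspace.
The sketch assumes the divisor expression is db/32 + 1 evaluated in
64-bit unsigned arithmetic and then narrowed to 32 bits; the function
names are hypothetical and the guard only illustrates the idea of
checking db before dividing.

```c
#include <assert.h>
#include <stdint.h>

/* Divisor as a 32-bit-truncating division helper would see it. */
static uint32_t truncated_divisor(int64_t db)
{
	uint64_t wide = (uint64_t)db / 32 + 1; /* negative db wraps to a huge value */

	return (uint32_t)wide;                 /* the low 32 bits may be 0 */
}

/* Guarded estimate: refuse to divide when db is not a usable count. */
static uint64_t safe_remaining_estimate(uint64_t remaining, int64_t db)
{
	if (db <= 0)
		return 0;
	return remaining / ((uint64_t)db / 32 + 1);
}
```

For db = -8 the 64-bit value is 0x0800000000000000, whose low 32 bits
are all zero, hence the divide error.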

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# e5b521ee 14-Jun-2019 Yufen Yu <yuyufen@huawei.com>

md: fix spelling typo and add necessary space

This patch fixes a spelling typo and adds a necessary space in the code.
In addition, the patch gets rid of an unnecessary 'if'.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 168b305b 14-Jun-2019 Marcos Paulo de Souza <marcos.souza.org@gmail.com>

md: md.c: Return -ENODEV when mddev is NULL in rdev_attr_show

Commit c42d3240990814eec1e4b2b93fa0487fc4873aed
("md: return -ENODEV if rdev has no mddev assigned") changed
rdev_attr_store to return -ENODEV when rdev->mddev is NULL, now do the
same to rdev_attr_show.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# af1a8899 20-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ddde2af7 07-May-2019 Roman Gushchin <guro@fb.com>

md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT

Percpu reference counters should now be initialized with the
PERCPU_REF_ALLOW_REINIT flag in order to allow switching them to the
percpu mode from the atomic mode.
To make the percpu_ref_switch_to_percpu() call in set_in_sync()
succeed, let's initialize percpu refcounters with the
PERCPU_REF_ALLOW_REINIT flag.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>


# c42d3240 27-Mar-2019 Pawel Baldysiak <pawel.baldysiak@intel.com>

md: return -ENODEV if rdev has no mddev assigned

Mdadm expects that setting drive as faulty will fail with -EBUSY only if
this operation will cause RAID to be failed. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.

If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.
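
A minimal sketch of the two error cases, assuming simplified stand-in
types (struct mddev_sketch, struct rdev_sketch and set_faulty_sketch()
are hypothetical and much smaller than the real md structures):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real md structures. */
struct mddev_sketch {
	int would_fail_array; /* non-zero if failing the drive fails the RAID */
};

struct rdev_sketch {
	struct mddev_sketch *mddev; /* NULL while the rdev is being removed */
};

static int set_faulty_sketch(const struct rdev_sketch *rdev)
{
	if (!rdev->mddev)
		return -ENODEV; /* rdev no longer belongs to an array */
	if (rdev->mddev->would_fail_array)
		return -EBUSY;  /* the only case in which mdadm expects -EBUSY */
	return 0;
}
```

A caller can then stop the array only on -EBUSY and treat -ENODEV as
"the drive is already on its way out".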

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 2b598ee5 04-Apr-2019 Christoph Hellwig <hch@lst.de>

md: mark md_cluster_mod static

Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c. So just mark it static
and drop the export.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# ae50640b 04-Apr-2019 Christoph Hellwig <hch@lst.de>

md: use correct type in super_1_sync

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 00485d09 04-Apr-2019 Christoph Hellwig <hch@lst.de>

md: use correct type in super_1_load

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# ed4d0a4e 04-Apr-2019 Christoph Hellwig <hch@lst.de>

md: add a missing endianness conversion in check_sb_changes

The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.
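
To illustrate why the conversion matters, here is a userspace sketch
that decodes a little-endian 32-bit on-disk field into a native value
regardless of host byte order. le32_decode() is a hypothetical helper
written for this example; the kernel itself uses le32_to_cpu() on
__le32-typed fields.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 32-bit little-endian on-disk field into a native integer. */
static uint32_t le32_decode(const uint8_t b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}
```

Storing the raw bytes without this step would give the right value on
little-endian hosts and a byte-swapped one on big-endian hosts.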

Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# ee37e621 02-Apr-2019 Yufen Yu <yuyufen@huawei.com>

md: add mddev->pers to avoid potential NULL pointer dereference

When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which avoids a potential NULL pointer dereference in the following
add_bound_rdev().

Fixes: a6da4ef85cef ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 72deb455 05-Apr-2019 Christoph Hellwig <hch@lst.de>

block: remove CONFIG_LBDAF

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has caused various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2bc13b83 29-Mar-2019 NeilBrown <neilb@suse.com>

md: batch flush requests.

Currently if many flush requests are submitted to an md device in quick
succession, they are serialized and can take a long time to process.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished, and
allow any pending flush that was requested before the flush started
to complete without waiting any more.
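
The batching rule can be sketched with plain timestamps. The structure
and function below are hypothetical (the kernel tracks the times in
struct mddev and issues real flush bios); the sketch only models the
decision "has a flush that started at or after my request already
completed?".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for completed flushes. */
struct flush_state {
	uint64_t last_flush_start;   /* start time of the latest completed flush */
	unsigned int flushes_issued; /* lower-level flushes actually sent */
};

/*
 * Handle a flush requested at req_time. Returns true if a new lower-level
 * flush had to be issued (modelled as starting and completing at 'now').
 */
static bool handle_flush(struct flush_state *st, uint64_t req_time,
			 uint64_t now)
{
	/* A completed flush that started at or after the request covers it. */
	if (st->last_flush_start >= req_time)
		return false;

	st->last_flush_start = now;
	st->flushes_issued++;
	return true;
}
```

Requests that arrive while a flush is pending but were submitted before
it started therefore complete without triggering another flush.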

Test results from Xiao:

The test was run on a raid10 device created from 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
Operation                   Count    AvgLat     MaxLat
------------------------------------------------------
Deltree                       768    10.509     78.305
Flush                     2078376     0.013     10.094
Close                    21787697     0.019     18.821
LockX                       96580     0.007      3.184
Mkdir                         384     0.008      0.062
Rename                    1255883     0.191     23.534
ReadX                    46495589     0.020     14.230
WriteX                   14790591     7.123     60.706
Unlink                    5989118     0.440     54.551
UnlockX                     96580     0.005      2.736
FIND_FIRST               10393845     0.042     12.079
SET_FILE_INFORMATION      2415558     0.129     10.088
QUERY_FILE_INFORMATION    4711725     0.005      8.462
QUERY_PATH_INFORMATION   26883327     0.032     21.715
QUERY_FS_INFORMATION      4929409     0.010      8.238
NTCreateX                29660080     0.100     53.268

Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
Operation                   Count    AvgLat     MaxLat
------------------------------------------------------
Deltree                       256     8.326     36.761
Flush                      693291     3.974    180.269
Close                     7266404     0.009     36.929
LockX                       32160     0.006      0.840
Mkdir                         128     0.008      0.021
Rename                     418755     0.063     29.945
ReadX                    15498708     0.007      7.216
WriteX                    4932310    22.482    267.928
Unlink                    1997557     0.109     47.553
UnlockX                     32160     0.004      1.110
FIND_FIRST                3465791     0.036      7.320
SET_FILE_INFORMATION       805825     0.015      1.561
QUERY_FILE_INFORMATION    1570950     0.005      2.403
QUERY_PATH_INFORMATION    8965483     0.013     14.277
QUERY_FS_INFORMATION      1643626     0.009      3.314
NTCreateX                 9892174     0.061     41.278

Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
max_latency=267.939 ms

3. With patch1 and patch2
Operation                   Count    AvgLat     MaxLat
------------------------------------------------------
Deltree                       768     9.570     54.588
Flush                     2061354     0.666     15.102
Close                    21604811     0.012     25.697
LockX                       95770     0.007      1.424
Mkdir                         384     0.008      0.053
Rename                    1245411     0.096     12.263
ReadX                    46103198     0.011     12.116
WriteX                   14667988     7.375     60.069
Unlink                    5938936     0.173     30.905
UnlockX                     95770     0.005      4.147
FIND_FIRST               10306407     0.041     11.715
SET_FILE_INFORMATION      2395987     0.048      7.640
QUERY_FILE_INFORMATION    4672371     0.005      9.291
QUERY_PATH_INFORMATION   26656735     0.018     19.719
QUERY_FS_INFORMATION      4887940     0.010      7.654
NTCreateX                29410811     0.059     28.551

Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4bc034d3 29-Mar-2019 NeilBrown <neilb@suse.com>

Revert "MD: fix lock contention for flush bios"

This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85.

This patch has two problems.

1/ It makes multiple calls to submit_bio() from inside a make_request_fn.
The bios thus submitted will be queued on current->bio_list and not
submitted immediately. As the bios are allocated from a mempool,
this can theoretically result in a deadlock - all the pool of requests
could be in various ->bio_list queues and a subsequent mempool_alloc
could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
It handles this by submitting many requests in parallel - all of which
are identical and so most of which do nothing useful.
It would be more efficient to just send one lower-level request, but
allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6251691a 14-Jan-2019 Marcos Paulo de Souza <marcos.souza.org@gmail.com>

md: Make bio_alloc_mddev use bio_alloc_bioset

bio_alloc_bioset returns a bio pointer or NULL, so we can avoid storing
the returned data into a new variable.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 37b22c28 13-Nov-2018 Chengguang Xu <cgxu519@gmx.com>

md: remvoe redundant condition check

mempool_destroy() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
mempool_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# f91389c8 07-Nov-2018 Yue Haibing <yuehaibing@huawei.com>

md: remove set but not used variable 'bi_rdev'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/md/md.c: In function 'md_integrity_add_rdev':
drivers/md/md.c:2149:24: warning:
variable 'bi_rdev' set but not used [-Wunused-but-set-variable]

It is not used any more after commit
1501efadc524 ("md/raid: only permit hot-add of compatible integrity profiles")

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 112f158f 06-Dec-2018 Mike Snitzer <snitzer@redhat.com>

block: stop passing 'cpu' to all percpu stats methods

All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to all of them. Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# af9b926d 19-Oct-2018 Xiao Ni <xni@redhat.com>

MD: Memory leak when flush bio size is zero

flush_pool is leaked when flush bio size is zero

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 6aaa58c9 19-Oct-2018 Jack Wang <jinpu.wang@profitbricks.com>

md: fix memleak for mempool

I noticed kmemleak reporting a memory leak when creating and stopping
md in a loop. Backtrace:
[<000000001ca975e7>] mempool_create_node+0x86/0xd0
[<0000000095576bcd>] md_run+0x1057/0x1410 [md_mod]
[<000000007b45c5fc>] do_md_run+0x15/0x130 [md_mod]
[<000000001ede9ec0>] md_ioctl+0x1f49/0x25d0 [md_mod]
[<000000004142cacf>] blkdev_ioctl+0x680/0xd00

The root cause is that we allocate mddev->flush_pool and
mddev->flush_bio_pool in md_run, but do_md_stop does not call
md_stop, only __md_stop. Moving the mempool_destroy calls to
__md_stop fixes the problem for me.

The bug was introduced in 5a409b4f56d5, so the fix should go to
4.18+.

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# cb9ee154 18-Oct-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted

We need to continue the reshape if it was interrupted on the
original node. So the original node should call resync_bitmap
in case the reshape is aborted.

The BITMAP_NEEDS_SYNC message is then broadcast to the other nodes,
and the node which continues the reshape should restart it from
mddev->reshape_position instead of from the very beginning.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ca1e98e0 18-Oct-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster/raid10: don't call remove_and_add_spares during reshaping stage

remove_and_add_spares is not needed if reshape is
happening in another node, because raid10_add_disk
called inside raid10_start_reshape would handle the
role changes of disk. Plus, remove_and_add_spares
can't deal with the role change due to reshape.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# aefb2e5f 18-Oct-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster/raid10: call update_size in md_reap_sync_thread

We need to change the capacity on all nodes after one node
finishes a reshape. And as before, we can't change the
capacity directly in md_do_sync; instead, the capacity should
only be changed in update_size or on receiving a CHANGE_CAPACITY
msg.

So the master node calls update_size after it completes the reshape in
md_reap_sync_thread, but we need to skip ops->update_size if
MD_CLOSING is set, since the reshape could not have finished.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 7564beda 18-Oct-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster/raid10: support add disk under grow mode

In the clustered raid10 scenario, we need to let all the nodes
know that a new disk has been added to the array. The
reshape caused by adding the new member only needs to happen on
one node, but the other nodes should know about the change.

A reshape reads data from somewhere (which is already
used by the array) and writes data to an unused region. Obviously, it
is awful if one node is reading data from an address while another
node is writing to the same address. Since we have already
implemented suspending writes in the resyncing area, we can
just broadcast the reading address to the other nodes to avoid the
trouble.

The master node calls reshape_request and then updates the sb
during the reshape period. To avoid the above trouble, we call
resync_info_update to send a RESYNCING message in reshape_request.

From the slave node's point of view, it receives two types of message:
1. RESYNCING message
The slave node adds the address (where the master node is reading data
from) to the suspend list.

2. METADATA_UPDATED message
Once the slave nodes know that reshaping has started on the master node,
it is time to update the reshape position and call start_reshape to
follow the master node's step. After the reshape is done, only the
reshape position needs to be updated, so the majority of the reshaping
work happens on the master node.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 9e753ba9 14-Oct-2018 Shaohua Li <shli@fb.com>

MD: fix invalid stored role for a disk - try2

Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear
hotadd. Let's only fix the role for disks in raid1/10.
Based on Guoqing's original patch.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 059421e0 02-Oct-2018 NeilBrown <neilb@suse.com>

md: allow metadata updates while suspending an array - fix

Commit 35bfc52187f6 ("md: allow metadata update while suspending.")
added support for allowing md_check_recovery() to still perform
metadata updates while the array is entering the 'suspended' state.
This is needed to allow the process of entering the state to
complete.

Unfortunately, the patch doesn't really work. The test for
"mddev->suspended" at the start of md_check_recovery() means that the
function doesn't try to do anything at all while entering suspend.

This patch moves the code of updating the metadata while suspending to
*before* the test on mddev->suspended.

Reported-by: Jeff Mahoney <jeffm@suse.com>
Fixes: 35bfc52187f6 ("md: allow metadata update while suspending.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# d595567d 01-Oct-2018 Shaohua Li <shli@fb.com>

MD: fix invalid stored role for a disk

If we change the number of devices in an array after a device is removed
from the array, and then add the device back to the array, the device is
added in an active role instead of as a spare, which is what we expected.

Please see the below link for details:
https://marc.info/?l=linux-raid&m=153736982015076&w=2

This happens because we prefer to use the device's previous role, which is
recorded in saved_raid_disk, but we should respect the new number of
conf->raid_disks since it could have changed after the device was removed.

Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# e64e4018 01-Aug-2018 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

md: Avoid namespace collision with bitmap API

The bitmap API (include/linux/bitmap.h) uses the 'bitmap' prefix for its
methods.

The MD bitmap API, on the other hand, is a special case.
Add an 'md' prefix to it to avoid a namespace collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>


# 3ed122e6 24-Jul-2018 Christoph Hellwig <hch@lst.de>

md: remove a bogus comment

The function name mentioned doesn't exist, and the code next to it
doesn't match the description either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ddcf35d3 18-Jul-2018 Michael Callahan <michaelcallahan@fb.com>

block: Add and use op_stat_group() for indexing disk_stat fields.

Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should get updated.

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function. They are now
indexed by op_is_write().

tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 59767fbd 18-Jul-2018 Michael Callahan <michaelcallahan@fb.com>

block: Add part_stat_read_accum to read across field entries.

Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries, for example to sum up the number of read and write
sectors completed. In addition to being a reasonable cleanup by
itself, this will make it easier to add new stat fields in the future.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0357ba27 02-Jul-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster: show array's status more accurate

When resync or recovery is happening on one node,
the other nodes currently don't show the appropriate info.

For example, when an array is created on the master node
without "--assume-clean" and then assembled on the slave
nodes, reading /proc/mdstat on the slave nodes shows
"resync=PENDING". However, that info is confusing since
the "PENDING" status was introduced for starting an array
in read-only mode.

We introduce the RESYNCING_REMOTE flag to indicate that
the resync thread is running on a remote node. The flag
is set when a node receives a RESYNCING msg, and we clear
the REMOTE flag in the following cases:

1. resync or recovery is finished on the master node,
which means the slaves receive a msg with both lo
and hi set to 0.
2. the node continues resync/recovery in recover_bitmaps.
3. when resync_finish is called.

Then we show accurate information in status_resync
by checking the REMOTE flag along with other conditions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# bfc9dfdc 13-Jun-2018 Shaohua Li <shli@fb.com>

MD: cleanup resources in failure

We need to destroy the memory pool on failure.

Signed-off-by: Shaohua Li <shli@fb.com>


# 28dec870 07-Jun-2018 Kent Overstreet <kent.overstreet@gmail.com>

md: Unify mddev destruction paths

Previously, mddev_put() had a couple different paths for freeing a
mddev, due to the fact that the kobject wasn't initialized when the
mddev was first allocated. If we move the kobject_init() to when it's
first allocated and just use kobject_add() later, we can clean all this
up.

This also removes a hack in mddev_put() to avoid freeing biosets under a
spinlock, which involved copying biosets on the stack after the recent
bioset_init() changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# afeee514 20-May-2018 Kent Overstreet <kent.overstreet@gmail.com>

md: convert to bioset_init()/mempool_init()

Convert md to embedded bio sets.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a409b4f 20-May-2018 Xiao Ni <xni@redhat.com>

MD: fix lock contention for flush bios

There is lock contention when many processes send flush bios to an md
device, e.g. when creating many LVs on one raid device and running
mkfs.xfs on each LV.

Currently flush requests can only be handled sequentially. Each request
has to take mddev->lock and wait for mddev->flush_bio to become NULL.

This patch removes mddev->flush_bio and handles flush bios asynchronously.
I ran a test with the command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
Operation                   Count    AvgLat     MaxLat
------------------------------------------------------
Flush                       11165   167.595   5879.560
Close                      107469     1.391   2231.094
LockX                         384     0.003      0.019
Rename                       5944     2.141   1856.001
ReadX                      208121     0.003      0.074
WriteX                      98259  1925.402  15204.895
Unlink                      25198    13.264   3457.268
UnlockX                       384     0.001      0.009
FIND_FIRST                  47111     0.012      0.076
SET_FILE_INFORMATION        12966     0.007      0.065
QUERY_FILE_INFORMATION      27921     0.004      0.085
QUERY_PATH_INFORMATION     124650     0.005      5.766
QUERY_FS_INFORMATION        22519     0.003      0.053
NTCreateX                  141086     4.291   2502.812

Throughput 3.7181 MB/sec (sync open) 128 clients 128 procs max_latency=15204.905 ms

=================With the patch============================
Operation                   Count    AvgLat     MaxLat
------------------------------------------------------
Flush                        4500   174.134    406.398
Close                       48195     0.060    467.062
LockX                         256     0.003      0.029
Rename                       2324     0.026      0.360
ReadX                       78846     0.004      0.504
WriteX                      66832   562.775   1467.037
Unlink                       5516     3.665   1141.740
UnlockX                       256     0.002      0.019
FIND_FIRST                  16428     0.015      0.313
SET_FILE_INFORMATION         6400     0.009      0.520
QUERY_FILE_INFORMATION      17865     0.003      0.089
QUERY_PATH_INFORMATION      47060     0.078    416.299
QUERY_FS_INFORMATION         7024     0.004      0.032
NTCreateX                   55921     0.854   1141.452

Throughput 11.744 MB/sec (sync open) 128 clients 128 procs max_latency=1467.041 ms

The test was done on a raid1 array with two rotational disks.

V5: V4 is more complicated than the version with the memory pool, so revert
to the memory pool version.

V4: use the address of fbio as a hash to choose a free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 memory is allocated when the raid
device is created, and a simple bitmap records which resource is free.

Fix a bug from v2. It should set flush_pending to 1 at first.

V2:
Neil pointed out two problems. One is a counting error and the other is the
return value when memory allocation fails.
1. counting error problem
This isn't safe. It is only safe to call rdev_dec_pending() on rdevs
that you previously called
atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.

Now it doesn't use bio_chain. It uses a dedicated callback function for each
flush bio.
2. Returning an IO error when kmalloc fails is wrong.
I use the mempool suggested by Neil in V2.
3. Fixed some places pointed out by Guoqing.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# c42a0e26 04-May-2018 Yufen Yu <yuyufen@huawei.com>

md: fix NULL dereference of mddev->pers in remove_and_add_spares()

We hit a NULL pointer BUG as follows:

[ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
[ 151.762039] Oops: 0000 [#1] SMP PTI
[ 151.762406] Modules linked in:
[ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
[ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
[ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
[ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
[ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
[ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
[ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
[ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
[ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
[ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
[ 151.771272] Call Trace:
[ 151.771542] md_ioctl+0x1df2/0x1e10
[ 151.771906] ? __switch_to+0x129/0x440
[ 151.772295] ? __schedule+0x244/0x850
[ 151.772672] blkdev_ioctl+0x4bd/0x970
[ 151.773048] block_ioctl+0x39/0x40
[ 151.773402] do_vfs_ioctl+0xa4/0x610
[ 151.773770] ? dput.part.23+0x87/0x100
[ 151.774151] ksys_ioctl+0x70/0x80
[ 151.774493] __x64_sys_ioctl+0x16/0x20
[ 151.774877] do_syscall_64+0x5b/0x180
[ 151.775258] entry_SYSCALL_64_after_hwframe+0x44/0xa9

For raid6, when two disks of the array are offline, two spare disks can
be added into the array. If the system reboots before recovery of the
spare disks completes, mdadm thinks it is ok to restart the degraded
array via md_ioctl(). Since the disks in raid6 are not only_parity(),
raid5_run() will abort when neither the PPL feature nor the
'start_dirty_degraded' parameter is in use. Therefore, mddev->pers is NULL.

But mddev->raid_disks has been set and it will not be cleared when
raid5_run() aborts. mdadm can then issue the 'HOT_REMOVE_DISK' command
through md_ioctl() to remove a disk, which finally causes a NULL pointer
dereference in remove_and_add_spares().

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 011abdc9 25-Apr-2018 NeilBrown <neilb@suse.com>

md: fix two problems with setting the "re-add" device state.

If "re-add" is written to the "state" file for a device
which is faulty, this has an effect similar to removing
and re-adding the device. It should take up the
same slot in the array that it previously had, and
an accelerated (e.g. bitmap-based) rebuild should happen.

The slot that "it previously had" is determined by
rdev->saved_raid_disk.
However this is not set when a device fails (only when a device
is added), and it is cleared when resync completes.
This means that "re-add" will normally work once, but may not work a
second time.

This patch includes two fixes.
1/ when a device fails, record the ->raid_disk value in
->saved_raid_disk before clearing ->raid_disk
2/ when "re-add" is written to a device for which
->saved_raid_disk is not set, fail.
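
The two fixes can be illustrated with a small user-space model. This is
only a sketch in Python, not the kernel code: the Rdev class and the
helpers mark_faulty() and re_add() are hypothetical names, and -1 is
assumed to mean "no slot".

```python
from dataclasses import dataclass


@dataclass
class Rdev:
    raid_disk: int = -1        # current slot, -1 when not in the array
    saved_raid_disk: int = -1  # slot to reuse on re-add, -1 when unknown
    faulty: bool = False


def mark_faulty(rdev: Rdev) -> None:
    """Fix 1: remember the slot before clearing it on failure."""
    rdev.saved_raid_disk = rdev.raid_disk
    rdev.raid_disk = -1
    rdev.faulty = True


def re_add(rdev: Rdev) -> bool:
    """Fix 2: refuse to re-add when no previous slot was recorded."""
    if rdev.saved_raid_disk < 0:
        return False
    rdev.raid_disk = rdev.saved_raid_disk
    rdev.faulty = False
    return True
```

With the slot recorded at failure time, a second "re-add" of the same
device finds a valid saved slot instead of falling back to a full resync.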

I think this is suitable for stable as it can
cause re-adding a device to be forced to do a full
resync which takes a lot longer and so puts data at
more risk.

Cc: <stable@vger.kernel.org> (v4.1)
Fixes: 97f6cd39da22 ("md-cluster: re-add capabilities")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 0ea9924a 09-Apr-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster: don't update recovery_offset for faulty device

Device could become faulty when clustered array handling
METADATA_UPDATED msg, so we don't need to call read_rdev
for this device.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 8b904b5b 07-Mar-2018 Bart Van Assche <bvanassche@acm.org>

block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()

This patch has been generated as follows:

for verb in set_unlocked clear_unlocked set clear; do
replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
$(git grep -lw queue_flag_${verb} drivers block/bsg*)
done

Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d8115c35 28-Feb-2018 Bart Van Assche <bvanassche@acm.org>

md: Delete gendisk before cleaning up the request queue

Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8876391e 21-Feb-2018 BingJing Chang <bingjingc@synology.com>

md: fix a potential deadlock of raid5/raid10 reshape

There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.

How does the deadlock happen?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
write through critical data into raid5 emulated block device. So it
waits for raid5 kernel thread handling stripes in order to finish its
I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
called first in order to reap the raid5 resync thread. That is,
raid5_finish_reshape() will be called. In this function, it will try
to update conf and call VFS revalidate_disk() to grow the raid5
emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.

The raid5 kernel thread is an emulated block device. It is responsible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.

For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.

2017/12/29:
Thanks to Guoqing Jiang <gqjiang@suse.com>,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since it has already been
called before.

Reported-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# 39772f0a 02-Feb-2018 NeilBrown <neilb@suse.com>

md: only allow remove_and_add_spares when no sync_thread running.

The locking protocols in md assume that a device will
never be removed from an array during resync/recovery/reshape.
When that isn't happening, rcu or reconfig_mutex is needed
to protect an rdev pointer while taking a refcount. When
it is happening, that protection isn't needed.

Unfortunately there are cases where remove_and_add_spares() is
called when recovery might be happening: in state_store(),
slot_store() and hot_remove_disk().
In each case, this is just an optimization, to try to expedite
removal from the personality so the device can be removed from
the array. If resync etc is happening, we just have to wait
for md_check_recovery() to find a suitable time to call
remove_and_add_spares().

This optimization is not essential so it doesn't
matter if it fails.
So change remove_and_add_spares() to abort early if
resync/recovery/reshape is happening, unless it is called
from md_check_recovery() as part of a newly started recovery.
The parameter "this" is only NULL when called from
md_check_recovery() so when it is NULL, there is no need to abort.
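
The early-abort rule can be sketched as follows. This is a simplified
Python model, not the kernel function: the boolean recovery_running
stands in for the resync/recovery/reshape check, and the return value is
an assumption made for illustration.

```python
def remove_and_add_spares(recovery_running: bool, this=None) -> bool:
    """Return True if spares were processed, False if the call aborted early."""
    if this is not None and recovery_running:
        # Called as an optimization from state_store(), slot_store() or
        # hot_remove_disk(): give up, md_check_recovery() will retry later.
        return False
    # this is None only when called from md_check_recovery(), so there is
    # no need to abort; device removal and addition would happen here.
    return True
```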

As this can result in a NULL dereference, the fix is suitable
for -stable.

cc: yuyufen <yuyufen@huawei.com>
Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.")
Cc: stable@ver.kernel.org (v4.8+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# 4b6c1060 02-Feb-2018 Heinz Mauelshagen <heinzm@redhat.com>

md: fix md_write_start() deadlock w/o metadata devices

If no metadata devices are configured on raid1/4/5/6/10
(e.g. via dm-raid), md_write_start() unconditionally waits
for superblocks to be written thus deadlocking.

Fix introduces mddev->has_superblocks bool, defines it in md_run()
and checks for it in md_write_start() to conditionally avoid waiting.

While at it, also check for non-existing superblocks in md_super_write().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647
Fixes: cc27b0c78c796 ("md: fix deadlock between mddev_suspend() and md_write_start()")

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# b126194c 23-Jan-2018 Xiao Ni <xni@redhat.com>

MD: Free bioset when md_run fails

Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# a9a08845 11-Feb-2018 Linus Torvalds <torvalds@linux-foundation.org>

vfs: do bulk POLL* -> EPOLL* replacement

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But the keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1532d9e8 27-Dec-2017 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

raid5-ppl: PPL support for disks with write-back cache enabled

In order to provide data consistency with PPL for disks with write-back
cache enabled all data has to be flushed to disks before next PPL
entry. The disks to be flushed are marked in the bitmap. It's modified
under a mutex and it's only read after PPL io unit is submitted.

A limitation of 64 disks in the array has been introduced to keep data
structures and implementation simple. RAID5 arrays with so many disks are
not likely due to high risk of multiple disks failure. Such restriction
should not be a real life limitation.
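
A 64-disk limit lets the set of disks to flush fit in one 64-bit word.
The following Python sketch shows that idea only; mark_for_flush() and
disks_to_flush() are hypothetical helpers, and the mutex mentioned above
is left out.

```python
MAX_DISKS = 64  # limit stated in the commit message


def mark_for_flush(bitmap: int, disk: int) -> int:
    """Return the bitmap with the bit for the given disk set."""
    if not 0 <= disk < MAX_DISKS:
        raise ValueError("disk index outside the 64-disk limit")
    return bitmap | (1 << disk)


def disks_to_flush(bitmap: int) -> list:
    """List the disk indexes whose bit is set in the bitmap."""
    return [d for d in range(MAX_DISKS) if bitmap & (1 << d)]
```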

With write-back cache disabled next PPL entry is submitted when data write
for current one completes. Data flush defers next log submission so trigger
it when there are no stripes for handling found.

As PPL assures all data is flushed to disk at request completion, just
acknowledge flush request when PPL is enabled.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# d5d885fd 19-Nov-2017 Song Liu <songliubraving@fb.com>

md: introduce new personality funciton start()

In do_md_run(), md threads should not wake up until the array is fully
initialized in md_run(). However, in raid5_run(), raid5-cache may wake
up mddev->thread to flush stripes that need to be written back. This
design doesn't break badly right now. But it could lead to a bad bug in
the future.

This patch tries to resolve this problem by splitting start up work
into two personality functions, run() and start(). Tasks that do not
require the md threads should go into run(), while tasks that require
the md threads go into start().
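
The ordering constraint behind the split can be modelled in a few lines.
This Python sketch is illustrative only: the Array class and its
threads_running flag are assumptions, and do_md_run() here merely mimics
the call order described above.

```python
class Array:
    """Hypothetical model of an md array during start-up."""

    def __init__(self):
        self.threads_running = False
        self.steps = []

    def run(self):
        # Phase 1: set-up that must not rely on the md threads.
        if self.threads_running:
            raise RuntimeError("run() called after threads were started")
        self.steps.append("run")

    def start(self):
        # Phase 2: work that needs the md threads, e.g. loading the raid5 log.
        if not self.threads_running:
            raise RuntimeError("start() called before threads were started")
        self.steps.append("start")


def do_md_run(array: Array) -> list:
    array.run()
    array.threads_running = True  # stands in for waking the md threads
    array.start()
    return array.steps
```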

r5l_load_log() is moved to raid5_start(), so it is not called until
the md threads are started in do_md_run().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# d2e2ec82 30-Nov-2017 Nate Dailey <nate.dailey@stratus.com>

md: limit mdstat resync progress to max_sectors

There is a small window near the end of md_do_sync where mddev->curr_resync
can be equal to MaxSector.

If status_resync is called during this window, the resulting /proc/mdstat
output contains a HUGE number of = signs due to the very large curr_resync:

Personalities : [raid1]
md123 : active raid1 sdd3[2] sdb3[0]
204736 blocks super 1.0 [2/1] [U_]
[=====================================================================
... (82 MB more) ...
================>] recovery =429496729.3% (9223372036854775807/204736)
finish=0.2min speed=12796K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk

Modify status_resync to ensure the resync variable doesn't exceed
the array's max_sectors.
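
The effect of the clamp can be sketched in user space. This is a Python
model, not the kernel's status_resync(): clamp_resync() and
progress_bar() are hypothetical names, the 20-character bar width is an
assumption, and MAX_SECTOR stands in for the kernel's MaxSector.

```python
MAX_SECTOR = 2**64 - 1  # stand-in for the kernel's MaxSector sentinel


def clamp_resync(curr_resync: int, max_sectors: int) -> int:
    """Limit the reported resync position to the array size."""
    return min(curr_resync, max_sectors)


def progress_bar(curr_resync: int, max_sectors: int, width: int = 20) -> str:
    """Render a progress bar whose length is bounded by width."""
    done = clamp_resync(curr_resync, max_sectors)
    filled = done * width // max_sectors
    return "[" + "=" * filled + ">" + "." * (width - filled) + "]"
```

Without the clamp, filled would scale with curr_resync and the bar would
grow without bound, as in the output quoted above.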

Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# afc9a42b 03-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

the rest of drivers/*: annotate ->poll() instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8376d3c1 16-Oct-2017 Kees Cook <keescook@chromium.org>

md: Convert timers to use timer_setup()

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: linux-bcache@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0202ce8a 08-Nov-2017 Zdenek Kabelac <zkabelac@redhat.com>

md: release allocated bitset sync_set

Patch fixes kmemleak on md_stop() path used likely only by dm-raid wrapper.
Code of md is using mddev_put() where both bitsets are released however this
freeing is not shared.

Also set NULL to bio_set and sync_set pointers just like mddev_put is
doing.

Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# db0505d3 16-Oct-2017 NeilBrown <neilb@suse.com>

md: be cautious about using ->curr_resync_completed for ->recovery_offset

The ->recovery_offset shows how much of a non-InSync device is actually
in sync - how much has been recovered.

When performing a recovery, ->curr_resync and ->curr_resync_completed
follow the device address being recovered and so can be used to update
->recovery_offset.

When performing a reshape, ->curr_resync* might follow the device
addresses (raid5) or might follow array addresses (raid10), so cannot
in general be used to set ->recovery_offset. When reshaping backwards,
->curr_resync* measures from the *end* of the array-or-device, so is
particularly unhelpful.

So change the common code in md.c to only use ->curr_resync_completed
for the simple recovery case, and add code to raid5.c to update
->recovery_offset during a forwards reshape.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# b90f6ff0 26-Oct-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: don't check MD_SB_CHANGE_CLEAN in md_allow_write

Only MD_SB_CHANGE_PENDING should be used to wait for transition from
clean to dirty. Checking also MD_SB_CHANGE_CLEAN is unnecessary and can
race with e.g. md_do_sync(). This sporadically causes a hang when
changing consistency policy during resync:

INFO: task mdadm:6183 blocked for more than 30 seconds.
Not tainted 4.14.0-rc3+ #391
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdadm D12752 6183 6022 0x00000000
Call Trace:
__schedule+0x93f/0x990
schedule+0x6b/0x90
md_allow_write+0x100/0x130 [md_mod]
? do_wait_intr_irq+0x90/0x90
resize_stripes+0x3a/0x5b0 [raid456]
? kernfs_fop_write+0xbe/0x180
raid5_change_consistency_policy+0xa6/0x200 [raid456]
consistency_policy_store+0x2e/0x70 [md_mod]
md_attr_store+0x90/0xc0 [md_mod]
sysfs_kf_write+0x42/0x50
kernfs_fop_write+0x119/0x180
__vfs_write+0x28/0x110
? rcu_sync_lockdep_assert+0x12/0x60
? __sb_start_write+0x15a/0x1c0
? vfs_write+0xa3/0x1a0
vfs_write+0xb4/0x1a0
SyS_write+0x49/0xa0
entry_SYSCALL_64_fastpath+0x18/0xad

Fixes: 2214c260c72b ("md: don't return -EAGAIN in md_allow_write for external metadata arrays")
Cc: <stable@vger.kernel.org>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# efa4b77b 18-Oct-2017 Shaohua Li <shli@fb.com>

md: use lockdep_assert_held

lockdep_assert_held is a better way to assert lock held, and it works
for UP.

Signed-off-by: Shaohua Li <shli@fb.com>


# b03e0ccb 18-Oct-2017 NeilBrown <neilb@suse.com>

md: remove special meaning of ->quiesce(.., 2)

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
There are still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 35bfc521 16-Oct-2017 NeilBrown <neilb@suse.com>

md: allow metadata update while suspending.

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write requests block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held. So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says unlocked updates are
allowed, and one which says it is happening. Then allow it
while the quiesce completes, and then wait for it to finish.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 9e1cc0a5 16-Oct-2017 NeilBrown <neilb@suse.com>

md: use mddev_suspend/resume instead of ->quiesce()

mddev_suspend() is a more general interface than
calling ->quiesce() and so is more extensible. A
future patch will make use of this.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# b3143b9a 16-Oct-2017 NeilBrown <neilb@suse.com>

md: move suspend_hi/lo handling into core md code

responding to ->suspend_lo and ->suspend_hi is similar
to responding to ->suspended. It is best to wait in
the common core code without incrementing ->active_io.
This allows mddev_suspend()/mddev_resume() to work while
requests are waiting for suspend_lo/hi to change.
This will be important after a subsequent patch
which uses mddev_suspend() to synchronize updating for
suspend_lo/hi.

So move the code for testing suspend_lo/hi out of raid1.c
and raid5.c, and place it in md.c

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 52a0d49d 16-Oct-2017 NeilBrown <neilb@suse.com>

md: don't call bitmap_create() while array is quiesced.

bitmap_create() allocates memory with GFP_KERNEL and
so can wait for IO.
If called while the array is quiesced, it could wait indefinitely
for write out to the array - deadlock.
So call bitmap_create() before quiescing the array.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4d5324f7 18-Oct-2017 NeilBrown <neilb@suse.com>

md: always hold reconfig_mutex when calling mddev_suspend()

Most often mddev_suspend() is called with
reconfig_mutex held. Make this a requirement in
preparation for a subsequent patch. Also require
reconfig_mutex to be held for mddev_resume(),
partly for symmetry and partly to guarantee
no races with incr/decr of mddev->suspend.

Taking the mutex in r5c_disable_writeback_async() is
a little tricky as this is called from a work queue
via log->disable_writeback_work, and flush_work()
is called on that while holding ->reconfig_mutex.
If the work item hasn't run before flush_work()
is called, the work function will not be able to
get the mutex.

So we use mddev_trylock() inside the wait_event() call, and have that
abort when conf->log is set to NULL, which happens before
flush_work() is called.
We wait in mddev->sb_wait and ensure this is woken
when any of the conditions change. This requires
waking mddev->sb_wait in mddev_unlock(). This is only
likely to trigger extra wake_ups of threads that needn't
be woken when metadata is being written, and that
doesn't happen often enough that the cost would be
noticeable.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 230b55fa 16-Oct-2017 NeilBrown <neilb@suse.com>

md: forbid a RAID5 from having both a bitmap and a journal.

Having both a bitmap and a journal is pointless.
Attempting to do so can corrupt the bitmap if the journal
replay happens before the bitmap is initialized.
Rather than try to avoid this corruption, simply
refuse to allow arrays with both a bitmap and a journal.
So:
- if raid5_run sees both are present, fail.
- if adding a bitmap finds a journal is present, fail
- if adding a journal finds a bitmap is present, fail.

Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: NeilBrown <neilb@suse.com>
Tested-by: Joshua Kinard <kumba@gentoo.org>
Acked-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Shaohua Li <shli@fb.com>


# e4dca7b7 17-Oct-2017 Kees Cook <keescook@chromium.org>

treewide: Fix function prototypes for module_param_call()

Several function prototypes for the set/get functions defined by
module_param_call() have slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:

@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@

module_param_call(_name, _set_func, _get_func, _arg, _mode);

@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@

int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }

@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@

int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }

Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:

drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>


# 6aa7de05 23-Oct-2017 Mark Rutland <mark.rutland@arm.com>

locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()

Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 935fe098 10-Oct-2017 Mike Snitzer <snitzer@redhat.com>

md: rename some drivers/md/ files to have an "md-" prefix

Motivated by the desire to eliminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list. Which is born out of the fact that DM files also
reside in drivers/md/

Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.

Shaohua: don't change module name

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# d1d90147 08-Oct-2017 Guoqing Jiang <gqjiang@suse.com>

md: always set THREAD_WAKEUP and wake up wqueue if thread existed

Since commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"),
the wait queue is only woken up if THREAD_WAKEUP was not previously set.

With above change, I can see process_metadata_update could always hang on
the wait queue, because mddev->thread could stay on 'D' status and the
THREAD_WAKEUP flag is not cleared since there are lots of places to wake up
mddev->thread. Then deadlock happened as follows:

linux175:~ # ps aux|grep md|grep D
root 20117 0.0 0.0 0 0 ? D 03:45 0:00 [md0_raid1]
root 20125 0.0 0.0 0 0 ? D 03:45 0:00 [md0_cluster_rec]
linux175:~ # cat /proc/20117/stack
[<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster]
[<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster]
[<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster]
[<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod]
[<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod]
[<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod]
[<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140
linux175:~ # cat /proc/20125/stack
[<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140

So let's revert the part of code in the commit to resolve the problem since
we can't get lots of benefits of previous change.

Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# d47c8ad2 04-Oct-2017 NeilBrown <neilb@suse.com>

md: fix deadlock error in recent patch.

A recent patch aimed to cause md_write_start() to fail (rather than
block) when the mddev was suspending, so as to avoid deadlocks.
Unfortunately the test in wait_event() was wrong, and it didn't change
behaviour at all.

The wait_event() must wait until the metadata is written OR the array is
suspending.

Fixes: cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()")
Cc: stable@vger.kernel.org
Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 79bf31a3 21-Sep-2017 Shaohua Li <shli@fb.com>

md: fix a race condition for flush request handling

md_submit_flush_data calls pers->make_request, which missed the suspend check.
Fix it with the new md_handle_request API.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 393debc2 21-Sep-2017 Shaohua Li <shli@fb.com>

md: separate request handling

With commit cc27b0c78c79, pers->make_request could bail out without handling
the bio. If that happens, we should retry. The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ddc08823 16-Aug-2017 Pawel Baldysiak <pawel.baldysiak@intel.com>

md: Runtime support for multiple ppls

Increase PPL area to 1MB and use it as circular buffer to store PPL. The
entry with highest generation number is the latest one. If PPL to be
written is larger than the space left in the buffer, rewind the buffer to the
start (don't wrap it).
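
The rewind rule and the choice of the latest entry can be sketched as
follows. This is a Python model with hypothetical helper names, not the
raid5-ppl code; entries are represented as (generation, offset) pairs
for illustration.

```python
PPL_AREA_SIZE = 1 << 20  # 1 MB, per the commit message


def next_write_offset(offset: int, entry_size: int,
                      area_size: int = PPL_AREA_SIZE) -> int:
    """Return where an entry of entry_size bytes should be written."""
    if entry_size > area_size:
        raise ValueError("entry larger than the PPL area")
    if entry_size > area_size - offset:
        return 0  # rewind: an entry is never split across the end
    return offset


def latest_entry(entries):
    """Pick the newest entry, i.e. the one with the highest generation."""
    return max(entries, key=lambda e: e[0])
```

Rewinding instead of wrapping keeps every entry contiguous, so a reader
only needs the generation numbers to find the most recent one.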

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 26e13043 29-Jul-2017 Cihangir Akturk <cakturk@gmail.com>

md: replace seq_release_private with seq_release

Since commit f15146380d28 ("fs: seq_file - add event counter to simplify
poll() support"), md.c code has no longer used the private field of
struct seq_file, but seq_release_private() has continued to be
used to release the allocated seq_file. This seems to have been
forgotten. So convert it to use seq_release() instead of
seq_release_private().

Signed-off-by: Cihangir Akturk <cakturk@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 5492c46e 28-Jul-2017 Alexey Obitotskiy <aleksey.obitotskiy@intel.com>

md: notify about new spare disk in the container

In case of external metadata arrays spare disks are added to containers
first. mdadm keeps monitoring /proc/mdstat output and when spare disk is
available, it moves it from the container to the array. The problem is
there is no notification of new spare disk in the container and mdadm
waits a long time (until timeout) before it takes the action.

Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 74d46992 23-Aug-2017 Christoph Hellwig <hch@lst.de>

block: replace bi_bdev with a gendisk pointer and partitions index

This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# afc1f55c 11-Aug-2017 Shaohua Li <shli@fb.com>

MD: not clear ->safemode for external metadata array

->safemode should be triggered by mdadm for external metadata arrays, otherwise
the array's state confuses mdadm.

Fixes: 33182d15c6bf(md: always clear ->safemode when md_check_recovery gets the mddev lock.)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 81fe48e9 08-Aug-2017 NeilBrown <neilb@suse.com>

md: fix test in md_write_start()

md_write_start() needs to clear the in_sync flag if it is set, or if
there might be a race with set_in_sync() such that the latter will
set it very soon. In the latter case it is sufficient to take the
spinlock to synchronize with set_in_sync(), and then set the flag
if needed.

The current test is incorrect.
It should be:
if "flag is set" or "race is possible"

"flag is set" is trivially "mddev->in_sync".
"race is possible" should be tested by "mddev->sync_checkers".

If sync_checkers is 0, then there can be no race. set_in_sync() will
wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period,
and as md_write_start() holds the rcu_read_lock(), set_in_sync() will
be sure to see the update to writes_pending.

If sync_checkers is > 0, there could be race. If md_write_start()
happened entirely between
if (!mddev->in_sync &&
percpu_ref_is_zero(&mddev->writes_pending)) {
and
mddev->in_sync = 1;
in set_in_sync(), then it would not see that in_sync had been set,
and set_in_sync() would not see that writes_pending had been
incremented.
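
The corrected test can be written down as a small predicate. This is a
sketch only: struct mddev_model and write_start_needs_lock() are
hypothetical stand-ins for the two fields named above, not the md.c code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two mddev fields named in the text. */
struct mddev_model {
	int in_sync;       /* non-zero: array is currently marked in-sync */
	int sync_checkers; /* number of threads currently in set_in_sync() */
};

/*
 * True when md_write_start() has to take the spinlock: either the flag is
 * already set, or a concurrent set_in_sync() might be about to set it.
 */
bool write_start_needs_lock(const struct mddev_model *m)
{
	assert(m->sync_checkers >= 0);
	return m->in_sync || m->sync_checkers;
}
```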

This bug means that in_sync is sometimes not set when it should be.
Consequently there is a small chance that the array will be marked as
"clean" when in fact it is inconsistent.

Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 33182d15 08-Aug-2017 NeilBrown <neilb@suse.com>

md: always clear ->safemode when md_check_recovery gets the mddev lock.

If ->safemode == 1, md_check_recovery() will try to get the mddev lock
and perform various other checks.
If mddev->in_sync is zero, it will call set_in_sync, and clear
->safemode. However if mddev->in_sync is not zero, ->safemode will not
be cleared.

When md_check_recovery() drops the mddev lock, the thread is woken
up again. Normally it would just check if there was anything else to
do, find nothing, and go to sleep. However as ->safemode was not
cleared, it will take the mddev lock again, then wake itself up
when unlocking.

This results in an infinite loop, repeatedly calling
md_check_recovery(), which RCU or the soft-lockup detector
will eventually complain about.

Prior to commit 4ad23a976413 ("MD: use per-cpu counter for
writes_pending"), safemode would only be set to one when the
writes_pending counter reached zero, and would be cleared again
when writes_pending is incremented. Since that patch, safemode
is set more freely, but is not reliably cleared.

So in md_check_recovery() clear ->safemode before checking ->in_sync.

Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (4.12+)
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reported-by: David R <david@unsolicited.net>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ed9b66d2 25-Jul-2017 Shaohua Li <shli@fb.com>

MD: fix warning for UP case

spin_is_locked() always returns 0 in the UP case, so ignore it.

Reported-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Shaohua Li <shli@fb.com>


# 7184ef8b 03-Jul-2017 Shaohua Li <shli@fb.com>

MD: fix sleep in atomic

bioset_free() will take a mutex, so it can't be called with a spinlock held.

Fix: 5a85071c2cbc(md: use a separate bio_set for synchronous IO.)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 7f053a6a 23-Jun-2017 Shaohua Li <shli@fb.com>

MD: fix a null dereference

rdev->mddev could be NULL at start time.

Reported-by: Ming Lei <ming.lei@redhat.com>
Fix: 5a85071c2cbc(md: use a separate bio_set for synchronous IO.)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 5a85071c 20-Jun-2017 NeilBrown <neilb@suse.com>

md: use a separate bio_set for synchronous IO.

md devices allocate a bio_set and use it for two
distinct purposes.
mddev->bio_set is used to clone bios as part of sending
upper level requests down to lower level devices,
and it is also used for synchronous IO such as superblock
and bitmap updates, and for correcting read errors.

This multiple usage can lead to deadlocks. It is likely
that cloned bios might be queued for write and to be
waiting for a metadata update before the write can be permitted.
If the cloning exhausted mddev->bio_set, the metadata update
may not be able to proceed.

This scenario has been seen during heavy testing, with lots of IO and
lots of memory pressure.

Address this by adding a new bio_set specifically for synchronous IO.
All synchronous IO goes directly to the underlying device and is not
queued at the md level, so requests using entries from the new
mddev->sync_set will complete in a timely fashion.
Requests that use mddev->bio_set will sometimes need to wait
for synchronous IO, but will no longer risk deadlocking that IO.

Also: small simplification in mddev_put(): there is no need to
wait until the spinlock is released before calling bioset_free().

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 9b10f6a9 17-Jun-2017 NeilBrown <neilb@suse.com>

block: remove bio_clone() and all references.

bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 011067b0 17-Jun-2017 NeilBrown <neilb@suse.com>

blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# af67c31f 17-Jun-2017 NeilBrown <neilb@suse.com>

blk: remove bio_set arg from blk_queue_split()

blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8df72024 11-Jun-2017 Lidong Zhong <lzhong@suse.com>

md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE

The value for a spare spot in sb->dev_roles is changed from
MD_DISK_ROLE_FAULTY to MD_DISK_ROLE_SPARE to stay consistent
with the value used when the superblock is first created in
userspace.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# cc27b0c7 05-Jun-2017 NeilBrown <neilb@suse.com>

md: fix deadlock between mddev_suspend() and md_write_start()

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen when
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().

Reported-by: Nix <nix@esperi.org.uk>
Fix: 68866e425be2(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4e4cbee9 03-Jun-2017 Christoph Hellwig <hch@lst.de>

block: switch bios to blk_status_t

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep around at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a415c0f1 05-Jun-2017 NeilBrown <neilb@suse.com>

md: initialise ->writes_pending in personality modules.

The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it. This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# e6fd2093 04-May-2017 Amir Goldstein <amir73il@gmail.com>

md: namespace private helper names

The md private helper uuid_equal() collides with a generic helper
of the same name.

Rename the md private helper to md_uuid_equal() and do the same for
md_sb_equal().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>


# 5a8948f8 31-May-2017 Jan Kara <jack@suse.cz>

md: Make flush bios explicitly sync

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. However, generic_make_request_checks() strips the REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report a volatile
write cache, so the write effectively becomes asynchronous, which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

CC: linux-raid@vger.kernel.org
CC: Shaohua Li <shli@kernel.org>
Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Shaohua Li <shli@fb.com>


# 2214c260 08-May-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: don't return -EAGAIN in md_allow_write for external metadata arrays

This essentially reverts commit b5470dc5fc18 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 97b20ef7 12-Apr-2017 NeilBrown <neilb@suse.com>

md: handle read-only member devices better.

1/ If an array has any read-only devices when it is started,
the array itself must be read-only
2/ A read-only device cannot be added to an array after it is
started.
3/ Setting an array to read-write should not succeed
if any member devices are read-only
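
The three rules can be expressed as simple predicates. The sketch below is
a user-space model under assumed names (struct member_model and the three
helper functions are hypothetical); it does not reproduce the md.c checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a member device: only the read-only state matters. */
struct member_model {
	bool read_only;
};

static bool any_member_read_only(const struct member_model *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (devs[i].read_only)
			return true;
	return false;
}

/* Rule 1: an array started with any read-only member is itself read-only. */
bool array_starts_read_only(const struct member_model *devs, size_t n)
{
	return any_member_read_only(devs, n);
}

/* Rule 2: a read-only device cannot be added once the array is started. */
bool may_add_member(bool array_started, const struct member_model *dev)
{
	return !(array_started && dev->read_only);
}

/* Rule 3: switching to read-write fails while any member is read-only. */
bool may_set_read_write(const struct member_model *devs, size_t n)
{
	return !any_member_read_only(devs, n);
}
```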

Reported-and-Tested-by: Nanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 78b6350d 12-Apr-2017 NeilBrown <neilb@suse.com>

md: support disabling of create-on-open semantics.

md allows a new array device to be created by simply
opening a device file. This makes it difficult to
remove the device and udev is likely to open the device file
as part of processing the REMOVE event.

There is an alternate mechanism for creating arrays
by writing to the new_array module parameter.
When using tools that work with this parameter, it is
best to disable the old semantics.
This new module parameter allows that.

Signed-off-by: NeilBrown <neilb@suse.com>
Acted-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>


# 039b7225 12-Apr-2017 NeilBrown <neilb@suse.com>

md: allow creation of mdNNN arrays via md_mod/parameters/new_array

The intention when creating the "new_array" parameter and the
possibility of having array names like "md_HOME" was to transition
away from the old way of creating arrays and to eventually only use
this new way.

The "old" way of creating an array is to create a device node in /dev
and then open it. The act of opening creates the array.
This is problematic because sometimes the device node can be opened
when we don't want to create an array. This can easily happen
when some rule triggered by udev looks at a device as it is being
destroyed. The node in /dev continues to exist for a short period
after an array is stopped, and opening it during this time recreates
the array (as an inactive array).

Unfortunately no clear plan for the transition was created. It is now
time to fix that.

This patch allows devices with numeric names, like "md999" to be
created by writing to "new_array". This will only work if the minor
number given is not already in use. This will allow mdadm to
support the creation of arrays with numbers > 511 (currently not
possible) by writing to new_array.
mdadm can, at some point, use this approach to create *all* arrays,
which will allow the transition to only using the new-way.
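
As an illustration of what accepting a numeric name involves, the sketch
below parses a name of the form "mdNNN" into a minor number in user space.
parse_numeric_md_name() and MODEL_MAX_MINOR are hypothetical names, the
20-bit minor limit is an assumption of this sketch, and the check that the
minor is not already in use is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Assumed upper bound for a minor number in this sketch (20 bits). */
#define MODEL_MAX_MINOR ((1UL << 20) - 1)

/*
 * Parse "md" followed by a decimal number, e.g. "md999".
 * Returns 0 and stores the number in *minor on success, -1 otherwise.
 * Names such as "md_HOME" are rejected here; they follow the other path.
 */
int parse_numeric_md_name(const char *name, unsigned long *minor)
{
	char *end;
	unsigned long val;

	assert(name != NULL && minor != NULL);

	if (strncmp(name, "md", 2) != 0 || !isdigit((unsigned char)name[2]))
		return -1;

	errno = 0;
	val = strtoul(name + 2, &end, 10);
	if (errno != 0 || *end != '\0' || val > MODEL_MAX_MINOR)
		return -1;

	*minor = val;
	return 0;
}
```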

Signed-off-by: NeilBrown <neilb@suse.com>
Acted-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>


# b670883b 10-Apr-2017 Zhilong Liu <zlliu@suse.com>

md.c:didn't unlock the mddev before return EINVAL in array_size_store

md.c: it needs to release the mddev lock before
the array_size_store() returns.

Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size are not supported")

Signed-off-by: Zhilong Liu <zlliu@suse.com>
Reviewed-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 065e519e 05-Apr-2017 NeilBrown <neilb@suse.com>

md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop

If md_set_readonly() is called and sets the MD_CLOSING bit, the mddev cannot
be opened any more because the MD_CLOSING bit is never cleared. Thus it
needs to be cleared in md_ioctl() after any call to md_set_readonly()
or do_md_stop().

Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
Cc: stable@vger.kernel.org (v4.9+)
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4ad23a97 14-Mar-2017 NeilBrown <neilb@suse.com>

MD: use per-cpu counter for writes_pending

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.

So we replace the atomic counter with a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checkers" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensures we only perform an atomic operation here when absolutely
needed.
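
The effect of the rounding can be seen in a small sketch. The function
names are hypothetical and plain unsigned long values stand in for jiffies;
the point is only that every call within the same window yields the same
expiry, so re-arming the timer does not change its value.

```c
#include <assert.h>

/* Round x up to the next multiple of step (x itself if already a multiple). */
static unsigned long round_up_to(unsigned long x, unsigned long step)
{
	return ((x + step - 1) / step) * step;
}

/*
 * Expiry for the "mark clean" timer: the current time rounded up to a
 * multiple of the delay, plus one full delay.
 */
unsigned long safemode_expiry(unsigned long now, unsigned long delay)
{
	assert(delay > 0);
	return round_up_to(now, delay) + delay;
}
```

With a delay of 200, calls at times 1001 and 1199 both give 1400, so the
second call would leave the timer untouched.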

md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().
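
A minimal user-space sketch of that optimisation, using C11 atomics in
place of the kernel bit operations. struct thread_model and its functions
are hypothetical, and wake_calls merely counts how often a real wake_up()
would have been issued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a thread with a "wakeup requested" flag. */
struct thread_model {
	atomic_bool wakeup_pending;
	int wake_calls; /* how many times the expensive wake-up ran */
};

void thread_model_init(struct thread_model *t)
{
	assert(t != NULL);
	atomic_init(&t->wakeup_pending, false);
	t->wake_calls = 0;
}

/* Request a wake-up; only the caller that sets the flag pays for it. */
void model_wakeup_thread(struct thread_model *t)
{
	assert(t != NULL);
	if (!atomic_exchange(&t->wakeup_pending, true))
		t->wake_calls++;
}

/* Called by the thread when it starts a new pass: clear the request. */
void model_thread_ran(struct thread_model *t)
{
	assert(t != NULL);
	atomic_store(&t->wakeup_pending, false);
}
```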

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 55cc39f3 14-Mar-2017 NeilBrown <neilb@suse.com>

md: close a race with setting mddev->in_sync

If ->in_sync is being set just as md_write_start() is being called,
it is possible that set_in_sync() won't see the elevated
->writes_pending, and md_write_start() won't see the set ->in_sync.

To close this race, re-test ->writes_pending after setting ->in_sync,
and add memory barriers to ensure the increment of ->writes_pending
will be seen by the time of this second test, or the new ->in_sync
will be seen by md_write_start().
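
The set-then-re-test pattern can be sketched with C11 atomics. This is a
simplified user-space model, not the md.c code: struct sync_model and
model_set_in_sync() are hypothetical, and sequentially consistent atomic
operations stand in for the explicit memory barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the two fields involved in the race. */
struct sync_model {
	atomic_int writes_pending;
	atomic_int in_sync;
};

/*
 * Try to mark the array in-sync. Returns 1 if in_sync is left set,
 * 0 if a pending write was seen before or after setting it.
 */
int model_set_in_sync(struct sync_model *m)
{
	assert(m != NULL);

	if (atomic_load(&m->writes_pending) != 0)
		return 0;

	atomic_store(&m->in_sync, 1);

	/* Second test: a writer may have arrived after the first check. */
	if (atomic_load(&m->writes_pending) != 0) {
		atomic_store(&m->in_sync, 0); /* lost the race, back off */
		return 0;
	}
	return 1;
}
```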

Add a spinlock to array_state_show() to ensure this temporary
instability is never visible from userspace.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 6497709b 14-Mar-2017 NeilBrown <neilb@suse.com>

md: factor out set_in_sync()

Three separate places in md.c check if the number of active
writes is zero and, if so, set mddev->in_sync.

There are a few differences, but there shouldn't be:
- it is always appropriate to notify the change in
sysfs_state, and there is no need to do this outside a
spin-locked region.
- we never need to check ->recovery_cp. The state of resync
is not relevant for whether there are any pending writes
or not (which is what ->in_sync reports).

So create set_in_sync() which does the correct tests and
makes the correct changes, and call this in all three
places.

Any behaviour changes here are minor and cosmetic.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 49728050 14-Mar-2017 NeilBrown <neilb@suse.com>

md/raid5: use md_write_start to count stripes, not bios

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it to count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requests with
raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 48df498d 13-Mar-2017 Guoqing Jiang <gqjiang@suse.com>

md: move bitmap_destroy to the beginning of __md_stop

Since we have switched to handling the METADATA_UPDATED msg
synchronously for md-cluster, process_metadata_update depends
on mddev->thread->wqueue.

With the new change, clustered raid could possibly hang if the
array receives a METADATA_UPDATED msg after it has unregistered
mddev->thread, so we need to stop clustered raid (bitmap_destroy
-> bitmap_free -> md_cluster_stop) before unregistering the
thread (mddev_detach -> md_unregister_thread).

And this change should be safe for non-clustered raid since
all writes are stopped before the destroy. Also in md_run,
we activate the personality (pers->run()) before activating
the bitmap (bitmap_create()). So it is pleasingly symmetric
to stop the bitmap (bitmap_destroy()) before stopping the
personality (__md_stop() calls pers->free()). We achieve this
by moving bitmap_destroy to the beginning of __md_stop.

But we don't want to break the code that waits for behind IO, as
Shaohua mentioned, so introduce bitmap_wait_behind_writes to
hold that code, and call the new function in both mddev_detach and
bitmap_destroy. That way we do not break the original behind IO code
and also fit the new condition well.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ba903a3e 09-Mar-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

raid5-ppl: runtime PPL enabling or disabling

Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.

When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.

Enabling or disabling PPL is performed under a suspended array. The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 664aed04 09-Mar-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: add sysfs entries for PPL

Add 'consistency_policy' attribute for array. It indicates how the array
maintains consistency in case of unexpected shutdown.

Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
and size of the PPL space on the device. They can't be changed for
active members if the array is started and PPL is enabled, so in the
setter functions only basic checks are performed. More checks are done
in ppl_validate_rdev() when starting the log.

These attributes are writable to allow enabling PPL for external
metadata arrays and (later) to enable/disable PPL for a running array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ea0213e0 09-Mar-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: superblock changes for PPL

Include information about PPL location and size into mdp_superblock_1
and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogous to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
to mddev->flags to indicate that PPL is enabled on an array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 818da59f 01-Mar-2017 Guoqing Jiang <gqjiang@suse.com>

md-cluster: add the support for resize

To update the size of a cluster raid, we need to make
sure all nodes can perform the change successfully.
However, it is possible that some of them can't do
it due to a failure (bitmap_resize could fail). So
we need to consider this before we set the
capacity unconditionally, and we use the steps below
to perform a sanity check.

1. A changes the size, then broadcasts a METADATA_UPDATED
msg.
2. B and C receive METADATA_UPDATED and change the size,
except that they do not call set_capacity. sync_size is
not updated if the change failed. They also call
bitmap_update_sb to sync the sb to disk.
3. A checks the other nodes' sync_size. If sync_size has
been updated on all nodes, it sends a CHANGE_CAPACITY
msg, otherwise it sends a msg to revert the previous change.
4. B and C call set_capacity if they receive the CHANGE_CAPACITY
msg, otherwise pers->resize is called to restore
the old value.
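
The decision node A makes in step 3 can be sketched as follows. The enum
and resize_decide() are hypothetical names for this illustration, and the
per-node sync_size values are passed in as a plain array rather than read
from the cluster bitmaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of the check performed by the initiating node. */
enum resize_msg {
	MSG_CHANGE_CAPACITY, /* every node applied the new size */
	MSG_REVERT           /* at least one node failed: undo the change */
};

/*
 * Compare each node's recorded sync_size with the requested size and
 * decide which message the initiating node should send next.
 */
enum resize_msg resize_decide(const uint64_t *sync_size, size_t nodes,
			      uint64_t new_size)
{
	assert(sync_size != NULL && nodes > 0);

	for (size_t i = 0; i < nodes; i++)
		if (sync_size[i] != new_size)
			return MSG_REVERT;
	return MSG_CHANGE_CAPACITY;
}
```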

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 0ba95977 01-Mar-2017 Guoqing Jiang <gqjiang@suse.com>

md-cluster: use sync way to handle METADATA_UPDATED msg

Previously, when a node received a METADATA_UPDATED msg, it just
needed to wake up mddev->thread, and md_reload_sb would be called
eventually.

We took the asynchronous way to avoid a deadlock. The
deadlock could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
-> md_update_sb-metadata_update_start
(want EX on token however token is
got by the sending node)

Since we will support resizing for clustered raid, we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way the METADATA_UPDATED msg is handled.

But we obviously need to avoid the above deadlock with the
sync way. To make this happen, we decided not to hold
reconfig_mutex when calling md_reload_sb: if some other thread
has already taken reconfig_mutex and is waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we are also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove the RELOAD_SB related code since it is no longer valid.
2. mddev is added to md_cluster_info so that we can get mddev inside
lock_token.
3. add a new parameter to lock_token to distinguish whether reconfig_mutex
is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the cases below:
1. set it before unregistering the thread, otherwise a deadlock could
appear when stopping a resyncing array.
This is because md_unregister_thread(&cinfo->recv_thread) is
blocked by recv_daemon -> process_recvd_msg
-> process_metadata_update.
To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
also needs to be set before unregistering the thread.
2. set it in metadata_update_start to fix another deadlock.
a. Node A sends METADATA_UPDATED msg (held Token lock).
b. Node B wants to do resync, and is blocked since it can't
get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
not set since the callchain
(md_do_sync -> sync_request
-> resync_info_update
-> sendmsg
-> lock_comm -> lock_token)
doesn't hold reconfig_mutex.
c. Node B tries to update sb (holding reconfig_mutex), but stops
at wait_event() in metadata_update_start since we have set
MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
d. Then Node B receives METADATA_UPDATED msg from A, of course
recv_daemon is blocked forever.
Since metadata_update_start always calls lock_token with reconfig_mutex held,
we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
lock_token doesn't need to set it twice unless lock_token is invoked from
lock_comm.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 13459213 09-Mar-2017 Jason Yan <yanaijie@huawei.com>

md: fix incorrect use of lexx_to_cpu in does_sb_need_changing

The sb->layout is of type __le32, so we should use le32_to_cpu.
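
For readers unfamiliar with the helpers, the sketch below shows what a
32-bit little-endian decode does, written portably in user space.
model_le32_to_cpu() is a hypothetical stand-in for the kernel macro and
works on raw bytes so that the result does not depend on the host byte
order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a 32-bit little-endian on-disk field into a host-order value.
 * Exactly four bytes are consumed, matching the width of the field.
 */
uint32_t model_le32_to_cpu(const uint8_t bytes[4])
{
	assert(bytes != NULL);
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```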

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 3fb632e4 09-Mar-2017 Jason Yan <yanaijie@huawei.com>

md: fix super_offset endianness in super_1_rdev_size_change

The sb->super_offset should be little-endian, but the rdev->sb_start is in
host byte order, so fix this by adding cpu_to_le64.
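
The conversion in the other direction can be sketched the same way.
model_cpu_to_le64() is a hypothetical user-space stand-in that writes a
host-order value into an 8-byte little-endian field, least significant
byte first, independent of the host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a host-order 64-bit value as eight little-endian bytes. */
void model_cpu_to_le64(uint8_t out[8], uint64_t host_value)
{
	assert(out != NULL);
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(host_value >> (8 * i));
}
```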

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 1b3bae49 28-Feb-2017 NeilBrown <neilb@suse.com>

md: don't impose the MD_SB_DISKS limit on arrays without metadata.

These arrays, created with "mdadm --build" don't benefit from a limit.
The default will be used, which is '0' and is interpreted as "don't
impose a limit".
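
The "zero means unlimited" convention can be sketched as a predicate.
disk_number_allowed() is a hypothetical helper for illustration and is not
the check as written in md.c.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if a device with index desc_nr fits under max_disks.
 * A max_disks of 0 is interpreted as "no limit".
 */
bool disk_number_allowed(int desc_nr, int max_disks)
{
	assert(desc_nr >= 0 && max_disks >= 0);
	return max_disks == 0 || desc_nr < max_disks;
}
```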

Reported-by: ian_bruce@mail.ru
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# c9483634 23-Feb-2017 Guoqing Jiang <gqjiang@suse.com>

md: move funcs from pers->resize to update_size

raid1_resize and raid5_resize should also check
mddev->queue when run underneath dm-raid.

Both set_capacity and revalidate_disk are used in
pers->resize for raid1, raid10 and raid5, so
move them from the personality files to common code.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 99b3d74e 23-Feb-2017 Shaohua Li <shli@fb.com>

md: delete dead code

Nobody is using mddev_check_plugged(), so delete the dead code

Signed-off-by: Shaohua Li <shli@fb.com>


# 3f07c014 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d7a10308 14-Feb-2017 Ming Lei <tom.leiming@gmail.com>

md: fast clone bio in bio_clone_mddev()

Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.

Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ed7ef732 14-Feb-2017 Ming Lei <tom.leiming@gmail.com>

md: remove unnecessary check on mddev

mddev is never NULL and neither is ->bio_set, so
remove the check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 10273170 14-Feb-2017 Ming Lei <tom.leiming@gmail.com>

md: fail if mddev->bio_set can't be created

The current behaviour is to fall back to allocating
bios from 'fs_bio_set', which isn't correct
because it might cause deadlock.

So this patch simply returns failure if mddev->bio_set
can't be created.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 9356863c 05-Feb-2017 NeilBrown <neilb@suse.com>

md: ensure md devices are freed before module is unloaded.

Commit: cbd199837750 ("md: Fix unfortunate interaction with evms")
changed mddev_put() so that it would not destroy an md device while
->ctime was non-zero.

Unfortunately, we didn't make sure to clear ->ctime when unloading
the module, so it is possible for an md device to remain after
module unload. An attempt to open such a device will trigger
an invalid memory reference in:
get_gendisk -> kobj_lookup -> exact_lock -> get_disk

when trying to access disk->fops, which was in the module that has
been removed.

So ensure we clear ->ctime in md_exit(), and explain how that is useful,
as it isn't immediately obvious when looking at the code.

Fixes: cbd199837750 ("md: Fix unfortunate interaction with evms")
Tested-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# dc3b17cc 02-Feb-2017 Jan Kara <jack@suse.cz>

block: Use pointer to backing_dev_info from request_queue

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a85dd7b8 23-Jan-2017 Song Liu <songliubraving@fb.com>

md/r5cache: flush data only stripes in r5l_recovery_log()

For safer operation, all arrays start in write-through mode, which has been
better tested and is more mature. And actually the write-through/write-back mode
isn't persistent after array restarted, so we always start array in
write-through mode. However, if recovery found data-only stripes before the
shutdown (from previous write-back mode), it is not safe to start the array in
write-through mode, as write-through mode can not handle stripes with data in
write-back cache. To solve this problem, we flush all data-only stripes in
r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
empty cache in write-through mode.

This logic is implemented in r5c_recovery_flush_data_only_stripes():

1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes to get flushed (reuse wait_for_quiescent)
5. disable write back cache

The wait in 4 will be woken up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.

It is safe to wake up mddev->thread here because all the resource
required for the thread has been initialized.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 2953079c 08-Dec-2016 Shaohua Li <shli@fb.com>

md: separate flags for superblock changes

The mddev->flags are used for different purposes. There are a lot of
places where we check/change the flags without masking unrelated flags, so
we could check/change unrelated flags. These usages are mostly for
superblock writes, so separate the superblock-related flags. This should
make the code clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 82a301cb 08-Dec-2016 Shaohua Li <shli@fb.com>

md: MD_RECOVERY_NEEDED is set for mddev->recovery

Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device
removal.")

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# e2342ca8 04-Dec-2016 NeilBrown <neilb@suse.com>

md: fix refcount problem on mddev when stopping array.

md_open() gets a counted reference on an mddev using mddev_find().
If it ends up returning an error, it must drop this reference.

There are two error paths where the reference is not dropped.
One only happens if the process is signalled at an awkward time,
which is quite unlikely.
The other was introduced recently in commit af8d8e6f0.

Change the code to ensure we drop the reference when returning an error,
and make it harder to re-introduce this sort of bug in the future.

Reported-by: Marc Smith <marc.smith@mcc.edu>
Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
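
The single-exit pattern described above can be sketched as follows. This
is a simplified userspace model, not the md_open() implementation: struct
sketch_mddev, its fields and the sketch_* helpers are hypothetical, and the
"closing" check merely stands in for whichever condition makes the open
fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced device object: a reference count and one flag. */
struct sketch_mddev {
	int refs;
	bool closing;
};

static void sketch_get(struct sketch_mddev *m)
{
	m->refs++;
}

static void sketch_put(struct sketch_mddev *m)
{
	m->refs--;
}

int sketch_open(struct sketch_mddev *m)
{
	int err = 0;

	assert(m != NULL);
	sketch_get(m);		/* counted reference taken up front */

	if (m->closing) {
		err = -EBUSY;	/* error path: falls through to "out" */
		goto out;
	}
	/* ... further checks would also set err and "goto out" ... */
out:
	/* Every error return passes through here, so the reference
	 * cannot be leaked by a newly added failure path. */
	if (err)
		sketch_put(m);
	return err;
}
```

Funnelling all failures through one label is what makes the bug harder to
re-introduce: a new error path only has to set err and jump.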


# 034e33f5 21-Nov-2016 Shaohua Li <shli@fb.com>

md: stop write should stop journal reclaim

__md_stop_writes currently doesn't stop raid5-cache reclaim thread. It's
possible the reclaim thread is still running and doing write, which
doesn't match what __md_stop_writes should do. The extra ->quiesce()
call should not harm any raid types. For raid5-cache, this will
guarantee we reclaim all caches before we update superblock.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>


# ce1ccd07 21-Nov-2016 Shaohua Li <shli@fb.com>

raid5-cache: suspend reclaim thread instead of shutdown

There is mechanism to suspend a kernel thread. Use it instead of playing
create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>


# 46533ff7 17-Nov-2016 NeilBrown <neilb@suse.com>

md: Use REQ_FAILFAST_* on metadata writes where appropriate

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state. i.e. if marking a device Faulty would cause some
data to be inaccessible, the device status is left as
non-Faulty. This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success. We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 688834e6 17-Nov-2016 NeilBrown <neilb@suse.com>

md/failfast: add failfast flag for md to be used by some personalities.

This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error. The underlying driver will be directed not to retry requests
that result in failures. There is a proviso that the device must not
be marked faulty if that would cause the array as a whole to fail, it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device. No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 504634f6 18-Nov-2016 Shaohua Li <shli@fb.com>

md: add blktrace event for writes to superblock

superblock write is an expensive operation. With raid5-cache, it can be called
regularly. Tracing to help performance debug.

Signed-off-by: Shaohua Li <shli@fb.com>
Cc: NeilBrown <neilb@suse.com>


# 6119e679 08-Nov-2016 NeilBrown <neilb@suse.com>

md: remove md_super_wait() call after bitmap_flush()

bitmap_flush() finishes with bitmap_update_sb(), and that finishes
with write_page(..., 1), so write_page() will wait for all writes
to complete. So there is no point calling md_super_wait()
immediately afterwards.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 060b0689 03-Nov-2016 NeilBrown <neilb@suse.com>

md: perform async updates for metadata where possible.

When adding devices to, or removing device from, an array we need to
update the metadata. However we don't need to do it synchronously as
data integrity doesn't depend on these changes being recorded
instantly. So avoid the synchronous call to md_update_sb and just set
a flag so that the thread will do it.

This can reduce the number of updates performed when lots of devices
are being added or removed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 9d48739e 01-Nov-2016 NeilBrown <neilb@suse.com>

md: change all printk() to pr_err() or pr_warn() etc.

1/ using pr_debug() for a number of messages reduces the noise of
md, but still allows them to be enabled when needed.
2/ try to be consistent in the usage of pr_err() and pr_warn(), and
document the intention
3/ When strings have been split onto multiple lines, rejoin into
a single string.
The cost of having lines > 80 chars is less than the cost of not
being able to easily search for a particular message.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 7f0f0d87 01-Nov-2016 NeilBrown <neilb@suse.com>

md: fix some issues with alloc_disk_sb()

1/ don't print a warning if allocation fails.
page_alloc() does that already.
2/ always check return status for error.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 91a6c4ad 25-Oct-2016 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

md: wake up personality thread after array state update

When raid1/raid10 array fails to write to one of the drives, the request
is added to bio_end_io_list and finished by personality thread. The
thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In
case of external metadata this flag is cleared, however the thread is
not woken up. It causes the request to be blocked for a few seconds (until
another action on the array wakes up the thread) or to get stuck
indefinitely.

Wake up personality thread once MD_CHANGE_PENDING has been cleared.
Moving 'restart_array' call after the flag is cleared is not a solution
because in read-write mode the call doesn't wake up the thread.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# dcbcb486 21-Oct-2016 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

md: don't fail an array if there are unacknowledged bad blocks

If external metadata handler supports bad blocks and unacknowledged bad
blocks are present, don't report disk via sysfs as faulty. Such
situation can be still handled so disk just has to be blocked for a
moment. It makes it consistent with kernel state as corresponding rdev
flag is also not set.

When the disk is being unblocked there are a few cases:
1. Disk has been in blocked and faulty state, it is being unblocked but
it still remains in faulty state. Metadata handler will remove it from
array in the next call.
2. There is no bad block support in external metadata handler and bad
blocks are present - put the disk in blocked and faulty state (see
case 1).
3. There is bad block support in external metadata handler and all bad
blocks are acknowledged - clear all flags, continue.
4. There is bad block support in external metadata handler but there are
still unacknowledged bad blocks - clear all flags, continue. It is fine
to clear Blocked flag because it was probably not set anyway (if it was
it is case 1). BlockedBadBlocks flag can also be cleared because the
request waiting for it will set it again when it finds out that some bad
block is still not acknowledged. Recovery is not necessary but there are
no problems if the flag is set. Sysfs rdev state is still reported as
blocked (due to unacknowledged bad blocks) so metadata handler will
process remaining bad blocks and unblock disk again.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 35b785f7 21-Oct-2016 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

md: add bad block support for external metadata

Add new rdev flag which external metadata handler can use to switch
on/off bad block support. If new bad block is encountered, notify it via
rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
cleared, notify update to rdev 'bad_blocks' sysfs file.

When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.

Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
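
The state_show simplification mentioned above (emit a separator after
every item, then overwrite the last separator with a newline) can be
sketched like this. It is a userspace illustration only:
sketch_state_show() and its parameters are hypothetical, and the comma
separator is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "name1,name2,...\n" into page and returns the string length.
 * Each name is followed by a separator unconditionally, so the loop
 * needs no "is this the first/last item" logic; the final separator is
 * then replaced by the newline. */
size_t sketch_state_show(const char *const *names, size_t n,
			 char *page, size_t size)
{
	size_t len = 0;

	assert(page != NULL && size >= 2);
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(page + len, size - len, "%s,", names[i]);

		if (w < 0 || (size_t)w >= size - len) {
			page[len] = '\0';	/* drop a truncated item */
			break;
		}
		len += (size_t)w;
	}
	if (len > 0) {
		page[len - 1] = '\n';	/* overwrite the last separator */
	} else {
		page[0] = '\n';		/* nothing to show: just a newline */
		page[1] = '\0';
		len = 1;
	}
	return len;
}
```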


# 70fd7614 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block,fs: use REQ_* flags directly

Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1217e1d1 27-Oct-2016 NeilBrown <neilb@suse.com>

md: be careful not to leak internal curr_resync value into metadata. -- (all)

mddev->curr_resync usually records where the current resync is up to,
but during the starting phase it has some "magic" values.

1 - means that the array is trying to start a resync, but has yielded
to another array which shares physical devices, and also needs to
start a resync
2 - means the array is trying to start resync, but has found another
array which shares physical devices and has already started resync.

3 - means that resync has commenced, but it is possible that nothing
has actually been resynced yet.

It is important that this value not be visible to user-space and
particularly that it doesn't get written to the metadata, as the
resync or recovery checkpoint. In part, this is because it may be
slightly higher than the correct value, though this is very rare.
In part, because it is not a multiple of 4K, and some devices only
support 4K aligned accesses.

There are two places where this value is propagated into either
->curr_resync_completed or ->recovery_cp or ->recovery_offset.
These currently avoid the propagation of values 1 and 2, but will
allow 3 to leak through.

Change them to only propagate the value if it is > 3.

As this can cause an array to fail, the patch is suitable for -stable.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Viswesh <viswesh.vichu@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
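
The rule stated above (only propagate curr_resync when it is > 3) can be
sketched as below. This is a reduced userspace model: struct sketch_mddev
and sketch_record_progress() are hypothetical, and only one of the
destination fields named in the message is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced array object. */
struct sketch_mddev {
	uint64_t curr_resync;		/* may hold magic values 1, 2, 3 */
	uint64_t curr_resync_completed;	/* visible checkpoint */
};

/* Copy the resync position into the checkpoint only when it is a real
 * position, i.e. not one of the start-up magic values 1, 2 or 3. */
void sketch_record_progress(struct sketch_mddev *m)
{
	assert(m != NULL);
	if (m->curr_resync > 3)
		m->curr_resync_completed = m->curr_resync;
}
```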


# 16f88949 23-Oct-2016 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

md: report 'write_pending' state when array in sync

If there is a bad block on a disk and there is a recovery performed from
this disk, the same bad block is reported for a new disk. It involves
setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external
metadata this flag is not being cleared as array state is reported as
'clean'. The read request to bad block in RAID5 array gets stuck as it
is waiting for a flag to be cleared - as per commit c3cce6cda162
("md/raid5: ensure device failure recorded before write request
returns.").

The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
clarified in commit 070dc6dd7103 ("md: resolve confusion of
MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in
personality error handlers since and it doesn't fully comply with
initial purpose. It was supposed to notify that write request is about
to start, however now it is also used to request metadata update.
Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has
been set and in_sync has been set to 0 at the same time. Error handlers
just set the flag without modifying in_sync value. Sysfs array state is
a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is
set and in_sync is set to 1. Userspace has no idea it is expected to
take some action.

Swap the order that array state is checked so 'write_pending' is
reported ahead of 'clean' ('write_pending' is a misleading name but it
is too late to rename it now).

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# bb086a89 30-Sep-2016 Shaohua Li <shli@fb.com>

md: set rotational bit

if all disks in an array are non-rotational, set the array
non-rotational.

This only works for array with all disks populated at startup. Support
for disk hotadd/hotremove could be added later if necessary.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>


# 90bcf133 14-Sep-2016 Shaohua Li <shli@fb.com>

md: fix a potential deadlock

lockdep reports a potential deadlock. Fix this by dropping the mutex
before md_import_device

[ 1137.126601] ======================================================
[ 1137.127013] [ INFO: possible circular locking dependency detected ]
[ 1137.127013] 4.8.0-rc4+ #538 Not tainted
[ 1137.127013] -------------------------------------------------------
[ 1137.127013] mdadm/16675 is trying to acquire lock:
[ 1137.127013] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
[ 1137.127013]
but task is already holding lock:
[ 1137.127013] (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50
[ 1137.127013]
which lock already depends on the new lock.

[ 1137.127013]
the existing dependency chain (in reverse order) is:
[ 1137.127013]
-> #1 (detected_devices_mutex){+.+.+.}:
[ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
[ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
[ 1137.127013] [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90
[ 1137.127013] [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0
[ 1137.127013] [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0
[ 1137.127013] [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40
[ 1137.127013] [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30
[ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50
[ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
[ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70
[ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 1137.127013]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[ 1137.127013] [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690
[ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
[ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
[ 1137.127013] [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
[ 1137.127013] [<ffffffff81244307>] blkdev_get+0x227/0x350
[ 1137.127013] [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50
[ 1137.127013] [<ffffffff81a46d65>] lock_rdev+0x35/0x80
[ 1137.127013] [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0
[ 1137.127013] [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50
[ 1137.127013] [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30
[ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50
[ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
[ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70
[ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 1137.127013]
other info that might help us debug this:

[ 1137.127013] Possible unsafe locking scenario:

[ 1137.127013] CPU0 CPU1
[ 1137.127013] ---- ----
[ 1137.127013] lock(detected_devices_mutex);
[ 1137.127013] lock(&bdev->bd_mutex);
[ 1137.127013] lock(detected_devices_mutex);
[ 1137.127013] lock(&bdev->bd_mutex);
[ 1137.127013]
*** DEADLOCK ***

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# c20c33f0 11-Aug-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: clean related infos of cluster

cluster_info and bitmap_info.nodes also need to be
cleared when array is stopped.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# af8d8e6f 11-Aug-2016 Guoqing Jiang <gqjiang@suse.com>

md: changes for MD_STILL_CLOSED flag

When stopping a clustered raid while it is pending on resync,
the MD_STILL_CLOSED flag could be cleared since a udev rule
is triggered to open the mddev. So the array can't
be stopped soon and EBUSY is returned.

mdadm -Ss md-raid-arrays.rules
set MD_STILL_CLOSED md_open()
... ... ... clear MD_STILL_CLOSED
do_md_stop

We make below changes to resolve this issue:

1. rename MD_STILL_CLOSED to MD_CLOSING since it is set
when stopping the array and it means we are stopping the array.
2. let md_open return early if CLOSING is set, so no
other threads will open the array if one thread is trying
to close it.
3. no need to clear the CLOSING bit in md_open because 1 has
ensured the bit is cleared, then we also don't need to
test the CLOSING bit in do_md_stop.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# e566aef1 11-Aug-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: call md_kick_rdev_from_array once ack failed

The new_disk_ack could return failure if WAITING_FOR_NEWDISK
is not set, so we need to kick the dev from array in case
failure happened.

And we failed to check err before calling new_disk_ack; otherwise
we could kick a rdev which isn't in the array. Thanks for the
reminder from Shaohua.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 47a7b0d8 04-Sep-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: make md-cluster also can work when compiled into kernel

The md-cluster is compiled as a module by default.
If it is built into the kernel instead, then
md-cluster does not work.

[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)

Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions")
Cc: stable@vger.kernel.org (v4.1+)
Reported-by: Marc Smith <marc.smith@mcc.edu>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 486b0f7b 19-Aug-2016 Song Liu <songliubraving@fb.com>

r5cache: set MD_JOURNAL_CLEAN correctly

Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.

With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device is present.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# c622ca54 16-Aug-2016 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: don't print the same repeated messages about delayed sync operation

This fixes a long-standing bug that caused a flood of messages like:
"md: delaying data-check of md1 until md2 has finished (they share one
or more physical units)"

It can be reproduced like this:
1. Create at least 3 raid1 arrays on a pair of disks, each on different
partitions.
2. Request a sync operation like 'check' or 'repair' on 2 arrays by
writing to their md/sync_action attribute files. One operation should
start and one should be delayed and a message like the above will be
printed.
3. Issue a write to the third array. Each write will cause 2 copies of
the message to be printed.

This happens when wake_up(&resync_wait) is called, usually by
md_check_recovery(). Then the delayed sync thread again prints the
message and is put to sleep. This patch adds a check in md_do_sync() to
prevent printing this message more than once for the same pair of
devices.

Reported-by: Sven Koehler <sven.koehler@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 207efcd2 11-Aug-2016 Guoqing Jiang <gqjiang@suse.com>

md: remove obsolete ret in md_start_sync

The ret is not needed anymore since we have already
moved resync_start into md_do_sync in commit 41a9a0d.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# b347af81 11-Aug-2016 Song Liu <songliubraving@fb.com>

md: do not count journal as spare in GET_ARRAY_INFO

GET_ARRAY_INFO counts journal as spare (spare_disks), which is not
accurate. This patch fixes this.

Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 1eff9d32 05-Aug-2016 Jens Axboe <axboe@fb.com>

block: rename bio bi_rw to bi_opf

Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokenness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 5d881783 28-Jul-2016 Shaohua Li <shli@fb.com>

MD: fix null pointer dereference

The md device might not have personality (for example, ddf raid array). The
issue is introduced by 8430e7e0af9a15(md: disconnect device from personality
before trying to remove it)

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 573275b5 30-Jun-2016 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

md: add missing sysfs_notify on array_state update

Changeset 6791875e2e53 has added early return from a function so there is no
sysfs notification for 'active' and 'clean' state change.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4cb9da7d 22-Jun-2016 Alexey Obitotskiy <aleksey.obitotskiy@intel.com>

Fix kernel module refcount handling

md loads raidX modules and increments module refcount each time level
has changed but does not decrement it. You are unable to unload raid0
module after reshape because raid0 reshape changes level to raid4
and back to raid0.

Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 0e3ef49e 17-Jun-2016 Arnd Bergmann <arnd@arndb.de>

md: use seconds granularity for error logging

The md code stores the exact time of the last error in the
last_read_error variable using a timespec structure. It only
ever uses the seconds portion of that though, so we can
use a scalar for it.

There won't be an overflow in 2038 here, because it already
used monotonic time and 32-bit is enough for that, but I've
decided to use time64_t for consistency in the conversion.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>


# d787be40 02-Jun-2016 NeilBrown <neilb@suse.com>

md: reduce the number of synchronize_rcu() calls when multiple devices fail.

Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some cases.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconvenient.

As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then call synchronize_rcu() once for
all the flagged devices. Then the ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.

fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 8430e7e0 02-Jun-2016 NeilBrown <neilb@suse.com>

md: disconnect device from personality before trying to remove it.

When the HOT_REMOVE_DISK ioctl is used to remove a device, we
call remove_and_add_spares() which will remove it from the personality
if possible. This improves the chances that the removal will succeed.

When writing "remove" to dev-XX/state, we don't. So that can fail more easily.

So add the remove_and_add_spares() into "remove" handling.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4ba1e788 12-Jun-2016 Xiao Ni <xni@redhat.com>

MD:Update superblock when err == 0 in size_store

This is a simple check before updating the superblock. It should update
the superblock when update_size return 0.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 5b1f5bc3 08-Jun-2016 Cong Wang <xiyou.wangcong@gmail.com>

md: use a mutex to protect a global list

We saw a list corruption in the list all_detected_devices:

WARNING: CPU: 16 PID: 226 at lib/list_debug.c:29 __list_add+0x3c/0xa9()
list_add corruption. next->prev should be prev (ffff880859d58320), but was ffff880859ce74c0. (next=ffffffff81abfdb0).
Modules linked in: ahci libahci libata sd_mod scsi_mod
CPU: 16 PID: 226 Comm: kworker/u241:4 Not tainted 4.1.20 #1
Hardware name: Dell Inc. PowerEdge C6220/04GD66, BIOS 2.2.3 11/07/2013
Workqueue: events_unbound async_run_entry_fn
0000000000000000 ffff880859a5baf8 ffffffff81502872 ffff880859a5bb48
0000000000000009 ffff880859a5bb38 ffffffff810692a5 ffff880859ee8828
ffffffff812ad02c ffff880859d58320 ffffffff81abfdb0 ffff880859eb90c0
Call Trace:
[<ffffffff81502872>] dump_stack+0x4d/0x63
[<ffffffff810692a5>] warn_slowpath_common+0xa1/0xbb
[<ffffffff812ad02c>] ? __list_add+0x3c/0xa9
[<ffffffff81069305>] warn_slowpath_fmt+0x46/0x48
[<ffffffff812ad02c>] __list_add+0x3c/0xa9
[<ffffffff81406f28>] md_autodetect_dev+0x41/0x62
[<ffffffff81285862>] rescan_partitions+0x25f/0x29d
[<ffffffff81506372>] ? mutex_lock+0x13/0x31
[<ffffffff811a090f>] __blkdev_get+0x1aa/0x3cd
[<ffffffff811a0b91>] blkdev_get+0x5f/0x294
[<ffffffff81377ceb>] ? put_device+0x17/0x19
[<ffffffff8128227c>] ? disk_put_part+0x12/0x14
[<ffffffff812836f3>] add_disk+0x29d/0x407
[<ffffffff81384345>] ? __pm_runtime_use_autosuspend+0x5c/0x64
[<ffffffffa004a724>] sd_probe_async+0x115/0x1af [sd_mod]
[<ffffffff81083177>] async_run_entry_fn+0x72/0x12c
[<ffffffff8107c44c>] process_one_work+0x198/0x2ce
[<ffffffff8107cac7>] worker_thread+0x1dd/0x2bb
[<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15
[<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15
[<ffffffff81080d9c>] kthread+0xae/0xb6
[<ffffffff81080000>] ? param_array_set+0x40/0xfa
[<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61
[<ffffffff81508152>] ret_from_fork+0x42/0x70
[<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61

I suspect it is because there is no lock protecting this
global list, autostart_arrays() is called in ioctl() path
where there is no lock.

Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
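
The fix described above can be illustrated with a minimal user-space
sketch. It assumes a pthread mutex in place of the kernel mutex, and the
names detected_dev, detected_add and detected_drain are hypothetical
stand-ins for the list entry, the producer path (compare
md_autodetect_dev) and the consumer path (compare autostart_arrays). It
is not the code from the patch.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for one entry on the global detected-devices list. */
struct detected_dev {
	struct detected_dev *next;
	unsigned int dev;
};

static struct detected_dev *detected_head;	/* the shared global list */
static pthread_mutex_t detected_lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer path: every insertion happens with the lock held. */
int detected_add(unsigned int dev)
{
	struct detected_dev *node = malloc(sizeof(*node));

	if (!node)
		return -1;
	node->dev = dev;

	pthread_mutex_lock(&detected_lock);
	node->next = detected_head;
	detected_head = node;
	pthread_mutex_unlock(&detected_lock);
	return 0;
}

/* Consumer path: drain the list under the same lock; returns entries freed. */
int detected_drain(void)
{
	int n = 0;

	pthread_mutex_lock(&detected_lock);
	while (detected_head) {
		struct detected_dev *node = detected_head;

		detected_head = node->next;
		free(node);
		n++;
	}
	pthread_mutex_unlock(&detected_lock);
	return n;
}
```

Because both paths take the same lock, a concurrent add and drain can no
longer observe a half-updated list.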


# 28a8f0d3 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 796a5cf0 05-Jun-2016 Mike Christie <mchristi@redhat.com>

md: use bio op accessors

Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4e49ea4a 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block/fs/drivers: remove rw argument from submit_bio

This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>


# db767672 02-Jun-2016 Guoqing Jiang <gqjiang@suse.com>

md: simplify the code with md_kick_rdev_from_array

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# bb8bf15b 02-Jun-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: fix deadlock issue when add disk to an recoverying array

Adding a disk to an array which is performing recovery
is a little complicated: we need to both reap the
sync thread and add the disk, and doing so caused
the following deadlock.

linux44:~ # ps aux|grep md|grep D
root 1822 0.0 0.0 0 0 ? D 16:50 0:00 [md127_resync]
root 1848 0.0 0.0 19860 952 pts/0 D+ 16:50 0:00 mdadm --manage /dev/md127 --re-add /dev/vdb
linux44:~ # cat /proc/1848/stack
[<ffffffff8107afde>] kthread_stop+0x6e/0x120
[<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod]
[<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod]
[<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod]
[<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod]
[<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140
[<ffffffff811a6b98>] vfs_write+0xb8/0x1e0
[<ffffffff811a75b8>] SyS_write+0x48/0xa0
[<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
[<00007f068ea1ed30>] 0x7f068ea1ed30
linux44:~ # cat /proc/1822/stack
[<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod]
[<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod]
[<ffffffff8107ad94>] kthread+0xb4/0xc0
[<ffffffff8152a518>] ret_from_fork+0x58/0x90

Task1848                                Task1822
md_attr_store (held reconfig_mutex by call mddev_lock())
action_store
md_reap_sync_thread
md_unregister_thread
kthread_stop                            md_wakeup_thread(mddev->thread);
                                        wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING))

md_check_recovery is triggered by waking mddev->thread,
but it can't clear the MD_CHANGE_PENDING flag because it can't
take the lock that md_attr_store already holds.

To solve the deadlock problem, we move "->resync_finish()"
from md_do_sync to md_reap_sync_thread (after md_update_sb).
MD_HELD_RESYNC_LOCK is also introduced, since it is possible
that a node can't get the resync lock in md_do_sync.

We then no longer need to wait for MD_CHANGE_PENDING to be
cleared, since the metadata should be updated after md_update_sb,
so just call resync_finish if MD_HELD_RESYNC_LOCK is set.

We also unify the code after the skip label, since setting PENDING
in the non-clustered case should be harmless.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 85ad1d13 03-May-2016 Guoqing Jiang <gqjiang@suse.com>

md: set MD_CHANGE_PENDING in a atomic region

Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING do so atomically;
bit_clear_unless (suggested by Neil) is introduced for this purpose.

Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
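
The race described above can be sketched in user space with C11 atomics.
This is an illustrative analogue only, not the kernel's bit_clear_unless
macro: the flag values and the helper names clear_unless and
request_update are assumptions made for the example. Steps 1 and 2
become a single atomic OR, and the clearing side uses a compare-exchange
loop that refuses to clear PENDING while a reason bit is still set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative flag bits; the values are assumptions for this sketch. */
#define CHANGE_DEVS	(1UL << 0)
#define CHANGE_CLEAN	(1UL << 1)
#define CHANGE_PENDING	(1UL << 2)

/*
 * Atomically clear the 'clear' bits unless any 'test' bit is set.
 * Returns true if the bits were cleared, false if a test bit blocked it.
 */
bool clear_unless(atomic_ulong *flags, unsigned long clear, unsigned long test)
{
	unsigned long old = atomic_load(flags);

	do {
		if (old & test)
			return false;
		/* on failure, 'old' is reloaded and the test is repeated */
	} while (!atomic_compare_exchange_weak(flags, &old, old & ~clear));

	return true;
}

/* Flag the reason and set PENDING in one atomic step (steps 1 and 2). */
void request_update(atomic_ulong *flags, unsigned long reason)
{
	atomic_fetch_or(flags, reason | CHANGE_PENDING);
}
```

With this shape the updater cannot clear PENDING between a waiter's
step 1 and step 2, because both bits appear together.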


# 092398dc 03-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

md: md.c: fix oops in mddev_suspend for raid0

Introduced by upstream commit 70d9798b95562abac005d4ba71d28820f9a201eb

The raid0 personality does not create mddev->thread, as opposed to
other personalities, so the unconditional access to it in
mddev_suspend() causes an oops.

The patch checks for mddev->thread in order to keep the
intention of the aforementioned commit.

Fixes: 70d9798b9556 ("MD: warn for potential deadlock")
Cc: stable@vger.kernel.org (4.5+)
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# a578183e 02-May-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: wakeup thread if activated a spare disk

When a device is re-added, it will ultimately need
to be activated and that happens in md_check_recovery,
so we need to set MD_RECOVERY_NEEDED right after
remove_and_add_spares.

A specific issue without the change is that when
one node performs fail/remove/re-add on a disk, the
slave nodes cannot add the disk back to the array as
expected (it is added as missing instead of in sync). So
give the slave nodes a chance to resync.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ab5a98b1 02-May-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: change array_sectors and update size are not supported

Currently, some features are not yet supported,
such as changing array_sectors and updating the size, so
return EINVAL for them and list them in the documentation.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 2c97cf13 02-May-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluser: make resync_finish only called after pers->sync_request

It is not reasonable for cluster raid to release the resync
lock before the last pers->sync_request has finished.

As the metadata will be changed when a node performs resync,
we need to inform other nodes to update metadata, so the
MD_CHANGE_PENDING flag is set before finishing resync.

Then metadata_update_finish is moved ahead to ensure that the
METADATA_UPDATED msg is sent before finishing resync, and
metadata_update_start needs to be run after the "repeat:" label
accordingly.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 41a9a0dc 02-May-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: change resync lock from asynchronous to synchronous

If multiple nodes attempt to resync at the same time
they need to be serialized so they don't duplicate effort. This
serialization is done by locking the 'resync' DLM lock.

Currently if a node cannot get the lock immediately it doesn't
request notification when the lock becomes available (i.e.
DLM_LKF_NOQUEUE is set), so it may not reliably find out when it
is safe to try again.

Rather than trying to arrange an async wake-up when the lock
becomes available, switch to using synchronous locking - this is
a lot easier to think about. As it is not permitted to block in
the 'raid1d' thread, move the locking to the resync thread. So
the resync thread is forked immediately, but it blocks until the
resync lock is available. Once the lock is locked it checks again
if any resync action is needed.

A particular symptom of the current problem is that a node can
get stuck with "resync=pending" indefinitely.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 9c573de3 25-Apr-2016 Shaohua Li <shli@fb.com>

MD: make bio mergeable

blk_queue_split marks a bio unmergeable, which makes sense for a normal bio.
But when the bio is dispatched to the underlying disk, the blk_queue_split
checks are no longer valid, so the bio may become mergeable again.

In the reported bug, this causes a sharp drop in trim performance against raid0:
https://bugzilla.kernel.org/show_bug.cgi?id=117051

Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio)
Cc: stable@vger.kernel.org (v4.3+)
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 56883a7e 30-Mar-2016 Jens Axboe <axboe@fb.com>

md: update to using blk_queue_write_cache()

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# ed3b98c7 29-Mar-2016 Shaohua Li <shli@fb.com>

MD: add rdev reference for super write

Xiao Ni reported below crash:
[26396.335146] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
[26396.342990] IP: [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod]
[26396.349449] PGD 0
[26396.351468] Oops: 0002 [#1] SMP
[26396.354898] Modules linked in: ext4 mbcache jbd2 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_td
[26396.408404] CPU: 5 PID: 3261 Comm: loop0 Not tainted 4.5.0 #1
[26396.414140] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.2.2 09/15/2014
[26396.421608] task: ffff8808339be680 ti: ffff8808365f4000 task.ti: ffff8808365f4000
[26396.429074] RIP: 0010:[<ffffffffa0425b00>] [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod]
[26396.437952] RSP: 0018:ffff8808365f7c38 EFLAGS: 00010046
[26396.443252] RAX: ffffffffa0425ae0 RBX: ffff8804336a7900 RCX: ffffe8f9f7b41198
[26396.450371] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8804336a7900
[26396.457489] RBP: ffff8808365f7c50 R08: 0000000000000005 R09: 00001801e02ce3d7
[26396.464608] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[26396.471728] R13: ffff8808338d9a00 R14: 0000000000000000 R15: ffff880833f9fe00
[26396.478849] FS: 00007f9e5066d740(0000) GS:ffff880237b40000(0000) knlGS:0000000000000000
[26396.486922] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[26396.492656] CR2: 00000000000002a8 CR3: 00000000019ea000 CR4: 00000000000006e0
[26396.499775] Stack:
[26396.501781] ffff8804336a7900 0000000000000000 0000000000000000 ffff8808365f7c68
[26396.509199] ffffffff81308cd0 ffff8804336a7900 ffff8808365f7ca8 ffffffff81310637
[26396.516618] 00000000a0233a00 ffff880833f9fe00 0000000000000000 ffff880833fb0000
[26396.524038] Call Trace:
[26396.526485] [<ffffffff81308cd0>] bio_endio+0x40/0x60
[26396.531529] [<ffffffff81310637>] blk_update_request+0x87/0x320
[26396.537439] [<ffffffff8131a20a>] blk_mq_end_request+0x1a/0x70
[26396.543261] [<ffffffff81313889>] blk_flush_complete_seq+0xd9/0x2a0
[26396.549517] [<ffffffff81313ccf>] flush_end_io+0x15f/0x240
[26396.554993] [<ffffffff8131a22a>] blk_mq_end_request+0x3a/0x70
[26396.560815] [<ffffffff8131a314>] __blk_mq_complete_request+0xb4/0xe0
[26396.567246] [<ffffffff8131a35c>] blk_mq_complete_request+0x1c/0x20
[26396.573506] [<ffffffffa04182df>] loop_queue_work+0x6f/0x72c [loop]
[26396.579764] [<ffffffff81697844>] ? __schedule+0x2b4/0x8f0
[26396.585242] [<ffffffff810a7812>] kthread_worker_fn+0x52/0x170
[26396.591065] [<ffffffff810a77c0>] ? kthread_create_on_node+0x1a0/0x1a0
[26396.597582] [<ffffffff810a7238>] kthread+0xd8/0xf0
[26396.602453] [<ffffffff810a7160>] ? kthread_park+0x60/0x60
[26396.607929] [<ffffffff8169bdcf>] ret_from_fork+0x3f/0x70
[26396.613319] [<ffffffff810a7160>] ? kthread_park+0x60/0x60

md_super_write() and the corresponding md_super_wait() are generally called
with reconfig_mutex locked, which prevents the disk from disappearing. There
is one case where this rule is broken: write_sb_page in bitmap.c doesn't hold
the mutex. next_active_rdev does increase the rdev reference, but it decreases
the reference too early (e.g. before the IO finishes), so the disk can
disappear in that window. We unconditionally increase the rdev reference in
md_super_write() to avoid the race.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>


# 466ad292 21-Mar-2016 Wei Fang <fangwei1@huawei.com>

md: fix a trivial typo in comments

Fix a trivial typo in md_ioctl().

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 70d9798b 24-Feb-2016 Shaohua Li <shli@fb.com>

MD: warn for potential deadlock

The personality thread shouldn't call mddev_suspend(), because
mddev_suspend() will wait for all IO to finish, but IO is handled in the
personality thread, so this could cause a deadlock. To catch this early, add a
warning if mddev_suspend() is called from the personality thread.

Suggested-by: NeilBrown <neilb@suse.com>
Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 399146b8 17-Feb-2016 Sebastian Parschauer <sebastian.riemer@profitbricks.com>

md: Drop sending a change uevent when stopping

When an MD device is stopped, its device node /dev/mdX may still
exist afterwards or be recreated by udev. The next open() call
can lead to creation of an inoperable MD device. The reason for
this is that a change event (KOBJ_CHANGE) is sent to udev which
races against the remove event (KOBJ_REMOVE) from md_free().
So drop sending the change event.

A change is likely also required in mdadm as many versions send the
change event to udev as well.

Neil mentioned that the change event is a workaround for old kernels
(commit 934d9c23b4c7 "md: destroy partitions and notify udev when md array is stopped.").
New mdadm can handle device removal now, so this isn't required any more.

Cc: NeilBrown <neilb@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 1501efad 13-Jan-2016 Dan Williams <dan.j.williams@intel.com>

md/raid: only permit hot-add of compatible integrity profiles

It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue. Prevent adding new disks or otherwise onlining
spares to an array if the device has an incompatible integrity profile.

The original change to the blk_integrity_unregister implementation in
md, commit c7bfced9a671 "md: suspend i/o during runtime
blk_integrity_unregister", introduced an immediate hang regression.

This policy of disallowing changes to the integrity profile once one has
been established is shared with DM.

Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277]

[ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[ 59.078302] md: data integrity enabled on md0
[..]
[ 90.489209] md0: incompatible integrity profile for pmem1m
[..]
[ 205.671277] md: super_written gets error=-5
[ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[ 205.677386] md/raid1:md0: Operation continuing on 1 devices.
[ 205.683037] RAID1 conf printout:
[ 205.684699] --- wd:1 rd:2
[ 205.685972] disk 0, wo:0, o:1, dev:pmem0s
[ 205.687562] disk 1, wo:1, o:1, dev:pmem1s
[ 205.691717] md: recovery of RAID array md0

Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: <stable@vger.kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 87d4d916 06-Jan-2016 Shaohua Li <shli@fb.com>

MD: add journal with array suspended

Hot-adding a journal disk in recovery thread context brings a lot of trouble
as IO could be running. Unlike spare disk hot-add, adding the journal disk
with the array suspended makes more sense and the implementation is much easier.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# a62ab49e 06-Jan-2016 Shaohua Li <shli@fb.com>

md: set MD_HAS_JOURNAL in correct places

Set MD_HAS_JOURNAL when an array is loaded or the journal is initialized.
This avoids setting the flag too early in journal disk hot-add.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# d3b407fb 06-Jan-2016 Dan Williams <dan.j.williams@intel.com>

badblocks: rename badblocks_free to badblocks_exit

For symmetry with badblocks_init() make it clear that this path only
destroys incremental allocations of a badblocks instance, and does not
free the badblocks instance itself.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# fc974ee2 24-Dec-2015 Vishal Verma <vishal.l.verma@intel.com>

md: convert to use the generic badblocks code

Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 274d8cbd 03-Jan-2016 NeilBrown <neilb@suse.com>

md: Remove 'ready' field from mddev.

This field is always set in tandem with ->pers, and when it is tested
->pers is also tested. So ->ready is not needed.

It was needed once, but code rearrangement and locking changes have
removed that need.

Signed-off-by: NeilBrown <neilb@suse.com>


# bb9ef716 27-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md: remove unnecesary md_new_event_inintr

md_new_event had removed sysfs_notify since 'commit 72a23c211e45
("Make sure all changes to md/sync_action are notified.")', so we
can use md_new_event and delete md_new_event_inintr.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# f6b6ec5c 20-Dec-2015 Shaohua Li <shli@fb.com>

raid5-cache: add journal hot add/remove support

Add support for journal disk hot add/remove. The md part is mostly trivial
checks. The raid5 part is a little tricky. For hot-remove, we can't wait
for pending writes as it's called from raid5d; the wait would cause a deadlock.
We simply fail the hot-remove. A hot-remove retry can succeed
eventually, since if the journal disk is faulty all pending writes will
fail and finish. For hot-add, since an array supporting a journal but
without a journal disk will be marked read-only, we are safe to hot-add the
journal without stopping IO (it should be read IO, while the journal only
handles write IO).

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 9ebc6ef1 20-Dec-2015 Deepa Dinamani <deepa.kernel@gmail.com>

drivers: md: use ktime_get_real_seconds()

get_seconds() API is not y2038 safe on 32 bit systems and the API
is deprecated. Replace it with calls to ktime_get_real_seconds()
API instead. Change mddev structure types to time64_t accordingly.

32 bit signed timestamps will overflow in the year 2038.

Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.

Clamp the time64_t timestamps to positive values with a max of U32_MAX
when returning from GET_ARRAY_INFO ioctl to accommodate above changes
in the data type of timestamps to unsigned int.

v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.

Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types. Remove the truncation of the bits while
writing to or reading from these from the disk.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
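
The clamping described above can be sketched as a small standalone
helper. This is a user-space illustration, not the code from the patch:
int64_t stands in for time64_t, uint32_t for the unsigned int ioctl
field, and the name clamp_time_to_u32 is hypothetical. Negative values
are assumed to clamp to 0.

```c
#include <stdint.h>

/*
 * Clamp a 64-bit seconds value into the range of an unsigned 32-bit
 * field: negative values become 0, values above UINT32_MAX saturate.
 */
uint32_t clamp_time_to_u32(int64_t t)
{
	if (t < 0)
		return 0;
	if (t > (int64_t)UINT32_MAX)
		return UINT32_MAX;
	return (uint32_t)t;
}
```

An unsigned 32-bit count of seconds since 1970 covers roughly 136 years,
which is where the year 2106 limit mentioned above comes from.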


# 3312c951 20-Dec-2015 Arnd Bergmann <arnd@arndb.de>

md: avoid warning for 32-bit sector_t

When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which
means we cannot have devices with more than 2TB, and the code that
is trying to handle compatibility support for large devices in
md version 0.90 is meaningless but also causes a compile-time warning:

drivers/md/md.c: In function 'super_90_load':
drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow]
drivers/md/md.c: In function 'super_90_rdev_size_change':
drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow]

This adds a check for CONFIG_LBDAF to avoid even getting into this
code path, and also adds an explicit cast to let the compiler know
it doesn't have to warn about the truncation.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>


# abf3508d 20-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md: update comment for md_allow_write

MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after
commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"),
so make the change accordingly.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 15858fa5 20-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: Defer MD reloading to mddev->thread

Reloading of superblock must be performed under reconfig_mutex. However,
this cannot be done with md_reload_sb because it would deadlock with
the message DLM lock. So, we defer it in md_check_recovery() which is
executed by mddev->thread.

This introduces a new flag, MD_RELOAD_SB, which if set, will reload the
superblock. And good_device_nr is also added to 'struct mddev' which is
used to get the num of the good device within cluster raid.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# f6a2dc64 20-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: append some actions when change bitmap from clustered to none

For clustered raid, we need to take extra actions when changing
the bitmap to none.

1. Check whether all the bitmap locks can be acquired. If so,
we can continue the change, since cluster raid is then only active
on the current node. Otherwise return failure and unlock the related
bitmap locks.
2. Set nodes to 0 and then leave the cluster environment.
3. Release the other nodes' bitmap locks.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 09afd2a8 20-Dec-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Allow spare devices to be marked as faulty

If a spare device was marked faulty, it would not be reflected
in receiving nodes because it would mark it as activated and continue.
Continue the operation, so it may be set as faulty.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 54a88392 20-Dec-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Fix the remove sequence with the new MD reload code

The remove disk message does not need metadata_update_start(), but
can be an independent message.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 659b254f 20-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: remove a disk asynchronously from cluster environment

For cluster raid, if one disk can't be reached on one node, then
the other nodes receive the REMOVE message for the disk.

On a receiving node, we can't call md_kick_rdev_from_array to remove
the disk from the array synchronously, since the disk might still be busy
on that node. So set a ClusterRemove flag on the disk, then
let the thread do the removal eventually.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 312045ee 20-Dec-2015 NeilBrown <neilb@suse.com>

md: remove check for MD_RECOVERY_NEEDED in action_store.

md currently doesn't allow a 'sync_action' such as 'reshape' to be set
while MD_RECOVERY_NEEDED is set.

This is a problem, particularly since commit 738a273806ee as that can
cause ->check_shape to call mddev_resume() which sets
MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is
very likely that MD_RECOVERY_NEEDED is still set.

Testing for this flag is not really needed and is in any case very
racy as it can be set at any moment - asynchronously. Any race
between setting a sync_action and setting MD_RECOVERY_NEEDED must
already be handled properly in some locked code, probably
md_check_recovery(), so remove the test here.

The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
so we should test it again after getting mddev_lock().

As this fixes a race and a regression which can cause 'reshape' to
fail, it is suitable for -stable kernels since 4.1

Reported-by: Xiao Ni <xni@redhat.com>
Fixes: 738a273806ee ("md/raid5: fix allocation of 'scribble' array.")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>


# cb01c549 17-Dec-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

Fix remove_and_add_spares removes drive added as spare in slot_store

Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6
introduced a regression which would remove a recently added spare via
slot_store. Revert part of the patch which touches slot_store() and add
the disk directly using pers->hot_add_disk()

Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific
rdev")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 0dc10e50 17-Dec-2015 Mikulas Patocka <mpatocka@redhat.com>

md: fix bug due to nested suspend

The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc
causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with
this BUG, the reason is that we attempt to suspend a device that is
already suspended. See also
https://bugzilla.redhat.com/show_bug.cgi?id=1283491

This patch fixes the bug by changing functions mddev_suspend and
mddev_resume to always nest.
The number of nested calls to mddev_nested_suspend is kept in the
variable mddev->suspended.
[neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend]

kernel BUG at drivers/md/md.c:317!
CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000

YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000000000000001111 Not tainted
r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001
sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748
CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000
ORIG_R28: 00000000000000b0
IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
Backtrace:
[<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
[<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
[<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
[<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
[<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
[<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
[<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
[<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558

Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
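
The "always nest" behaviour described above amounts to a suspend
counter. A minimal user-space sketch follows, assuming a hypothetical
toy_dev structure with a quiesced field standing in for the real
quiesce work. The real mddev_suspend() and mddev_resume() do
considerably more than this.

```c
#include <assert.h>

/* Hypothetical stand-in for the parts of struct mddev used here. */
struct toy_dev {
	int suspended;	/* number of nested suspend calls outstanding */
	int quiesced;	/* 1 while IO is actually stopped */
};

/* Only the outermost suspend performs the quiesce. */
void toy_suspend(struct toy_dev *d)
{
	if (d->suspended++)
		return;		/* already suspended: just nest */
	d->quiesced = 1;
}

/* Only the matching outermost resume restarts IO. */
void toy_resume(struct toy_dev *d)
{
	assert(d->suspended > 0);
	if (--d->suspended)
		return;		/* an outer suspend is still in effect */
	d->quiesced = 0;
}
```

A caller that suspends an already-suspended device, as raid1_add_disk
does in the backtrace above, now just increments the count instead of
hitting a BUG.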


# 9b15603d 17-Dec-2015 Shaohua Li <shli@fb.com>

MD: change journal disk role to disk 0

Neil pointed out that setting the journal disk role to raid_disks will confuse
reshape if we eventually support reshape. Switch the role to 0 (we
should be fine as long as the value is >=0) and skip sysfs file creation to
avoid an error.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# dece1635 05-Nov-2015 Jens Axboe <axboe@fb.com>

block: change ->make_request_fn() and users to return a queue cookie

No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>


# 339421de 08-Oct-2015 Song Liu <songliubraving@fb.com>

MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW

When a RAID-4/5/6 array suffers from a missing journal device, we put
the array in read-only state. We should not allow a transition to
read-write states (clean and active) before the journal device is replaced.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# f2076e7d 08-Oct-2015 Shaohua Li <shli@fb.com>

MD: set journal disk ->raid_disk

Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# a3dfbdaa 08-Oct-2015 Song Liu <songliubraving@fb.com>

MD: kick out journal disk if it's not fresh

When the journal disk is faulty and we are reassembling the raid array, the
journal disk is old. We don't allow the journal disk to be added to the raid
array. Since the journal disk is missing from the array, raid5 will mark
the array read-only.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# a97b7896 08-Oct-2015 Song Liu <songliubraving@fb.com>

MD: add new bit to indicate raid array with journal

If a raid array has journal feature bit set, add a new bit to indicate
this. If the array is started without journal disk existing, we know
there is something wrong.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 9efdca16 12-Oct-2015 Shaohua Li <shli@fb.com>

MD: fix info output for journal disk

journal disk can be faulty. The Journal and Faulty aren't exclusive with
each other.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# ac6096e9 04-Oct-2015 Shaohua Li <shli@fb.com>

md: show journal for journal disk in disk state sysfs

Journal disk state sysfs entry should indicate it's journal

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 0b020e85 04-Sep-2015 Song Liu <songliubraving@fb.com>

skip match_mddev_units check for special roles

match_mddev_units is used to check whether 2 RAID arrays share
same disk(s). Arrays that share disk(s) will not do resync at the
same time for better performance (fewer HDD seek). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not participate in resync.
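
As a rough illustration of the exclusion described here, the userspace sketch below shows a predicate that decides whether a device should be considered when comparing two arrays. The struct and function names are hypothetical and are not the kernel's; only the rule itself (ignore Faulty and Journal devices and devices with raid_disk < 0) comes from this commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's per-device state. */
struct toy_rdev {
    bool faulty;     /* device has failed */
    bool journal;    /* device is a write journal */
    int  raid_disk;  /* < 0: no active slot, e.g. a spare */
};

/*
 * True if the device takes part in resync and therefore matters when
 * checking whether two arrays share a physical disk.
 */
bool toy_counts_for_unit_match(const struct toy_rdev *rdev)
{
    return !(rdev->faulty || rdev->journal || rdev->raid_disk < 0);
}
```

A caller comparing two arrays would simply skip any pair in which either device fails this predicate.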

In this patch, match_mddev_units skips check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# bd18f646 02-Sep-2015 Shaohua Li <shli@fb.com>

md: skip resync for raid array with journal

If a raid array has a journal, the journal can guarantee consistency, so
we can skip resync after an unclean shutdown. The exceptions are raid
creation and user-initiated resync, for which we still do a raid resync.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# d01552a7 30-Oct-2015 NeilBrown <neilb@suse.com>

Revert "md: allow a partially recovered device to be hot-added to an array."

This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57.

This commit is poorly justified, I can find no discussion in email,
and it clearly causes a problem.

If a device which is being recovered fails and is subsequently
re-added to an array, there could easily have been changes to the
array *before* the point where the recovery was up to. So the
recovery must start again from the beginning.

If a spare is being recovered and fails, then when it is re-added we
really should do a bitmap-based recovery up to the recovery-offset,
and then a full recovery from there. Before this reversion, we only
did the "full recovery from there" which is not correct. After this
reversion we will do a full recovery from the start, which is safer
but not ideal.

It will be left to a future patch to arrange the two different styles
of recovery.

Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (3.14+)
Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")


# 3069aa8d 13-Aug-2015 Shaohua Li <shli@fb.com>

md: override md superblock recovery_offset for journal device

Journal device stores data in a log structure. We need to record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# bac624f3 13-Aug-2015 Song Liu <songliubraving@fb.com>

MD: add a new disk role to present write journal device

Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compatibility.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# c4d4c91b 13-Aug-2015 Song Liu <songliubraving@fb.com>

MD: replace special disk roles with macros

Add the following two macros for special roles: spare and faulty

MD_DISK_ROLE_SPARE 0xffff
MD_DISK_ROLE_FAULTY 0xfffe

Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role,
and minimal value of special role.
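
A minimal sketch of how these values might be used to tell regular roles from special ones follows. The three macro values are the ones listed above; the helper name is hypothetical, and treating only values strictly below MD_DISK_ROLE_MAX as regular slot numbers is an assumption about the boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the commit message above. */
#define MD_DISK_ROLE_SPARE  0xffff
#define MD_DISK_ROLE_FAULTY 0xfffe
#define MD_DISK_ROLE_MAX    0xff00

/*
 * Hypothetical helper: a role below MD_DISK_ROLE_MAX is taken to be an
 * ordinary slot number; anything at or above it is a special role.
 * The exact boundary handling is an assumption.
 */
bool toy_role_is_regular(uint16_t role)
{
    return role < MD_DISK_ROLE_MAX;
}
```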

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 28c1b9fd 21-Oct-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Call update_raid_disks() if another node --grow's raid_disks

To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.

This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.

In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.

mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# c7bfced9 21-Oct-2015 Dan Williams <dan.j.williams@intel.com>

md: suspend i/o during runtime blk_integrity_unregister

Synchronize pending i/o against a change in the integrity profile to
avoid the possibility of spurious integrity errors. Given linear_add()
is suspending the mddev before manipulating the mddev, do the same for
the other personalities.

Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9609b994 21-Oct-2015 Dan Williams <dan.j.williams@intel.com>

md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown

Now that the integrity profile is statically allocated there is no work
to do when shutting down an integrity enabled block device.

Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <JBottomley@Odin.com>
Acked-by: NeilBrown <neilb@suse.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 25520d55 21-Oct-2015 Martin K. Petersen <martin.petersen@oracle.com>

block: Inline blk_integrity in struct gendisk

Up until now the integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

- Moving the blk_integrity definition to genhd.h.

- Inlining blk_integrity in struct gendisk.

- Removing the dynamic allocation code.

- Adding helper functions which allow gendisk to set up and tear down
the integrity sysfs dir when a disk is added/deleted.

- Adding a blk_integrity_revalidate() callback for updating the stable
pages bdi setting.

- The calls that depend on whether a device has an integrity profile or
not now key off of the bi->profile pointer.

- Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 23b63f9f 12-Oct-2015 Guoqing Jiang <gqjiang@suse.com>

md: check the return value for metadata_update_start

We shouldn't run the related md_cluster_ops functions if
metadata_update_start returned failure.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>


# a9720903 12-Oct-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: only call kick_rdev_from_array after remove disk successfully

For cluster raid, we should not kick the disk from the array if it can't be
removed from the array successfully.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# dbb64f86 01-Oct-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Fix adding of new disk with new reload code

Adding the disk worked incorrectly with the new reload code. Fix it:

- No operation should be performed on rdev marked as Candidate
- After a metadata update operation, kick disk if role is 0xfffe
else clear Candidate bit and continue with the regular change check.
- Saving the mode of the lock resource to check if token lock is already
locked, because it can be called twice while adding a disk. However,
unlock_comm() must be called only once.
- add_new_disk() is called by the node initiating the --add operation.
If it needs to be canceled, call add_new_disk_cancel(). The operation
is completed by md_update_sb() which will write and unlock the
communication.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# c186b128 30-Sep-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Perform resync/recovery under a DLM lock

Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres provides the mutual exclusion
so that only one node performs the recovery/resync at a time.

If a node is unable to get the resync_lockres, because recovery is
being performed by another node, it sets MD_RECOVER_NEEDED so as
to schedule recovery in the future.

Remove the debug message in resync_info_update()
used during development.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 2aa82191 28-Sep-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Perform a lazy update

In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.

With this patch, just before the update, we check for the changes
and if the changes are already in the superblock, we abort the update
after clearing all the flags.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 70bcecdb 21-Aug-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Improve md_reload_sb to be less error prone

md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.

Instead, read the superblock of one of the "good" rdevs and update
the necessary information:

- read the superblock into a newly allocated page, by temporarily
swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes
1. iterates over list of active devices and checks the matching
dev_roles[] value.
If that is 'faulty', the device must be marked as faulty
- call md_error to mark the device as faulty. Make sure
not to set CHANGE_DEVS and wakeup mddev->thread or else
it would initiate a resync process, which is the responsibility
of the "primary" node.
- clear the Blocked bit
- Call remove_and_add_spares() to hot remove the device.
If the device is 'spare':
- call remove_and_add_spares() to get the number of spares
added in this operation.
- Reduce mddev->degraded to mark the array as not degraded.
2. reset recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
is equal to MaxSector, call spare_active() to set it In_sync

This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 2910ff17 28-Sep-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md: remove_and_add_spares() to activate specific rdev

remove_and_add_spares() checks for all devices to activate spare.
Change it to activate a specific device if a non-null rdev
argument is passed.

remove_and_add_spares() can be used to activate spares in
slot_store() as well.

For hot_remove_disk(), check if rdev->raid_disk == -1 before
calling remove_and_add_spares()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# c40f341f 18-Aug-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Use a small window for resync

Suspending the entire device for resync could take too long. Resync
in small chunks.

cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
list.
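
The four steps above can be sketched in userspace roughly as follows. This is not the raid1 code: the struct, the function name, the "inside the window" test, and expressing the 32M window in 512-byte sectors are all assumptions made for illustration. The wait_barrier() call and the cluster message are only marked by comments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the 32M window is counted in 512-byte sectors. */
#define TOY_WINDOW_SECTORS ((uint64_t)(32u * 1024u * 1024u) >> 9)

/* Hypothetical stand-in for cluster_sync_low/cluster_sync_high in r1conf. */
struct toy_window {
    uint64_t low;
    uint64_t high;
};

/*
 * Returns true if the window moved, in which case the caller would
 * notify the other nodes (step 4).
 */
bool toy_move_window(struct toy_window *w, uint64_t curr_resync_completed,
                     uint64_t sector_nr, uint64_t nr_sectors)
{
    uint64_t end = sector_nr + nr_sectors;

    /* Request still lies inside the current window: nothing to do. */
    if (sector_nr >= w->low && end <= w->high)
        return false;

    /* Step 1: restart the window at the last completed position. */
    w->low = curr_resync_completed;

    /* Step 2: if the request does not fit, the real code would call
     * wait_barrier() here before jumping ahead to sector_nr. */
    if (end > w->low + TOY_WINDOW_SECTORS)
        w->low = sector_nr;

    /* Step 3: the window always has a fixed size. */
    w->high = w->low + TOY_WINDOW_SECTORS;

    return true;
}
```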

bitmap_cond_end_sync is modified to allow forcing a sync in order
to bring curr_resync_completed up to date with the sector passed.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3c462c88 18-Aug-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md: Increment version for clustered bitmaps

Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
from assembling a clustered device.

In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.

Added MD_FEATURE_CLUSTERED in order to return error for older
kernels which would assemble MD even if the bitmap is corrupted.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# d4929add 18-Sep-2015 Shaohua Li <shli@fb.com>

md: clear CHANGE_PENDING in readonly array

If faulty disks of an array are more than allowed degraded number, the
array enters error handling. It will be marked as read-only with
MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't
clear CHANGE_PENDING bit for read-only array. If MD_CHANGE_PENDING is
set for a raid5 array, all returned IO will be held on a list till the
bit is clear. But recovery never clears this bit, so the IO is always in
pending state and never finishes. This has bad effects like upper layer
can't get an IO error and the array can't be stopped.

Fixes: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 88724bfa 23-Sep-2015 NeilBrown <neilb@suse.com>

md: wait for pending superblock updates before switching to read-only

If a superblock update is pending, wait for it to complete before
letting md_set_readonly() switch to readonly.
Otherwise we might lose important information about a device having
failed.

For external arrays, waiting for superblock updates can wait on
user-space, so in that case, just return an error.

Reported-and-tested-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 55ce74d4 13-Aug-2015 NeilBrown <neilb@suse.com>

md/raid1: ensure device failure recorded before write request returns.

When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update. So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
- set MD_CHANGE_PENDING when requesting a metadata update for a
failed device, so we can know with certainty when it completes
- queue requests that experienced an error on a new queue which
is only processed after the metadata update completes
- call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>


# 6022e75b 12-Aug-2015 NeilBrown <neilb@suse.com>

md: extend spinlock protection in register_md_cluster_operations

This code looks racy.

The only possible race is if two modules try to register at the same
time and that won't happen. But make the code look safe anyway.

Signed-off-by: NeilBrown <neilb@suse.com>


# dc737d7c 10-Jul-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: transfer the resync ownership to another node

When node A stops an array while the array is doing a resync, we need
to let another node B take over the resync task.

To achieve this, node A needs to send an explicit BITMAP_NEEDS_SYNC
message to the cluster. Node B, which receives that message, will then
invoke __recover_slot to do the resync.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 25b2edfa 24-Jul-2015 Sasha Levin <sasha.levin@oracle.com>

md: setup safemode_timer before it's being used

We used to set up the safemode_timer timer in md_run. If md_run
would fail before the timer was set up we'd end up trying to modify
a timer that doesn't have a callback function when we access safe_delay_store,
which would trigger a BUG.

neilb: delete init_timer() call as setup_timer() does that.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 5ed1df2e 23-Jul-2015 NeilBrown <neilb@suse.com>

md: sync sync_completed has correct value as recovery finishes.

There can be a small window between the moment that recovery
actually writes the last block and the time when various sysfs
and /proc/mdstat attributes report that it has finished.
During this time, 'sync_completed' can have the wrong value.
This can confuse monitoring software.

So:
- don't set curr_resync_completed beyond the end of the devices,
- set it correctly when resync/recovery has completed.

Signed-off-by: NeilBrown <neilb@suse.com>


# c5e19d90 16-Jul-2015 NeilBrown <neilb@suse.com>

md: be careful when testing resync_max against curr_resync_completed.

While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.

Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
curr_resync_completed > resync_max
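
A minimal sketch of the kind of guard described above, written as a standalone helper. The function name is hypothetical and the actual test in md.c may be shaped differently; the point is only that the comparison happens before the unsigned subtraction, so the result can never wrap around to a huge value.

```c
#include <stdint.h>

/*
 * Hypothetical helper: sectors left before resync_max is reached,
 * clamped to zero when curr_resync_completed has already passed it.
 */
uint64_t toy_sectors_until_max(uint64_t resync_max,
                               uint64_t curr_resync_completed)
{
    if (curr_resync_completed > resync_max)
        return 0;
    return resync_max - curr_resync_completed;
}
```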

Signed-off-by: NeilBrown <neilb@suse.com>


# a4a3d26d 16-Jul-2015 NeilBrown <neilb@suse.com>

md: set MD_RECOVERY_RECOVER when starting a degraded array.

This ensures that 'sync_action' will show 'recover' immediately the
array is started. If there is no spare the status will change to
'idle' once that is detected.

Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
happens.

This allows scripts which monitor status not to get confused -
particularly my test scripts.

Signed-off-by: NeilBrown <neilb@suse.com>


# 985ca973 05-Jul-2015 NeilBrown <neilb@suse.com>

md: close some races between setting and checking sync_action.

When checking sync_action in a script, we want to be sure it is
as accurate as possible.
As resync/reshape etc doesn't always start immediately (a separate
thread is scheduled to do it), it is best if 'action_show'
checks if MD_RECOVER_NEEDED is set (which it does) and in that
case reports what is likely to start soon (which it only sometimes
does).

So:
- report 'reshape' if reshape_position suggests one might start.
- set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
likely to happen next.

Signed-off-by: NeilBrown <neilb@suse.com>


# f7851be7 02-Jul-2015 NeilBrown <neilb@suse.com>

md: Keep /proc/mdstat reporting recovery until fully DONE.

Currently when a recovery completes, mdstat shows that it has finished
before the new device is marked as a full member. Because of this it
can appear to a script that the recovery finished but the array isn't
in sync.

So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
Once md_reap_sync_thread() completes, the spare will be active and then
MD_RECOVERY_DONE will be cleared.

To ensure this is race-free, set MD_RECOVERY_DONE before clearing
curr_resync.

Signed-off-by: NeilBrown <neilb@suse.com>


# 8ae12666 28-Apr-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: kill merge_bvec_fn() completely

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 54efd50b 23-Apr-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: make generic_make_request handle arbitrarily sized bios

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 25eafe1a 25-Jul-2015 Benjamin Randazzo <benjamin@randazzo.fr>

md: simplify get_bitmap_file now that "file" is zeroed.

There is no point assigning '\0' to file->pathname[0] as
file is now zeroed out, so remove that branch and
simplify the code.

[Original patch combined this with the change to use
kzalloc. I split the two so that the change to kzalloc
is easier to backport. - neilb]

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>


# b6878d9e 25-Jul-2015 Benjamin Randazzo <benjamin@randazzo.fr>

md: use kzalloc() when bitmap is disabled

In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
mdu_bitmap_file_t called "file".

    file = kmalloc(sizeof(*file), GFP_NOIO);
    if (!file)
        return -ENOMEM;

This structure is copied to user space at the end of the function.

    if (err == 0 &&
        copy_to_user(arg, file, sizeof(*file)))
        err = -EFAULT;

But if bitmap is disabled only the first byte of "file" is initialized
with zero, so it's possible to read some bytes (up to 4095) of kernel
space memory from user space. This is an information leak.

    /* bitmap disabled, zero the first byte and copy out */
    if (!mddev->bitmap_info.file)
        file->pathname[0] = '\0';
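
The effect of switching to a zeroing allocator can be illustrated in userspace, with calloc() standing in for kzalloc(). The struct below is a toy stand-in for mdu_bitmap_file_t; its 4096-byte pathname is inferred from the "up to 4095" bytes mentioned above, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for mdu_bitmap_file_t (size is an assumption). */
struct toy_bitmap_file {
    char pathname[4096];
};

/* calloc() plays the role of kzalloc(): every byte starts out as zero. */
struct toy_bitmap_file *toy_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct toy_bitmap_file));
}

/* Counts the non-zero bytes that a whole-structure copy would expose. */
size_t toy_count_nonzero(const struct toy_bitmap_file *f)
{
    size_t n = 0;

    for (size_t i = 0; i < sizeof(f->pathname); i++)
        if (f->pathname[i] != '\0')
            n++;
    return n;
}
```

With a zeroed allocation nothing stale can be copied out, even if the code never writes to the buffer.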

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>


# 4246a0b6 20-Jul-2015 Christoph Hellwig <hch@lst.de>

block: add a bi_error field to struct bio

Currently we have two different ways to signal an I/O error on a BIO:

(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not being persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b0c26a79 21-Jul-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md: Return error if request_module fails and returns positive value

request_module() can return 256 (process exited) in some cases,
which is not as specified in the documentation before the
request_module() definition. Convert the error to -ENOENT.

The positive error number results in bitmap_create() returning
a value that is meant to be an error but doesn't look like one,
so it is dereferenced as a pointer and causes a crash.
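
The conversion can be sketched as a small standalone helper. The function name is hypothetical and this is not the patch itself; it only shows the idea that any positive status is folded into -ENOENT so that callers only ever see zero or a negative errno.

```c
#include <errno.h>

/*
 * Hypothetical wrapper: request_module()-style results that are
 * positive (for example 256) are reported as -ENOENT; zero and
 * negative errno values pass through unchanged.
 */
int toy_normalize_module_status(int ret)
{
    return ret > 0 ? -ENOENT : ret;
}
```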

(not needed for stable as this is "experimental" code)
Fixes: edb39c9deda8 ("Introduce md_cluster_operations to handle cluster functions")
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# ee5d004f 21-Jul-2015 NeilBrown <neilb@suse.com>

md: flush ->event_work before stopping array.

The 'event_work' worker used by dm-raid may still be running
when the array is stopped. This can result in an oops.

So flush the workqueue on which it is run after detaching
and before destroying the device.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release)
Fixes: 9d09e663d550 ("dm: raid456 basic support")


# 90a9befb 25-Jun-2015 Rasmus Villemoes <linux@rasmusvillemoes.dk>

drivers/md/md.c: use strreplace()

There's no point in starting over when we meet a '/'. This also
eliminates a stack variable and a little .text.
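
For illustration, a single-pass replacement of this kind looks roughly like the userspace sketch below. It is not the kernel's strreplace() and its name is hypothetical; it only shows that one walk over the string is enough, with no need to restart the search after each match.

```c
/*
 * Userspace sketch: replace every occurrence of 'old_ch' in 's' with
 * 'new_ch' in a single pass over the string.
 */
void toy_strreplace(char *s, char old_ch, char new_ch)
{
    for (; *s != '\0'; s++)
        if (*s == old_ch)
            *s = new_ch;
}
```

Called as toy_strreplace(name, '/', '!'), it would rewrite every '/' in a device name; the choice of '!' as the replacement character is an assumption, since the commit message does not name it.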

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab16bfc7 16-Jun-2015 Neil Brown <neilb@suse.de>

md: clear Blocked flag on failed devices when array is read-only.

The Blocked flag indicates that a device has failed but that this
fact hasn't been recorded in the metadata yet. Writes to such
devices cannot be allowed until the metadata has been updated.

On a read-only array, the Blocked flag will never be cleared.
This prevents the device being removed from the array.

If the metadata is being handled by the kernel
(i.e. !mddev->external), then we can be sure that if the array is
switched to writable, then a metadata update will happen and will
record the failure. So we don't need the flag set.

If metadata is externally managed, it is up to the external manager
to clear the 'blocked' flag.

Reported-by: XiaoNi <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 9a8c0fa8 25-Jun-2015 NeilBrown <neilb@suse.de>

md: unlock mddev_lock on an error path.

This error path returns while still holding the lock - bad.

Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
Signed-off-by: NeilBrown <neilb@suse.com>


# bd691922 25-Jun-2015 NeilBrown <neilb@suse.de>

md: clear mddev->private when it has been freed.

If ->private is set when ->run is called, it is assumed to be
a 'config' prepared as part of 'reshape'.

So it is important when we free that config, that we also clear ->private.
This is not often a problem as the mddev will normally be discarded
shortly after the config is freed.
However if an 'assemble' races with a final close, the assemble can use
the old mddev which has a stale ->private. This leads to any of
various sorts of crashes.

So clear ->private after calling ->free().

Reported-by: Nate Clark <nate@neworld.us>
Cc: stable@vger.kernel.org (v4.0+)
Fixes: afa0f557cb15 ("md: rename ->stop to ->free")
Signed-off-by: NeilBrown <neilb@suse.com>


# 9bf39ab2 19-Jun-2015 Miklos Szeredi <mszeredi@suse.cz>

vfs: add file_path() helper

Turn
d_path(&file->f_path, ...);
into
file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4e023612 10-Jun-2015 Firo Yang <firogm@gmail.com>

md: fix a build warning

Warning like this:

drivers/md/md.c: In function "update_array_info":
drivers/md/md.c:6394:26: warning: logical not is only applied
to the left hand side of comparison [-Wlogical-not-parentheses]
!mddev->persistent != info->not_persistent||

Fix it as Neil Brown said:
mddev->persistent != !info->not_persistent ||

Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
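
To make the precedence issue concrete, here is a small stand-alone sketch.
The two helpers are hypothetical and take plain ints in place of
mddev->persistent and info->not_persistent. The old form is parsed as
(!persistent) != not_persistent; the two forms agree whenever both values
are 0 or 1, which is presumably why the old code worked in practice.

```c
#include <stdbool.h>

/* Old form, with the implicit parse made explicit by parentheses. */
static bool demo_mismatch_old(int persistent, int not_persistent)
{
	return (!persistent) != not_persistent;
}

/* New form from the fix: negate the right-hand operand instead. */
static bool demo_mismatch_new(int persistent, int not_persistent)
{
	return persistent != !not_persistent;
}
```

The forms can differ once a value is neither 0 nor 1, for example
persistent == 2 with not_persistent == 0.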


# 4c9309c0 16-May-2015 Alexey Dobriyan <adobriyan@gmail.com>

md: convert to kstrto*()

Convert away from deprecated simple_strto*() functions.

Add "fit into sector_t" checks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
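
The kstrto*() helpers are kernel-only, so the following is a user-space
analogue built on strtoull(), not the patch itself. It sketches the two
properties the message mentions: strict parsing (the whole string must be a
number, optionally followed by one newline) and a "fits into the sector
type" check. demo_sector_t is a hypothetical 32-bit type chosen so the range
check is easy to exercise.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t demo_sector_t;	/* hypothetical narrow sector type */

/* Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow. */
static int demo_parse_sector(const char *s, demo_sector_t *out)
{
	char *end;
	unsigned long long v;

	/* Reject empty strings, signs and leading whitespace up front. */
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	/* Allow a single trailing newline, as sysfs writes often have one. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* The "fit into sector_t" check: the value must survive the cast. */
	if (v != (demo_sector_t)v)
		return -ERANGE;

	*out = (demo_sector_t)v;
	return 0;
}
```

Unlike simple_strto*(), this style of parser rejects input such as "12abc"
rather than silently returning 12.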


# ea358cd0 12-Jun-2015 NeilBrown <neilb@suse.de>

md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync

MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished. However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared. This
can lead to multiple reshapes running at the same time, which isn't
good.

So make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.

Signed-off-by: NeilBrown <neilb@suse.de>


# 8e8e2518 12-Jun-2015 NeilBrown <neilb@suse.de>

md: Close race when setting 'action' to 'idle'.

Checking ->sync_thread without holding the mddev_lock()
isn't really safe, even after flushing the workqueue which
ensures md_start_sync() has been run.

While this code is waiting for the lock, md_check_recovery could reap
the thread itself, and then start another thread (e.g. recovery might
finish, then reshape starts). When this thread gets the lock
md_start_sync() hasn't run so it doesn't get reaped, but
MD_RECOVERY_RUNNING gets cleared. This allows two threads to start
which leads to confusion.

So don't bother if MD_RECOVERY_RUNNING isn't set, but if it is do
the flush and the test and the reap all under the mddev_lock to
avoid any race with md_check_recovery.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)


# c008f1d3 12-Jun-2015 NeilBrown <neilb@suse.de>

md: don't return 0 from array_state_store

Returning zero from a 'store' function is bad.
The return value should be either the length of the string
or an error.

So use 'len' if 'err' is zero.

Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.")
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel (v4.0+)
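
A minimal sketch of the convention, assuming a hypothetical store-style
function that accepts only the word "clear": on success it reports the full
length as consumed, and on failure a negative errno. It never returns 0,
since a writer would see that as "nothing written" and may simply retry.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical store handler; only the return convention matters here. */
static ssize_t demo_state_store(const char *buf, size_t len)
{
	int err = 0;

	if (len < 5 || strncmp(buf, "clear", 5) != 0)
		err = -EINVAL;

	/* Use 'len' if 'err' is zero, otherwise propagate the error. */
	return err ? err : (ssize_t)len;
}
```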


# 56ccc112 28-May-2015 NeilBrown <neilb@suse.de>

md: fix race when unfreezing sync_action

A recent change removed the need for locking around writing
to "sync_action" (and various other places), but introduced a
subtle race.
When e.g. setting 'reshape' on a 'frozen' array, the 'frozen'
flag is cleared before 'reshape' is set, so the md thread can
get in and start trying recovery - which isn't wanted.

So instead of clearing MD_RECOVERY_FROZEN for any command
except 'frozen', only clear it when each specific command
is parsed. This allows the handling of 'reshape' to clear
the bit while a lock is held.

Also remove some places where we set MD_RECOVERY_NEEDED,
as it is always set on non-error exit of the function.


Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.")


# 6cd18e71 26-Apr-2015 NeilBrown <neilb@suse.de>

block: destroy bdi before blockdev is unregistered.

Because of the peculiar way that md devices are created (automatically
when the device node is opened), a new device can be created and
registered immediately after the
blk_unregister_region(disk_devt(disk), disk->minors);
call in del_gendisk().

Therefore it is important that all visible artifacts of the previous
device are removed before this call. In particular, the 'bdi'.

Since:
commit c4db59d31e39ea067c32163ac961e9c80198fd37
Author: Christoph Hellwig <hch@lst.de>
fs: don't reassign dirty inodes to default_backing_dev_info

moved the
device_unregister(bdi->dev);
call from bdi_unregister() to bdi_destroy() it has been quite easy to
lose a race and have a new (e.g.) "md127" be created after the
blk_unregister_region() call and before bdi_destroy() is ultimately
called by the final 'put_disk', which must come after del_gendisk().

The new device finds that the bdi name is already registered in sysfs
and complains

> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

We can fix this by moving the bdi_destroy() call out of
blk_release_queue() (which can happen very late when a refcount
reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
device driver calls it.

Then it is only necessary for md to call blk_cleanup_queue() before
del_gendisk(). As loop.c devices are also created on demand by
opening the device node, we make the same change there.

Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ac8fa419 18-Feb-2015 NeilBrown <neilb@suse.de>

md: allow resync to go faster when there is competing IO.

When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.

The default minimum might have made sense many years ago
but the drives have become faster. Changing the default
to match the times isn't really a long term solution.

This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.

Testing shows that:
- for some loads, the resync speed is unchanged. For those loads
increasing the minimum doesn't change the speed either.
So this is a good result. To increase resync speed under such
loads we would probably need to increase the resync window
size.

- for other loads, resync speed does increase to a reasonable
fraction (e.g. 20%) of maximum possible, and throughput of
the load only drops a little bit (e.g. 10%)

- for other loads, throughput of the non-sync load drops quite a bit
more. These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.

Signed-off-by: NeilBrown <neilb@suse.de>


# 09314799 18-Feb-2015 NeilBrown <neilb@suse.de>

md: remove 'go_faster' option from ->sync_request()

This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.

Signed-off-by: NeilBrown <neilb@suse.de>


# 50c37b13 23-Mar-2015 NeilBrown <neilb@suse.de>

md: don't require sync_min to be a multiple of chunk_size.

There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.

So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.

Signed-off-by: NeilBrown <neilb@suse.de>
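
For illustration, rounding down to a multiple of 4K can be sketched as
below, assuming the value is counted in 512-byte sectors so that 4K
corresponds to 8 sectors. The helper name and constant are hypothetical.

```c
#include <stdint.h>

#define DEMO_SECTORS_PER_4K 8ULL	/* assumes 512-byte sectors */

/* Round a sector count down to the nearest multiple of 4K. */
static uint64_t demo_round_sync_min(uint64_t sectors)
{
	return sectors & ~(DEMO_SECTORS_PER_4K - 1);
}
```

With this rounding, a value that is already 4K-aligned is returned
unchanged, so it can be read and then written back without error.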


# 97f6cd39 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md-cluster: re-add capabilities

When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
and copies them into the local bitmap. It does not clear the bitmap
from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a6da4ef8 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md: re-add a failed disk

This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.

This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 88bcfef7 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md-cluster: remove capabilities

This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.

This facilitates the removal of failed devices.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 57d051dc 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md: Export and rename find_rdev_nr_rcu

This is required by the clustering module (patches to follow) to
find the device to remove or re-add.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# fb56dfef 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md: Export and rename kick_rdev_from_array

This export is required for clustering module in order to
co-ordinate removing or re-adding an rdev across all nodes.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 74672d06 02-Apr-2015 Gu Zheng <guz.fnst@cn.fujitsu.com>

md: fix md io stats accounting broken

Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.

Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# fa8259da 02-Mar-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md: Fix stray --cluster-confirm crash

A --cluster-confirm without an --add (by another node) can
crash the kernel.

Fix it by guarding it using a state.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0c35bd47 12-Mar-2015 NeilBrown <neilb@suse.de>

md: fix problems with freeing private data after ->run failure.

If ->run() fails, it can either free the data structures it
allocated, or leave that task to ->free() which will be called
on failures.

However:
md.c calls ->free() even if ->private_data is NULL, which
causes problems in some personalities.
raid0.c frees the data, but doesn't clear ->private_data,
which will become a problem when we fix md.c

So better fix both these issues at once.

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 5aa61f427e4979be733e4847b9199ff9cc48a47e
URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381
Signed-off-by: NeilBrown <neilb@suse.de>


# ba599aca 24-Feb-2015 NeilBrown <neilb@suse.de>

md: fix error paths from bitmap_create.

Recent change to bitmap_create mishandles errors.
In particular a failure doesn't always cause 'err' to be set.

Signed-off-by: NeilBrown <neilb@suse.de>


# 750f199e 29-Sep-2014 NeilBrown <neilb@suse.de>

md: mark some attributes as pre-alloc

Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18
it can now be used by md.

This ensures that writing to these sysfs attributes will never
block due to a memory allocation.
Such blocking could become a deadlock if mdmon is trying to
reconfigure an array after a failure prior to re-enabling writes.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1aee41f6 29-Oct-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Add new disk to clustered array

Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
as SpareLocal
9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 589a1c49 07-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Suspend writes in RAID1 if within range

If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.

If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
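
The range test described above can be sketched in user space as follows.
This is an assumption-laden illustration: the struct, the array in place of
the kernel's linked list, and the half-open [lo, hi) convention are all
hypothetical choices, not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a range that another node is resyncing. */
struct demo_suspend_info {
	uint64_t lo;	/* first sector of the range */
	uint64_t hi;	/* one past the last sector (assumed half-open) */
};

/* Return 1 if the I/O range [io_lo, io_hi) overlaps any suspended range. */
static int demo_should_suspend(const struct demo_suspend_info *list,
			       size_t n, uint64_t io_lo, uint64_t io_hi)
{
	for (size_t i = 0; i < n; i++)
		if (io_lo < list[i].hi && io_hi > list[i].lo)
			return 1;
	return 0;
}
```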


# 965400eb 07-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Send RESYNCING while performing resync start/stop

When a resync is initiated, RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
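
A small sketch of the start/stop convention, using a hypothetical message
struct that carries only the (lo, hi) pair. The real message format is not
shown in this log, so everything beyond "hi == 0 means finished" is an
assumption.

```c
#include <stdint.h>

/* Hypothetical payload of a RESYNCING message. */
struct demo_resync_msg {
	uint64_t lo;
	uint64_t hi;
};

/* Message announcing that a resync of [lo, hi] has started. */
static struct demo_resync_msg demo_resync_start(uint64_t lo, uint64_t hi)
{
	struct demo_resync_msg m = { lo, hi };
	return m;
}

/* Message announcing that the resync is over: the range is (0, 0). */
static struct demo_resync_msg demo_resync_stop(void)
{
	struct demo_resync_msg m = { 0, 0 };
	return m;
}

/* A high sector value of zero indicates that the resync is over. */
static int demo_resync_finished(const struct demo_resync_msg *m)
{
	return m->hi == 0;
}
```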


# 1d7e3e96 07-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Reload superblock if METADATA_UPDATED is received

Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using
events recorded in the devices. If it is old compared to the mddev,
mark it as faulty.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 293467aa 07-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

metadata_update sends message to other nodes

- request to send a message
- make changes to superblock
- send messages telling everyone that the superblock has changed
- other nodes all read the superblock
- other nodes all ack the messages
- updating node releases the "I'm sending a message" resource.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# f9209a32 05-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

bitmap_create returns bitmap pointer

This is done to have multiple bitmaps open at the same time.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 96ae923a 05-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Gather on-going resync information of other nodes

When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in its LVB. When a new
node joins, it reads the LVB of each "online" bitmap.

[TODO] The new node attempts to get the PW lock on other bitmap, if
it is successful, it reads the bitmap and performs the resync (if
required) on its behalf.

If the node does not get the PW, it requests CR and reads the LVB
for the resync information.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# cf921cc1 29-Mar-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Add node recovery callbacks

DLM offers callbacks when a node fails and the lock remastery
is performed:

1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot

recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# ca8895d9 25-Nov-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Return MD_SB_CLUSTERED if mddev is clustered

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# c4ce867f 29-Mar-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce md_cluster_info

md_cluster_info stores the cluster information in the MD device.

The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)

The leave() clears up the lockspace and all the locks held.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# edb39c9d 29-Mar-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce md_cluster_operations to handle cluster functions

This allows dynamic registering of cluster hooks.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 6791875e 14-Dec-2014 NeilBrown <neilb@suse.de>

md: make reconfig_mutex optional for writes to md sysfs files.

Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.

This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).

Signed-off-by: NeilBrown <neilb@suse.de>


# 5c47daf6 14-Dec-2014 NeilBrown <neilb@suse.de>

md: move mddev_lock and related to md.h

The one which is not inline (mddev_unlock) gets EXPORTed.

This makes the locking available to personality modules so that it
doesn't have to be imposed upon them.

Signed-off-by: NeilBrown <neilb@suse.de>


# 23da422b 14-Dec-2014 NeilBrown <neilb@suse.de>

md: use mddev->lock to protect updates to resync_{min,max}.

There are interdependencies between these two sysfs attributes
and whether a resync is currently running.

Rather than depending on reconfig_mutex to ensure no races when
testing these interdependencies are met, use the spinlock.
This will allow the mutex to be removed from protecting this
code in a subsequent patch.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1b30e66f 14-Dec-2014 NeilBrown <neilb@suse.de>

md: minor cleanup in safe_delay_store.

There isn't really much room for races with ->safemode_delay.
But as I am trying to clean up any racy code and will soon
be removing reconfig_mutex protection from most _store()
functions:
- only set mddev->safemode_delay once, to ensure no code
can see an intermediate value
- use safemode_timer to call md_safemode_timeout() rather than
calling it directly, to ensure it never races with itself.

Signed-off-by: NeilBrown <neilb@suse.de>


# 4af1a041 14-Dec-2014 NeilBrown <neilb@suse.de>

md: move GET_BITMAP_FILE ioctl out from mddev_lock.

It makes more sense to report bitmap_info->file, rather than
bitmap->file (the latter is only available once the array is
active).

With that change, use mddev->lock to protect bitmap_info being
set to NULL, and we can call get_bitmap_file() without taking
the mutex.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1e594bb2 14-Dec-2014 NeilBrown <neilb@suse.de>

md: tidy up set_bitmap_file

1/ delay setting mddev->bitmap_info.file until 'f' looks
usable, so we don't have to unset it.
2/ Don't allow bitmap file to be set if bitmap_info.file
is already set.

Signed-off-by: NeilBrown <neilb@suse.de>


# f4ad3d38 14-Dec-2014 NeilBrown <neilb@suse.de>

md: remove unnecessary 'buf' from get_bitmap_file.

'buf' is only used because d_path fills from the end of the
buffer instead of from the start.
We don't need a separate buf to handle that, we just need to use
memmove() to move the string to the start.

Signed-off-by: NeilBrown <neilb@suse.de>
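
The idea can be sketched in user space as below. demo_fill_from_end() is a
hypothetical stand-in for a d_path()-like function that builds its result
at the end of the caller's buffer; the caller then uses memmove(), which
tolerates overlapping regions, to shift the string to the start of the same
buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for d_path(): copy 'name' so that it ends at the
 * last byte of 'buf', and return a pointer to where the string begins.
 * Returns NULL if the buffer is too small.
 */
static char *demo_fill_from_end(const char *name, char *buf, size_t buflen)
{
	size_t n = strlen(name) + 1;	/* include the terminating NUL */

	if (n > buflen)
		return NULL;
	return memcpy(buf + buflen - n, name, n);
}

/* Leave the path at the start of 'buf' without a second buffer. */
static int demo_get_path(const char *name, char *buf, size_t buflen)
{
	char *p = demo_fill_from_end(name, buf, buflen);

	if (!p)
		return -1;
	/* Source and destination may overlap, so memcpy() would be unsafe. */
	memmove(buf, p, strlen(p) + 1);
	return 0;
}
```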


# 758bfc8a 14-Dec-2014 NeilBrown <neilb@suse.de>

md: remove mddev_lock from rdev_attr_show()

No rdev attributes need locking for 'show', though
state_show() might benefit from ensuring it sees a
consistent set of flags.

None even use rdev->mddev, so testing for it isn't really
needed and it certainly doesn't need to be held constant.

So improve state_show() and remove the locking.

Signed-off-by: NeilBrown <neilb@suse.de>


# b7b17c9b 14-Dec-2014 NeilBrown <neilb@suse.de>

md: remove mddev_lock() from md_attr_show()

Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.

We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.

Signed-off-by: NeilBrown <neilb@suse.de>


# f97fcad3 14-Dec-2014 NeilBrown <neilb@suse.de>

md: remove need for mddev_lock() in md_seq_show()

The only access in md_seq_show that could suffer from races
not protected by ->lock is walking the rdev list.
This can receive sufficient protection from 'rcu'.

So use rdev_for_each_rcu() and get rid of mddev_lock().

Now reading /proc/mdstat will never block in md_seq_show.

Signed-off-by: NeilBrown <neilb@suse.de>


# 36d091f4 14-Dec-2014 NeilBrown <neilb@suse.de>

md: protect ->pers changes with mddev->lock

->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.

However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight: level_show and md_seq_show().

So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.

[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>


# db721d32 14-Dec-2014 NeilBrown <neilb@suse.de>

md: level_store: group all important changes into one place.

Gather all the changes that can happen atomically and might
be relevant to other code into one place. This will
make it easier to refine the locking.

Note that this puts quite a few things between mddev_detach()
and ->free(). Enabling this was the point of some recent patches.

Signed-off-by: NeilBrown <neilb@suse.de>


# afa0f557 14-Dec-2014 NeilBrown <neilb@suse.de>

md: rename ->stop to ->free

Now that the ->stop function only frees the private data,
rename it accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().

Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5aa61f42 14-Dec-2014 NeilBrown <neilb@suse.de>

md: split detach operation out from ->stop.

Each md personality has a 'stop' operation which does two
things:
1/ it finalizes some aspects of the array to ensure nothing
is accessing the ->private data
2/ it frees the ->private data.

All the steps in '1' can apply to all arrays and so can be
performed in common code.

This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.

So split the 'step 1' functionality out into a new mddev_detach().

Signed-off-by: NeilBrown <neilb@suse.de>


# 64590f45 14-Dec-2014 NeilBrown <neilb@suse.de>

md: make merge_bvec_fn more robust in face of personality changes.

There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5c675f83 14-Dec-2014 NeilBrown <neilb@suse.de>

md: make ->congested robust against personality changes.

There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by including the whole call inside an rcu_read_lock()
region.
This requires that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>


# 85572d7c 14-Dec-2014 NeilBrown <neilb@suse.de>

md: rename mddev->write_lock to mddev->lock

This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further. So the
name is inappropriate.

Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taken from interrupt or bh context.

So:
-rename write_lock to lock
-document what it protects
-remove _irq ... except in md_flush_request() as there
is no wait_event_lock() (with no _irq). This can be
cleaned up after appropriate changes to wait.h.

Signed-off-by: NeilBrown <neilb@suse.de>


# f851b60d 10-Dec-2014 NeilBrown <neilb@suse.de>

md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.

A recent change to md started the ->sync_thread asynchronously
from a work_queue rather than synchronously. This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.

So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.

Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well. To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.

Other places are better served by flushing the relevant workqueue
to ensure that if the sync thread was starting, it has now
started. This is particularly best if we are about to stop the
sync thread.

Fixes: ac05f256691fe427a3e84c19261adb0b67dd73c0
Signed-off-by: NeilBrown <neilb@suse.de>


# 7d7e64f2 02-Dec-2014 kbuild test robot <fengguang.wu@intel.com>

md: fix semicolon.cocci warnings

drivers/md/md.c:7175:43-44: Unneeded semicolon

Removes unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 18c0b223 23-Nov-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

md: use generic io stats accounting functions to simplify io stat accounting

Use generic io stats accounting helper functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 45eaf45d 28-Oct-2014 NeilBrown <neilb@suse.de>

md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN

md_check_recovery will skip any recovery and also clear
MD_RECOVERY_NEEDED if MD_RECOVERY_FROZEN is set.
So when we clear _FROZEN, we must set _NEEDED and ensure that
md_check_recovery gets run.
Otherwise we could miss out on something that is needed.

In particular, this can make it impossible to remove a
failed device from an array if the 'recovery-needed' processing
didn't happen.
Suitable for stable kernels since 3.13.

Cc: stable@vger.kernel.org (3.13+)
Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Fixes: 30b8feb730f9b9b3c5de02580897da03f59b6b16
Signed-off-by: NeilBrown <neilb@suse.de>
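
The interaction between the two flags can be modelled with a toy sketch.
The bit values and function names below are hypothetical, and the "check"
function only mimics the behaviour described in the message: while FROZEN
is set it drops NEEDED and does nothing, so unfreezing has to set NEEDED
again or the pending work is lost.

```c
#define DEMO_RECOVERY_NEEDED (1UL << 0)	/* hypothetical bit positions */
#define DEMO_RECOVERY_FROZEN (1UL << 1)

/* Toy model of the check: returns 1 if recovery work would be started. */
static int demo_check_recovery(unsigned long *recovery)
{
	if (*recovery & DEMO_RECOVERY_FROZEN) {
		*recovery &= ~DEMO_RECOVERY_NEEDED;	/* request is dropped */
		return 0;
	}
	if (*recovery & DEMO_RECOVERY_NEEDED) {
		*recovery &= ~DEMO_RECOVERY_NEEDED;
		return 1;
	}
	return 0;
}

/* Clearing FROZEN must always be paired with setting NEEDED. */
static void demo_unfreeze(unsigned long *recovery)
{
	*recovery &= ~DEMO_RECOVERY_FROZEN;
	*recovery |= DEMO_RECOVERY_NEEDED;
}
```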


# 6c144d31 30-Sep-2014 NeilBrown <neilb@suse.de>

md: move EXPORT_SYMBOL to after function in md.c

Signed-off-by: NeilBrown <neilb@suse.de>


# 2cbbca5e 30-Sep-2014 NeilBrown <neilb@suse.de>

md: discard PRINT_RAID_DEBUG ioctl

All the interesting information printed by this ioctl
is provided in /proc/mdstat and/or sysfs.
So it isn't needed and isn't used and would be best if it didn't
exist.

Signed-off-by: NeilBrown <neilb@suse.de>


# 403df478 29-Sep-2014 NeilBrown <neilb@suse.de>

md: remove MD_BUG()

Most of the places that call this are doing so pointlessly.
A couple of the others are best replaced with WARN_ON().

Signed-off-by: NeilBrown <neilb@suse.de>


# 3adc28d8 29-Sep-2014 NeilBrown <neilb@suse.de>

md: clean up 'exit' labels in md_ioctl().

There are 4 labels and we only really need two.

Signed-off-by: NeilBrown <neilb@suse.de>


# 326eb17d 29-Sep-2014 NeilBrown <neilb@suse.de>

md: remove unnecessary test for MD_MAJOR in md_ioctl()

unknown ioctls no longer get this deep into md_ioctl since
md_ioctl_valid() was introduced in 3.14.
So remove the test and the misleading comment.

Signed-off-by: NeilBrown <neilb@suse.de>


# e1960f8c 29-Sep-2014 NeilBrown <neilb@suse.de>

md: don't allow "-sync" to be set for device in an active array.

If an array is active, devices can be marked 'faulty', but simply
removing the 'sync' flag is wrong. That only makes sense
for an array which is not active (and is probably only useful
for testing anyway).

Signed-off-by: NeilBrown <neilb@suse.de>


# f72ffdd6 29-Sep-2014 NeilBrown <neilb@suse.de>

md: remove unwanted white space from md.c

My editor shows much of this is RED.

Signed-off-by: NeilBrown <neilb@suse.de>


# ac05f256 29-Sep-2014 NeilBrown <neilb@suse.de>

md: don't start resync thread directly from md thread.

The main 'md' thread is needed for processing writes, so if it blocks
write requests could be delayed.

Starting a new thread requires some GFP_KERNEL allocations and so can
wait for writes to complete. This can deadlock.

So instead, ask a workqueue to start the sync thread.
There is no particular rush for this to happen, so any work queue
will do.

MD_RECOVERY_RUNNING is used to ensure only one thread is started.

Reported-by: BillStuff <billstuff2001@sbcglobal.net>
Signed-off-by: NeilBrown <neilb@suse.de>


# 8b1afc3d 28-Sep-2014 NeilBrown <neilb@suse.de>

md: Just use RCU when checking for overlap between arrays.

We don't really need the full mddev_lock here, and having to
drop it is messy.
RCU is enough to protect these lists.

Signed-off-by: NeilBrown <neilb@suse.de>


# 50bd3774 25-Sep-2014 Chao Yu <chao@kernel.org>

md: avoid potential long delay under pers_lock

printk may take a long time if printk_delay in sysctl is configured to a
large value. If register_md_personality takes a long time to print while
holding the pers_lock spinlock, CPU usage can become high because other
pers_lock contenders are left spinning.
We can avoid this by moving the printk outside the region covered by the
pers_lock spinlock.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0638bb0e 25-Sep-2014 NeilBrown <neilb@suse.de>

md: simplify export_array()

We don't really need that for_each loop, or those MD_BUGs.

Signed-off-by: NeilBrown <neilb@suse.de>


# 4878e9eb 25-Sep-2014 NeilBrown <neilb@suse.de>

md: discard find_rdev_nr in favour of find_rdev_nr_rcu

Having both is a waste - just use the one.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1967cd56 08-Sep-2014 NeilBrown <neilb@suse.de>

md: use wait_event() to simplify md_super_wait()

md_super_wait is really just wait_event() open-coded.
So use the macro instead.

Signed-off-by: NeilBrown <neilb@suse.de>


# 9ba3b7f5 08-Sep-2014 NeilBrown <neilb@suse.de>

md: be more relaxed about stopping an array which isn't started.

In general we don't allow an array to be stopped if it is in use.
However if the array hasn't really been started yet, then any
apparent use is an anomaly, probably due to 'udev' or similar
having a look to see what is there.

This means that if something goes wrong while assembling an array
it cannot reliably be un-assembled - STOP_ARRAY could fail.
There is no value here, so change do_md_stop() to succeed
despite concurrent opens if the array has not yet been
activated. i.e. if ->pers is NULL.

Reported-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# d66b1b39 07-Aug-2014 NeilBrown <neilb@suse.de>

md: don't allow bitmap file to be added to raid0/linear.

An array can only accept a bitmap if it will call bitmap_daemon_work
periodically, which means it needs a thread running.

If there is no thread, don't allow a bitmap to be added.

Signed-off-by: NeilBrown <neilb@suse.de>


# ac7e50a3 07-Aug-2014 Xiao Ni <xni@redhat.com>

md: Recovery speed is wrong

When we calculate the speed of recovery, the numerator should contain
only the sectors whose recovery has completed. Sectors that have been
submitted but have not yet finished recovery need to be subtracted.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
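
The corrected calculation can be sketched as below. This is an illustration under stated assumptions, not the code in md.c: the function and parameter names are hypothetical, sectors are taken to be 512 bytes (so sectors / 2 gives KiB), and the two "+ 1" terms are only there to avoid a division by zero and a reported speed of zero.

```c
#include <stdint.h>

/*
 * Recovery speed in KiB/s since the last mark.
 * Assumes sectors_in_flight <= sectors_submitted and that the completed
 * count is not below sectors_at_mark.
 */
static uint64_t recovery_speed_kib(uint64_t sectors_submitted,
                                   uint64_t sectors_in_flight,
                                   uint64_t sectors_at_mark,
                                   uint64_t seconds_since_mark)
{
    /* Only count sectors whose recovery has actually completed. */
    uint64_t done = sectors_submitted - sectors_in_flight;

    return (done - sectors_at_mark) / 2 / (seconds_since_mark + 1) + 1;
}
```

Leaving out the subtraction of in-flight sectors makes the reported speed too high whenever requests are still outstanding.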


# af5628f0 30-Jul-2014 NeilBrown <neilb@suse.de>

md: disable probing for md devices 512 and over.

The way md devices are traditionally created in the kernel
is simply to open the device with the desired major/minor number.

This can be problematic as some support tools, notably udev and
programs run by udev, can open a device just to see what is there, and
find that it has created something. It is easy for a race to cause
udev to open an md device just after it was destroyed, causing it to
suddenly re-appear.

For some time we have had an alternate way to create md devices
echo md_somename > /sys/module/md_mod/parameters/new_array

This will always use a minor number of 512 or higher, which mdadm
normally avoids.
Using this makes the creation-by-opening unnecessary, but does
not disable it, so it is still there to cause problems.

This patch disables probing for devices with a major of 9 (MD_MAJOR)
and a minor of 512 and up. Thus devices created by writing to
new_array cannot be re-created by opening the node in /dev.

Signed-off-by: NeilBrown <neilb@suse.de>
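
The rule this commit introduces amounts to a simple predicate on the device number. A minimal sketch, assuming hypothetical names (md_probe_allowed, FIRST_NAMED_MINOR); only the major number 9 and the threshold 512 come from the message above.

```c
#include <stdbool.h>

#define MD_MAJOR 9              /* from the commit message */
#define FIRST_NAMED_MINOR 512   /* hypothetical name for the threshold */

/*
 * Returns true if opening the device node may create the array
 * (creation-by-opening), false if the array must already exist.
 */
static bool md_probe_allowed(unsigned int major, unsigned int minor)
{
    if (major == MD_MAJOR && minor >= FIRST_NAMED_MINOR)
        return false;   /* created via new_array only */
    return true;
}
```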


# 133d4527 01-Jul-2014 NeilBrown <neilb@suse.de>

md: flush writes before starting a recovery.

When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuild a
temporarily-missing device).

If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.

Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.

We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not. That
depends on details of the particular raid level and particular write
request.

This patch fixes a corruption issue of long standing and so it is
suitable for any -stable kernel. It applied correctly to 3.0 at
least and will apply with minor editing to earlier kernels.

Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>


# 9bd35920 01-Jul-2014 NeilBrown <neilb@suse.de>

md: make sure GET_ARRAY_INFO ioctl reports correct "clean" status

If an array has a bitmap, then when we set the "has bitmap" flag we
incorrectly clear the "is clean" flag.

"is clean" isn't really important when a bitmap is present, but it is
best to get it right anyway.

Reported-by: George Duffield <forumscollective@gmail.com>
Link: http://lkml.kernel.org/CAG__1a4MRV6gJL38XLAurtoSiD3rLBTmWpcS5HYvPpSfPR88UQ@mail.gmail.com
Fixes: 36fa30636fb84b209210299684e1be66d9e58217 (v2.6.14)
Signed-off-by: NeilBrown <neilb@suse.de>


# 8b32bf5e 27-May-2014 NeilBrown <neilb@suse.de>

md: md_clear_badblocks should return an error code on failure.

Julia Lawall and coccinelle report that md_clear_badblocks always
returns 0, despite appearing to have an error path.
The error path really should return an error code. ENOSPC is
reasonably appropriate.

Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.de>


# bd8839e0 27-May-2014 NeilBrown <neilb@suse.de>

md: refuse to change shape of array if it is active but read-only

read-only arrays should not be changed. This includes changing
the level, layout, size, or number of devices.

So reject those changes for readonly arrays.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2ac295a5 28-May-2014 NeilBrown <neilb@suse.de>

md: always set MD_RECOVERY_INTR when interrupting a reshape thread.

Commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
md: fix problem when adding device to read-only array with bitmap.

added a call to md_reap_sync_thread() which can cause a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.

This only happens when mddev->ro is set and normally reshape threads
don't run in that situation. But raid5 and raid10 can start a reshape
thread during "run" if the array is in the middle of a reshape.
They do this even if ->ro is set.

So it is best to set MD_RECOVERY_INTR before aborting the
sync thread, just in case.

Though it is rare for this to trigger a problem it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any stable which the offending commit was applied to.
(3.2 or later)

Fixes: 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>


# 3991b31e 27-May-2014 NeilBrown <neilb@suse.de>

md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".

If mddev->ro is set, md_do_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.

If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.

Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted. If the array is also set
read-only at this point, the reshape will instantly appear to succeed,
resulting in data corruption.

Consequently, this patch is suitable for any -stable kernel.

Cc: stable@vger.kernel.org (any)
Signed-off-by: NeilBrown <neilb@suse.de>
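
The failure mode and the fix can be illustrated with a small userspace model. This is a sketch under stated assumptions: struct array_state, do_sync() and finish_reshape() are hypothetical stand-ins, and the flag bit values are illustrative rather than the kernel's.

```c
#include <stdbool.h>

/* Illustrative bit values, not taken from the kernel headers. */
enum { RECOVERY_INTR = 1 << 0, RECOVERY_RESHAPE = 1 << 1 };

struct array_state {
    bool read_only;
    unsigned int recovery;   /* bit field of the flags above */
    bool reshape_committed;  /* set only when a reshape really completed */
};

/* Stand-in for the sync routine: aborts on a read-only array. */
static void do_sync(struct array_state *a)
{
    if (a->read_only) {
        a->recovery |= RECOVERY_INTR;  /* the fix: record the abort */
        return;
    }
    /* ... the real resync/reshape work would happen here ... */
}

/* Stand-in for ->finish_reshape(): must not commit an aborted reshape. */
static void finish_reshape(struct array_state *a)
{
    if ((a->recovery & RECOVERY_RESHAPE) && !(a->recovery & RECOVERY_INTR))
        a->reshape_committed = true;
}
```

Without the line that sets RECOVERY_INTR, the read-only case would fall through to finish_reshape() looking like a successful reshape.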


# 0f62fb22 05-May-2014 NeilBrown <neilb@suse.de>

md: avoid possible spinning md thread at shutdown.

If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:

1/ this is ineffective: user-space needs to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
and md_check_recovery enters an infinite loop.

Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.

Cc: stable@vger.kernel.org (any kernel)
Signed-off-by: NeilBrown <neilb@suse.de>


# e2f23b60 08-Apr-2014 NeilBrown <neilb@suse.de>

md: avoid oops on unload if some process is in poll or select.

If md-mod is unloaded while some process is in poll() or select(),
then that process maintains a pointer to md_event_waiters, and when
it tries to unlink from that list, it will oops.

The procfs infrastructure ensures that ->poll won't be called after
remove_proc_entry, but doesn't provide a wait_queue_head for us to
use, and the waitqueue code doesn't provide a way to remove all
listeners from a waitqueue.

So we need to:
1/ make sure no further references to md_event_waiters are taken (by
setting md_unloading)
2/ wake up all processes currently waiting, and
3/ wait until all those processes have disconnected from our
wait_queue_head.

Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 035328c2 08-Apr-2014 NeilBrown <neilb@suse.de>

md/bitmap: don't abuse i_writecount for bitmap files.

md bitmap code currently tries to use i_writecount to stop any other
process from writing to our bitmap file. But that is really an abuse
and has bit-rotted so locking is all wrong.

So discard that - root should be allowed to shoot self in foot.

Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different arrays, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write. Support for ->bmap is already checked
elsewhere).

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>


# cb335f88 15-Jan-2014 Nicolas Schichan <nschichan@freebox.fr>

md: check command validity early in md_ioctl().

Verify that the cmd parameter passed to md_ioctl() is valid before
doing anything.

This fixes mddev->hold_active being set to 0 when an invalid ioctl
command is passed to md_ioctl() before the array has been configured.

Clearing mddev->hold_active in that case can lead to a livelock
situation when an invalid ioctl number is given to md_ioctl() by a
process when the mddev is currently being opened by another process:

Process 1                               Process 2
---------                               ---------

md_alloc()
  mddev_find()
  -> returns a new mddev with
     hold_active == UNTIL_IOCTL
  add_disk()
  -> sends KOBJ_ADD uevent

                                        (sees KOBJ_ADD uevent for device)
                                        md_open()
                                        md_ioctl(INVALID_IOCTL)
                                        -> returns ENODEV and clears
                                           mddev->hold_active
                                        md_release()
                                          md_put()
                                          -> deletes the mddev as
                                             hold_active is 0

md_open()
  mddev_find()
  -> returns a newly
     allocated mddev with
     mddev->gendisk == NULL
-> returns with ERESTARTSYS
   (kernel restarts the open syscall)

Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: NeilBrown <neilb@suse.de>
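
The ordering this commit enforces, validate first and touch state afterwards, can be sketched as below. This is a hedged illustration: the command numbers, struct mddev_sketch, ioctl_valid() and handle_ioctl() are hypothetical, and the -ENOTTY return value is an assumption rather than something stated in the message above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical command numbers for illustration only. */
enum md_cmd { CMD_GET_INFO = 1, CMD_RUN_ARRAY, CMD_STOP_ARRAY };

struct mddev_sketch {
    int hold_active;   /* non-zero keeps the device alive until configured */
};

/* Whitelist of recognised commands. */
static bool ioctl_valid(unsigned int cmd)
{
    switch (cmd) {
    case CMD_GET_INFO:
    case CMD_RUN_ARRAY:
    case CMD_STOP_ARRAY:
        return true;
    default:
        return false;
    }
}

static int handle_ioctl(struct mddev_sketch *m, unsigned int cmd)
{
    if (!ioctl_valid(cmd))
        return -ENOTTY;   /* reject before any state is modified */

    m->hold_active = 0;   /* only a recognised command may drop the hold */
    return 0;
}
```

Because the check comes first, an unknown command leaves hold_active untouched and the race in the diagram above cannot start.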


# 830778a1 13-Jan-2014 NeilBrown <neilb@suse.de>

md: ensure metadata is written after raid level change.

level_store() currently does not make sure the metadata is
updated to reflect the new raid level. It simply sets MD_CHANGE_DEVS.

Any level with a ->thread will quickly notice this and update the
metadata. However RAID0 and Linear do not have a thread so no
metadata update happens until the array is stopped. At that point the
metadata is written.

This is later than we would like. While the delay doesn't risk any
data it can cause confusion. So if there is no md thread, immediately
update the metadata after a level change.

Reported-by: Richard Michael <rmichael@edgeofthenet.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 7eb41885 13-Jan-2014 NeilBrown <neilb@suse.de>

md: allow a partially recovered device to be hot-added to an array.

When adding a new device into an array it is normally important to
clear any stale data from ->recovery_offset else the new device may
not be recovered properly.

However when re-adding a device which is known to be nearly in-sync,
this is not needed and can be detrimental. The (bitmap-based)
resync will still happen, and further recovery is only needed from
where-ever it was already up to.

So if saved_raid_disk is set, signifying a re-add, don't clear
->recovery_offset.

Signed-off-by: NeilBrown <neilb@suse.de>


# f466722c 08-Dec-2013 NeilBrown <neilb@suse.de>

md: Change handling of saved_raid_disk and metadata update during recovery.

Since commit d70ed2e4fafdbef0800e739
MD: Allow restarting an interrupted incremental recovery.

we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.

At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".

Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.

After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.

However updating some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date devices
records that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously they don't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.

It really is wrong to not update all the metadata together. Doing that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.

With this we can remove the code that disables the write-out of
metadata on some devices.

So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and records that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.

- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.

Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 8313b8e5 11-Dec-2013 NeilBrown <neilb@suse.de>

md: fix problem when adding device to read-only array with bitmap.

If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.

If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.

However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update. We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.

This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag. If that is set,
then the device will not be added to a read-only array.

Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fafdbef0800e73942482bb075c21578b
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>


# 142d44c3 27-Nov-2013 NeilBrown <neilb@suse.de>

md: test mddev->flags more safely in md_check_recovery.

commit 7a0a5355cbc71efa md: Don't test all of mddev->flags at once.
made most tests on mddev->flags safer, but missed one.

When
commit 260fa034ef7a4ff8b7306 md: avoid deadlock when dirty buffers during md_stop.
added MD_STILL_CLOSED, this caused md_check_recovery to misbehave.
It can think there is something to do but find nothing. This can
lead to the md thread spinning during array shutdown.

https://bugzilla.kernel.org/show_bug.cgi?id=65721

Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 260fa034ef7a4ff8b7306
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: NeilBrown <neilb@suse.de>


# c170bbb4 24-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f024f37 11-Oct-2013 Kent Overstreet <kmo@daterainc.com>

block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>


# 33879d45 23-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>


# 82592c38 13-Nov-2013 Joe Perches <joe@perches.com>

md: Convert use of typedef ctl_table to struct ctl_table

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 30b8feb7 13-Nov-2013 NeilBrown <neilb@suse.de>

md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.

When raid5 recovery hits a fresh badblock, this badblock will be flagged as an
unack badblock until md_update_sb() is called.
But md_stop will take the reconfig lock, which means raid5d can't call
md_update_sb() in md_check_recovery(). The badblock will always
be unack, so the raid5d thread enters an infinite loop and md_stop_writes()
can never stop sync_thread. This causes deadlock.

To solve this, when the STOP_ARRAY ioctl is issued and sync_thread is
running, we need to set the md->recovery FROZEN and INTR flags and wait for
sync_thread to stop before we (re)take the reconfig lock.

This requires that raid5 reshape_request notices MD_RECOVERY_INTR
(which it probably should have noticed anyway) and stops waiting for a
metadata update in that case.

Reported-by: Jianpeng Ma <majianpeng@gmail.com>
Reported-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# c91abf5a 18-Nov-2013 NeilBrown <neilb@suse.de>

md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.

We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happens when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.

So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.

Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't have to wait for the speed to drop before we can abort.

Signed-off-by: NeilBrown <neilb@suse.de>


# 29f097c4 13-Nov-2013 NeilBrown <neilb@suse.de>

md: fix some places where mddev_lock return value is not checked.

Sometimes we need to lock an mddev and cannot cope with
failure due to interrupt.
In these cases we should use mutex_lock, not mutex_lock_interruptible.

Signed-off-by: NeilBrown <neilb@suse.de>


# 02e5f5c0 13-Nov-2013 NeilBrown <neilb@suse.de>

md: fix calculation of stacking limits on level change.

The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().

However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified. This can result in incorrect final
settings.

So call blk_set_stacking_limits() before ->run in level_store().

A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.

This is suitable for any -stable kernel since 3.3 in which
blk_set_stacking_limits() was introduced.

Cc: stable@vger.kernel.org (3.3+)
Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
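
Why stale limits matter can be shown with a deliberately simplified model. This is a sketch under stated assumptions, not the block layer's real stacking logic: struct limits, set_stacking_defaults() and run_level() are hypothetical, and the model assumes a personality only ever raises discard_granularity on top of whatever value is already there.

```c
/* Simplified queue limits; only one field is modelled. */
struct limits {
    unsigned int discard_granularity;
};

/* Stand-in for resetting the queue to permissive stacking defaults. */
static void set_stacking_defaults(struct limits *l)
{
    l->discard_granularity = 0;
}

/*
 * Stand-in for a personality's ->run: it stacks its own requirement on
 * top of the current limits. If reset_first is zero it starts from
 * whatever the previous level left behind.
 */
static unsigned int run_level(struct limits *top,
                              unsigned int level_granularity,
                              int reset_first)
{
    if (reset_first)
        set_stacking_defaults(top);
    if (level_granularity > top->discard_granularity)
        top->discard_granularity = level_granularity;
    return top->discard_granularity;
}
```

In this model a level change without the reset keeps the old, larger granularity, which mirrors the incorrect final settings described above.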


# 6678d83f 07-Aug-2013 Kent Overstreet <kmo@daterainc.com>

block: Consolidate duplicated bio_trim() implementations

Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 905b0297 11-Oct-2013 Bian Yu <bianyu@kedacom.com>

md: avoid deadlock when md_set_badblocks.

When operating on a hard disk and hitting errors, md_set_badblocks is called
after scsi_restart_operations, which has already disabled IRQs. But
md_set_badblocks calls write_sequnlock_irq, which enables IRQs, so a softirq
can preempt the current thread and that may cause a deadlock. I think this
situation should use write_seqlock_irqsave/write_sequnlock_irqrestore instead.

I hit this situation and the call trace is below:
[ 638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010
[ 638.921923] lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0
[ 638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37
[ 638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013
[ 638.927816] ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007
[ 638.929829] ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8
[ 638.931848] ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8
[ 638.933884] Call Trace:
[ 638.935867] <IRQ> [<ffffffff8172ff85>] dump_stack+0x55/0x76
[ 638.937878] [<ffffffff81730030>] spin_dump+0x8a/0x8f
[ 638.939861] [<ffffffff81730056>] spin_bug+0x21/0x26
[ 638.941836] [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0
[ 638.943801] [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80
[ 638.945747] [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0
[ 638.947672] [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50
[ 638.949595] [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0
[ 638.951504] [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0
[ 638.953388] [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140
[ 638.955248] [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90
[ 638.957116] [<ffffffff8104fddd>] __do_softirq+0xfd/0x330
[ 638.958987] [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[ 638.960861] [<ffffffff8174a5cc>] call_softirq+0x1c/0x30
[ 638.962724] [<ffffffff81004c7d>] do_softirq+0x8d/0xc0
[ 638.964565] [<ffffffff8105024e>] irq_exit+0x10e/0x150
[ 638.966390] [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60
[ 638.968223] [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80
[ 638.970079] <EOI> [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[ 638.971899] [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50
[ 638.973691] [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50
[ 638.975475] [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0
[ 638.977243] [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80
[ 638.978988] [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456]
[ 638.980723] [<ffffffff811b5a1d>] bio_endio+0x1d/0x40
[ 638.982463] [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0
[ 638.984214] [<ffffffff81306b9f>] blk_update_request+0x7f/0x350
[ 638.985967] [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90
[ 638.987710] [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50
[ 638.989439] [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30
[ 638.991149] [<ffffffff81308746>] blk_peek_request+0x106/0x250
[ 638.992861] [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130
[ 638.994561] [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0
[ 638.996251] [<ffffffff813040a7>] __blk_run_queue+0x37/0x50
[ 638.997900] [<ffffffff813045af>] blk_run_queue+0x2f/0x50
[ 638.999553] [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0
[ 639.001185] [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40
[ 639.002798] [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200
[ 639.004391] [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0
[ 639.005996] [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0
[ 639.007600] [<ffffffff81072f6b>] kthread+0xdb/0xe0
[ 639.009205] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170
[ 639.010821] [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0
[ 639.012437] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170

This bug was introduced in commit 2e8ac30312973dd20e68073653
(the first time rdev_set_badblocks was called from interrupt context),
so this patch is appropriate for 3.5 and subsequent kernels.

Cc: <stable@vger.kernel.org> (3.5+)
Signed-off-by: Bian Yu <bianyu@kedacom.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
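
The difference between the two locking flavours can be shown with a userspace simulation in which a plain boolean plays the role of the CPU's interrupt-enable state. This is only a model: lock_irq(), unlock_irq(), lock_irqsave() and unlock_irqrestore() are hypothetical analogues and no real locking takes place.

```c
#include <stdbool.h>

static bool irqs_enabled = true;   /* simulated interrupt-enable state */

/* Analogue of the _irq variants: disable on lock, enable on unlock. */
static void lock_irq(void)   { irqs_enabled = false; }
static void unlock_irq(void) { irqs_enabled = true; }

/* Analogue of the _irqsave/_irqrestore variants: remember and restore. */
static bool lock_irqsave(void)
{
    bool flags = irqs_enabled;
    irqs_enabled = false;
    return flags;
}

static void unlock_irqrestore(bool flags)
{
    irqs_enabled = flags;
}

/* Returns the interrupt state seen by a caller that had IRQs off. */
static bool caller_with_irqs_off(bool use_irqsave)
{
    irqs_enabled = false;            /* caller already disabled interrupts */
    if (use_irqsave) {
        bool flags = lock_irqsave();
        unlock_irqrestore(flags);    /* restores "disabled" */
    } else {
        lock_irq();
        unlock_irq();                /* bug: re-enables behind the caller */
    }
    return irqs_enabled;
}
```

With the _irq pair the caller comes back with interrupts enabled even though it had disabled them, which is the window in which the softirq in the trace above ran.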


# 388975cc 11-Sep-2013 Tejun Heo <tj@kernel.org>

sysfs: clean up sysfs_get_dirent()

The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention. For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.

This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns(). This makes confusions a lot less
likely.

There are other interfaces which take @ns before @name. They'll be
updated by following patches.

This patch doesn't introduce any functional changes.

v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
error on module builds. Reported by build test robot. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 260fa034 27-Aug-2013 NeilBrown <neilb@suse.de>

md: avoid deadlock when dirty buffers during md_stop.

When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.

However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.

So do_md_stop() calls sync_blockdev(). However at this point it is
holding ->reconfig_mutex. So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex. So we deadlock.

We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.

So introduce new flag MD_STILL_CLOSED. Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it has been cleared before we lock against further
opens.

It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl. Just don't do that.

Signed-off-by: NeilBrown <neilb@suse.de>
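
The MD_STILL_CLOSED handshake described in the commit above can be sketched
in plain C. This is a simplified, single-threaded model, not the code in
md.c: the struct, the ex_* helpers and the -EBUSY return value are
hypothetical, and all locking is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant mddev state. */
struct ex_array {
	bool still_closed;	/* models the MD_STILL_CLOSED flag */
	bool running;
};

/*
 * Step 1: record that nobody has opened the device since this point.
 * In the kernel, sync_blockdev() would be called right after this,
 * before ->reconfig_mutex is taken.
 */
static void ex_prepare_stop(struct ex_array *a)
{
	a->still_closed = true;
}

/* Any open of the device clears the flag. */
static void ex_open(struct ex_array *a)
{
	a->still_closed = false;
}

/*
 * Step 2: once further opens are locked out, stop the array only if
 * nobody opened the device after the flush (error code is an assumption).
 */
static int ex_do_stop(struct ex_array *a)
{
	if (!a->still_closed)
		return -EBUSY;
	a->running = false;
	return 0;
}
```

An open that slips in between the flush and the stop makes the stop attempt
fail rather than tear down an array that may have dirty pages again.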


# 7a0a5355 27-Aug-2013 NeilBrown <neilb@suse.de>

md: Don't test all of mddev->flags at once.

mddev->flags is mostly used to record if an update of the
metadata is needed. Sometimes the whole field is tested
instead of just the important bits. This makes it difficult
to introduce more state bits.

So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.

Signed-off-by: NeilBrown <neilb@suse.de>
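
A minimal sketch of why a bare test of the whole field stops working once
unrelated state bits are added. The EX_* bit numbers and the helper names
are hypothetical; the real definitions live in md.h.

```c
#include <stdbool.h>

/* Hypothetical bit numbers, for illustration only. */
enum {
	EX_CHANGE_DEVS    = 0,	/* metadata update needed: device change */
	EX_CHANGE_CLEAN   = 1,	/* metadata update needed: clean/dirty */
	EX_CHANGE_PENDING = 2,	/* metadata update needed: pending */
	EX_STILL_CLOSED   = 3,	/* state bit unrelated to metadata updates */
};

#define EX_UPDATE_MASK ((1UL << EX_CHANGE_DEVS) | \
			(1UL << EX_CHANGE_CLEAN) | \
			(1UL << EX_CHANGE_PENDING))

/* Bare test: any set bit looks like "metadata update needed". */
static bool ex_needs_update_bare(unsigned long flags)
{
	return flags != 0;
}

/* Masked test: only the bits that actually request an update count. */
static bool ex_needs_update_masked(unsigned long flags)
{
	return (flags & EX_UPDATE_MASK) != 0;
}
```

With only EX_STILL_CLOSED set, the bare test reports a pending update while
the masked test does not.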


# c9ad020f 19-Aug-2013 Dave Jones <davej@redhat.com>

md: Fix apparent cut-and-paste error in super_90_validate

Setting a variable to itself probably wasn't the intention here.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 275c51c4 07-Aug-2013 NeilBrown <neilb@suse.de>

md: fix safe_mode buglet.

When we set the safe_mode_timeout to a smaller value we trigger a timeout
immediately - otherwise the small value might not be honoured.
However if the previous timeout was 0 meaning "no timeout", we didn't.
This would mean that no timeout happens until the next write completes,
which could be a long time.

Signed-off-by: NeilBrown <neilb@suse.de>
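
The condition described above can be sketched as a small predicate. This is
an illustration under assumptions, not the code in md.c: the helper name is
hypothetical, delays are in arbitrary units, and a new delay of 0 (which
disables the timeout) is assumed to be handled elsewhere.

```c
#include <stdbool.h>

/*
 * Decide whether to trigger the safe-mode timeout immediately after the
 * delay is changed from old_delay to new_delay (new_delay > 0 assumed).
 * An old_delay of 0 means "no timeout" was previously armed.
 */
static bool ex_fire_timeout_now(unsigned long old_delay,
				unsigned long new_delay)
{
	return new_delay < old_delay || old_delay == 0;
}
```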


# 60559da4 16-Jul-2013 NeilBrown <neilb@suse.de>

md: don't call md_allow_write in get_bitmap_file.

There is no real need as GFP_NOIO is very likely sufficient,
and failure is not catastrophic.

Calling md_allow_write here will convert a read-auto array to
read/write which could be confusing when you are just performing
a read operation.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5024c298 16-Jul-2013 NeilBrown <neilb@suse.de>

md: Remove recent change which allows devices to skip recovery.

commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86
md: Allow devices to be re-added to a read-only array.

allowed a bit more than just that. It also allows devices to be added
to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a
subsequent release.

More specifically:
If the array has a bitmap, then the device will still need a bitmap
based resync ('saved_raid_disk' is set under different conditions
if a bitmap is present).
If the array doesn't have a bitmap, then this is correct as long as
nothing has been written to the array since the metadata was checked
by ->validate_super. However there is no locking to ensure that there
was no write.

Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# c4a39551 25-Jun-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Remember the last sync operation that was performed

This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
discrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
discrepancies repaired. etc.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# b29bebd6 01-Jun-2013 Jingoo Han <jg1.han@samsung.com>

md: replace strict_strto*() with kstrto*()

The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 90f5f7ad 02-Apr-2013 Hannes Reinecke <hare@suse.de>

md: Wait for md_check_recovery before attempting device removal.

When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.

So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run. This increases the chance that the ioctl will succeed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Neil Brown <nfbrown@suse.de>


# 6b6204ee 08-May-2013 NeilBrown <neilb@suse.de>

md: md_stop_writes() should always freeze recovery.

__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.

However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems for dm-raid.

So change __md_stop_writes() to always freeze recovery. This is safe
and more predictable.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# db2a144b 05-May-2013 Al Viro <viro@zeniv.linux.org.uk>

block_device_operations->release() should return void

The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 486adf72 23-Apr-2013 NeilBrown <neilb@suse.de>

md: bad block list should default to disabled.

Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.

So fix the defaults, and only enable it for v1.x metadata which
explicitly has bad blocks enabled.

The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.

Cc: stable@vger.kernel.org (3.1+)
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a91d5ac0 23-Apr-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Export 'md_reap_sync_thread' function

Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
- rename reap_sync_thread to md_reap_sync_thread
- move the fn after md_check_recovery to match md.h declaration placement
- export md_reap_sync_thread

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# b6d428c6 23-Apr-2013 NeilBrown <neilb@suse.de>

md: don't update metadata when stopping a read-only array.

read-only arrays should stay that way as much as possible.
Updating the metadata - which could be triggered by a re-add
while assembling the array metadata - should be avoided.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7ceb17e8 23-Apr-2013 NeilBrown <neilb@suse.de>

md: Allow devices to be re-added to a read-only array.

When assembling an array incrementally we might want to make
the device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.

We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.

Currently an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto. This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.

So:
- remove the "md_update_sb()" call from add_new_disk(). This isn't
really needed as just adding a disk doesn't require a metadata
update. Instead, just set MD_CHANGE_DEVS. This will effect a
metadata update soon enough, once the array is not read-only.

- Allow the ADD_NEW_DISK ioctl to succeed without activating a
read-auto array, providing the MD_DISK_SYNC flag is set.
In this case, the device will be rejected if it cannot be added
with the correct device number, or has an incorrect event count.

- Teach remove_and_add_spares() to be careful about adding spares
when the array is read-only (or read-mostly) - only add devices
that are thought to be in-sync, and only do it if the array is
in-sync itself.

- In md_check_recovery, use remove_and_add_spares in the read-only
case, rather than open coding just the 'remove' part of it.

Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3ea8929d 23-Apr-2013 NeilBrown <neilb@suse.de>

md: HOT_DISK_REMOVE shouldn't make a read-auto device active.

If a failed device or a spare is removed from an array, there is
no need to make the array 'active'. If/when the array does become
active for some other reason the metadata will be updated to reflect
the removal.
If that never happens and the array is stopped while still read-auto,
then there is no loss in forgetting that the device had 'failed'.

A read-only array will leave failed devices attached to
the array personality, so we need to explicitly call
remove_and_add_spares() to free it (clearing Blocked just
like we do in store_slot()).

Signed-off-by: NeilBrown <neilb@suse.de>


# 746d3207 23-Apr-2013 NeilBrown <neilb@suse.de>

md: use common code for all calls to ->hot_remove_disk()

slot_store and remove_and_add_spares both call ->hot_remove_disk(),
but with slightly different tests and consequences, which is
at least untidy and might be buggy.

So modify remove_and_add_spares() so that it can be asked
to remove a specific device, and call it from slot_store().

We also clear the Blocked flag to ensure that doesn't prevent
removal. The purpose of Blocked is to prevent automatic removal
by the kernel before an error is acknowledged.
If the array is read/write then user-space would have no reason
to remove a device unless it was known to be 'spare' or 'faulty', in
which case it would have already cleared the Blocked flag.
If the array is read-only, the flag might still be set, but
there is no harm in clearing the flag for read-only arrays.

Signed-off-by: NeilBrown <neilb@suse.de>


# d87f064f 23-Apr-2013 NeilBrown <neilb@suse.de>

md: never update metadata when array is read-only.

Normally we don't even try to update the metadata if
the array is read-only. However future patches
will increase the number of things that can happen on a read-only
array, so it is safest to explicitly disable this.

Every time that mddev->ro is set to 0, either
- md_update_sb will be called again (at least if MD_CHANGE_DEVS
is set) or
- the mddev->thread is scheduled, which will also run
md_update_sb if needed.

So this is safe: if the array ever becomes read-write the
metadata will be updated.

Signed-off-by: NeilBrown <neilb@suse.de>


# fb9e3534 26-Sep-2012 Kent Overstreet <koverstreet@google.com>

md: Convert md_trim_bio() to use bio_advance()

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Acked-by: NeilBrown <neilb@suse.de>


# 90584fc9 07-Mar-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Prevent sysfs operations on uninitialized kobjects

Device-mapper does not use sysfs; but when device-mapper is leveraging
MD's RAID personalities, MD sometimes attempts to update sysfs. This
patch adds checks for 'mddev->kobj.sd' in sysfs_[un]link_rdev to ensure
it is about to operate on something valid. This patch also checks for
'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
Although 'sysfs_notify' already makes this check, doing so in
'remove_and_add_spares' prevents an additional mutex operation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# f3378b48 27-Feb-2013 NeilBrown <neilb@suse.de>

md: expedite metadata update when switching read-auto -> active

If something has failed while the array was read-auto,
then when we switch to 'active' we need to update the metadata.
This will happen anyway but it is good to expedite it, and
also to ensure any failed device has been released by the
underlying device before we try to action the ioctl which
caused us to switch to 'active' mode.

Reported-by: Joe Lawrence <Joe.Lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a6468539 20-Feb-2013 NeilBrown <neilb@suse.de>

md: fix two bugs when attempting to resize RAID0 array.

You cannot resize a RAID0 array (in terms of making the devices
bigger), but the code doesn't entirely stop you.
So:

disable setting of the available size on each device for
RAID0 and Linear devices. This must not change as doing so
can change the effective layout of data.

Make sure that the size that raid0_size() reports is accurate,
by rounding device sizes to chunk sizes. As the device sizes
cannot change now, this isn't so important, but it is best to be
safe.

Without this change:
mdadm --grow /dev/md0 -z max
mdadm --grow /dev/md0 -Z max
then read to the end of the array

can cause a BUG in a RAID0 array.

These bugs have been present ever since it became possible
to resize any device, which is a long time. So the fix is
suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
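
The rounding of device sizes to chunk sizes mentioned above can be
illustrated as follows. This is a sketch with hypothetical names, not the
body of raid0_size(); it rounds each member down to a whole number of
chunks using a modulo so that it does not depend on the chunk size being a
power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the usable sectors of all member devices, counting only whole
 * chunks on each device.
 */
static uint64_t ex_raid0_size(const uint64_t *dev_sectors, size_t ndevs,
			      uint64_t chunk_sectors)
{
	uint64_t total = 0;

	assert(chunk_sectors > 0);
	for (size_t i = 0; i < ndevs; i++)
		total += dev_sectors[i] - dev_sectors[i] % chunk_sectors;
	return total;
}
```

For members of 1000 and 1030 sectors with 128-sector chunks, the usable
sizes are 896 and 1024 sectors, so the reported size is 1920 sectors.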


# bbfa57c0 20-Feb-2013 Sebastian Riemer <sebastian.riemer@profitbricks.com>

md: protect against crash upon fsync on ro array

If an fsync occurs on a read-only array, we need to send a
completion for the IO and may not increment the active IO count.
Otherwise, we hit a bug trace and can't stop the MD array anymore.

By advice of Christoph Hellwig we return success upon a flush
request but we return -EROFS for other writes.
We detect flush requests by checking if the bio has zero sectors.

This patch is suitable to any -stable kernel to which it applies.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Paul Menzel <paulepanter@users.sourceforge.net>
Signed-off-by: NeilBrown <neilb@suse.de>
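
The decision described in the commit above can be sketched as a small
filter. The function, its parameters and the EX_PASS value are hypothetical;
the real check operates on a struct bio inside md's request path.

```c
#include <errno.h>

/* Hypothetical result meaning "not intercepted, handle normally". */
#define EX_PASS 1

/*
 * Classify a request arriving at an array that may be read-only.
 * A write carrying zero sectors is treated as a flush.
 */
static int ex_ro_filter(int array_is_ro, int is_write, unsigned int sectors)
{
	if (!array_is_ro || !is_write)
		return EX_PASS;
	if (sectors == 0)
		return 0;	/* flush: nothing to write, report success */
	return -EROFS;		/* real write to a read-only array */
}
```

In both intercepted cases the request is completed immediately, so the
active IO count is never incremented for it.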


# 0a19caab 19-Nov-2012 majianpeng <majianpeng@gmail.com>

md: Use ->curr_resync as last completed request when cleanly aborting resync.

If a resync is aborted cleanly, ->curr_resync is a reliable
record of where we got up to.
If there was an error it is less reliable but we always know that
->curr_resync_completed is safe.

So add a flag MD_RECOVERY_ERROR to differentiate between these cases
and set recovery_cp accordingly.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 54f89341 30-Oct-2012 majianpeng <majianpeng@gmail.com>

md: Update checkpoint of resync/recovery based on time.

md will currently only checkpoint recovery or resync every 1/16th
of the device size. As devices get larger this can become a long time
and so a lot of work that might need to be duplicated after a shutdown.

So add a time-based checkpoint. Every 5 minutes limits the amount of
duplicated effort to at most 5 minutes, and has almost zero impact on
performance.

[changelog entry re-written by NeilBrown]

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
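
A minimal sketch of combining the distance-based and time-based checkpoint
rules described above. The struct, field names and the use of plain seconds
are assumptions made for illustration; the kernel tracks this with its own
fields and with jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_CHECKPOINT_SECS (5 * 60)	/* the 5 minute interval */

/* Hypothetical resync bookkeeping. */
struct ex_sync_state {
	uint64_t max_sectors;		/* total sectors to resync */
	uint64_t last_mark_sector;	/* position of last checkpoint */
	uint64_t last_mark_time;	/* time of last checkpoint, seconds */
};

/*
 * Return true (and record a new checkpoint) when either 1/16th of the
 * device has been covered since the last checkpoint, or 5 minutes have
 * passed since it was taken.
 */
static bool ex_should_checkpoint(struct ex_sync_state *s,
				 uint64_t pos, uint64_t now)
{
	bool by_distance = pos - s->last_mark_sector >= s->max_sectors / 16;
	bool by_time = now - s->last_mark_time >= EX_CHECKPOINT_SECS;

	if (!by_distance && !by_time)
		return false;
	s->last_mark_sector = pos;
	s->last_mark_time = now;
	return true;
}
```

On a very large device the distance rule rarely fires, so the time rule is
what bounds the work repeated after a shutdown.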


# 35d78c66 30-Oct-2012 kernelmail <kedacomkernel@gmail.com>

md:Add place to update ->recovery_cp.

During a resync, recovery_cp is only updated when the resync is aborted
or completed. But many places in the md drivers use it to make decisions.
So add a place to update it.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# c02c0aeb 10-Dec-2012 NeilBrown <neilb@suse.de>

md.c: re-indent various 'switch' statements.

Indent was unnecessarily deep.

Also change one 'switch' which has a single case element, into an
'if'.

Signed-off-by: NeilBrown <neilb@suse.de>


# a7a3f08d 10-Dec-2012 NeilBrown <neilb@suse.de>

md: close race between removing and adding a device.

When we remove a device from an md array, the final removal of
the "dev-XX" sys entry is run asynchronously.
If we then re-add that device immediately before the worker thread
gets to run, we can end up trying to add the "dev-XX" sysfs entry back
before it has been removed.

So in both places where we add a device, call
flush_workqueue(md_misc_wq);
before taking the md lock (as holding the md lock can prevent removal
from completing).

Signed-off-by: NeilBrown <neilb@suse.de>


# 1f3c9907 10-Dec-2012 NeilBrown <neilb@suse.de>

md: removed unused variable in calc_sb_1_csm.

'i' is unused.

NeilBrown <neilb@suse.de>


# eed8c02e 30-Nov-2012 Lukas Czerner <lczerner@redhat.com>

wait: add wait_event_lock_irq() interface

New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to the regular wait
includes, introduces a new macro wait_event_lock_irq_cmd() instead of using
the old method of omitting the cmd parameter, which is ugly, and makes use
of the new macros in MD. It also introduces the _interruptible_ variant.

The new interface is for use when one has a special lock to protect data
structures used in the condition, or when one also needs to invoke "cmd"
before going to sleep.

All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.

Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5eff3c43 18-Nov-2012 NeilBrown <neilb@suse.de>

md: make sure everything is freed when dm-raid stops an array.

md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only call it from dm-raid.
For internal arrays we now call __md_stop.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 35f9ac2d 07-Nov-2012 majianpeng <majianpeng@gmail.com>

md: Avoid write invalid address if read_seqretry returned true.

If read_seqretry returned true and bbp was changed, it will write
to an invalid address, which can cause serious problems.

This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.

Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# ab05613a 06-Nov-2012 majianpeng <majianpeng@gmail.com>

md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.

This bug was introduced by commit(v3.0-rc7-126-g2230dfe).
So fix is suitable for 3.0.y thru 3.6.y.

Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 83f0d77a 29-Oct-2012 Masanari Iida <standby24x7@gmail.com>

md: Fix typo in drivers/md

Correct spelling typo in drivers/md.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 72f36d59 10-Oct-2012 NeilBrown <neilb@suse.de>

md: refine reporting of resync/reshape delays.

If 'resync_max' is set to 0 (as is often done when starting a
reshape, so that mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same device is resyncing or reshaping etc,
then user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.

So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.

Signed-off-by: NeilBrown <neilb@suse.de>


# db07d85e 10-Oct-2012 NeilBrown <neilb@suse.de>

md: make sure manual changes to recovery checkpoint are saved.

If you make an array bigger but suppress resync of the new region with
mdadm --grow /dev/mdX --size=max --assume-clean

then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.

Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 48c26ddc 10-Oct-2012 NeilBrown <neilb@suse.de>

md: writing to sync_action should clear the read-auto state.

In some cases arrays are started in 'read-auto' state, wherein
nothing gets written to any device until the array is written
to. The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).

Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.

So allow any successful write to sync_action to clear any read-auto
state.

Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# 7f7583d4 10-Oct-2012 Jianpeng Ma <majianpeng@gmail.com>

Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races

Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 2863b9eb 10-Oct-2012 Jonathan Brassow <jbrassow@redhat.com>

MD RAID10: Prep for DM RAID10 device replacement capability

MD RAID10: Fix a couple potential kernel panics if RAID10 is used by dm-raid

When device-mapper uses the RAID10 personality through dm-raid.c, there is no
'gendisk' structure in mddev and some sysfs information is also not populated.

This patch avoids touching those non-existent structures.

Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 1ca69c4b 10-Oct-2012 NeilBrown <neilb@suse.de>

md: avoid taking the mutex on some ioctls.

Some ioctls don't need to take the mutex and doing so can cause
a delay as it is held during super-block update.
So move those ioctls out of the mutex and rely on rcu locking
to ensure we don't access stale data.

Signed-off-by: NeilBrown <neilb@suse.de>


# 4ed8731d 10-Oct-2012 Shaohua Li <shli@kernel.org>

MD: change the parameter of md thread

Change the thread parameter, so the thread can carry extra info. Next patch
will use it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 6dafab6b 18-Sep-2012 NeilBrown <neilb@suse.de>

md: make sure metadata is updated when spares are activated or removed.

It isn't always necessary to update the metadata when spares are
removed as the presence-or-not of a spare isn't really important to
the integrity of an array.
Also activating a spare doesn't always require updating the metadata
as the update on 'recovery-completed' is usually sufficient.

However the introduction of 'replacement' devices have made these
transitions sometimes more important. For example the 'Replacement'
flag isn't cleared until the original device is removed, so we need
to ensure a metadata update after that 'spare' is removed.

So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
complement the current situation where it is set when a spare is added
or a device is failed (or a number of other less common situations).

This is suitable for -stable as out-of-date metadata could lead
to data corruption.
This is only relevant for 3.3 and later (when 'replacement' was
introduced).

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# bf800ef1 06-Sep-2012 Kent Overstreet <koverstreet@google.com>

block: Add bio_clone_bioset(), bio_clone_kmalloc()

Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().

This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch added.

This will also help in a later patch changing how bio cloning works.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1e2a410f 06-Sep-2012 Kent Overstreet <koverstreet@google.com>

block: Ues bi_pool for bio_integrity_alloc()

Now that bios keep track of where they were allocated from,
bio_integrity_alloc_bioset() becomes redundant.

Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
related functions and make them use bio->bi_pool.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 395c72a7 06-Sep-2012 Kent Overstreet <koverstreet@google.com>

block: Generalized bio pool freeing

With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.

This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.

v6: Explain the temporary if statement in bio_put

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Nicholas Bellinger <nab@linux-iscsi.org>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 667a5313 16-Aug-2012 NeilBrown <neilb@suse.de>

md: Don't truncate size at 4TB for RAID0 and Linear

commit 27a7b260f71439c40546b43588448faac01adb93
md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.

changed 0.90 metadata handling to truncate the size to 4TB as that is
all that 0.90 can record.
However for RAID0 and Linear, 0.90 doesn't need to record the size, so
this truncation is not needed and causes working arrays to become too small.

So avoid the truncation for RAID0 and Linear

This bug was introduced in 3.1 and is suitable for any stable kernels
from then onwards.
As the offending commit was tagged for 'stable', any stable kernel
that it was applied to should also get this patch. That includes
at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
providing that list).

Cc: stable@vger.kernel.org
Signed-off-by: Neil Brown <neilb@suse.de>
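
The rule described above can be sketched as a clamp that is skipped for
RAID0 and Linear. This is an illustration under stated assumptions, not the
code in md.c: it assumes 512-byte sectors (so that 4TB corresponds to
2ULL << 32 sectors) and md's level numbering, in which RAID0 is level 0 and
Linear is a negative level. The names are hypothetical.

```c
#include <stdint.h>

/* Largest per-device size, in sectors, kept for 0.90 metadata (assumed). */
#define EX_MAX_090_SECTORS ((2ULL << 32) - 2)

/*
 * Clamp a device size for 0.90 metadata, except for levels that do not
 * need the size recorded (RAID0 and Linear, i.e. level < 1).
 */
static uint64_t ex_clamp_090_sectors(uint64_t sectors, int level)
{
	if (sectors >= (2ULL << 32) && level >= 1)
		return EX_MAX_090_SECTORS;
	return sectors;
}
```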


# 74018dc3 31-Jul-2012 NeilBrown <neilb@suse.de>

blk: pass from_schedule to non-request unplug functions.

This will allow md/raid to know why the unplug was called,
and to act accordingly - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9cbb1750 31-Jul-2012 NeilBrown <neilb@suse.de>

blk: centralize non-request unplug handling.

Both md and umem have similar code for getting notified on a
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0021b7bc 31-Jul-2012 NeilBrown <neilb@suse.de>

md: remove plug_cnt feature of plugging.

This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.

So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.

This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.

Removing this will make it easier to centralise the unplug code, and
will clear the way for other unplug enhancements which have a
measurable effect.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 90cf195d 30-Jul-2012 NeilBrown <neilb@suse.de>

md: remove duplicated test on ->openers when calling do_md_stop()

do_md_stop tests mddev->openers while holding ->open_mutex,
and fails if this count is too high.
So callers do not need to check mddev->openers and doing so isn't
very meaningful as they don't hold ->open_mutex so the number could
change.

So remove the unnecessary tests on mddev->openers.
These are not called often enough for there to be any gain in
an early test on ->open_mutex to avoid the need for a slightly more
costly mutex_lock call.

Signed-off-by: NeilBrown <neilb@suse.de>


# a05b7ea0 18-Jul-2012 NeilBrown <neilb@suse.de>

md: avoid crash when stopping md array races with closing other open fds.

md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted or when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.

However in order to send the STOP_ARRAY ioctl to stop the array one
must first get an open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm opens the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.

If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).

So in the case where do_md_stop is being called from an open file
descriptor, call sync_blockdev after taking the mutex to ensure there
will be no new openers.

This is needed when setting a read-write device to read-only too.

Cc: stable@vger.kernel.org
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 25f7fd47 18-Jul-2012 NeilBrown <neilb@suse.de>

md: fix bug in handling of new_data_offset

commit c6563a8c38fde3c1c7fc925a10bde3ca20799301
md: add possibility to change data-offset for devices.

introduced a 'new_data_offset' attribute which should normally
be the same as 'data_offset', but can be explicitly set to a different
value to allow a reshape operation to move the data.

Unfortunately when the 'data_offset' is explicitly set through
sysfs, the new_data_offset is not also set, so the two would become
out-of-sync incorrectly.

One result of this is that trying to set the 'size' after the
'data_offset' would fail because it is not permitted to set the size
when the 'data_offset' and 'new_data_offset' are different - as that
can be confusing.
Consequently when mdadm tried to do this while assembling an IMSM
array it would fail.

This bug was introduced in 3.5-rc1.

Reported-by: Brian Downing <bdowning@lavos.net>
Bisected-by: Brian Downing <bdowning@lavos.net>
Tested-by: Brian Downing <bdowning@lavos.net>
Signed-off-by: NeilBrown <neilb@suse.de>


# f4563091 02-Jul-2012 NeilBrown <neilb@suse.de>

md: support re-add of recovering devices.

We currently only allow a device to be re-added if it appears to be
in-sync. This is overly restrictive as it may be desirable to re-add
a device that is in the middle of recovery.

So remove the test for "InSync" - the test on rdev->raid_disk is
sufficient to ensure that the re-add will succeed.

Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0232605d 02-Jul-2012 NeilBrown <neilb@suse.de>

md: make 'name' arg to md_register_thread non-optional.

Having the 'name' arg optional and defaulting to the current
personality name is not necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.

So make it non-optional and always explicitly pass the name
of the level that the array will be.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 7c2c57c9 02-Jul-2012 majianpeng <majianpeng@gmail.com>

md:Add blk_plug in sync_thread.

Adding a blk_plug in sync_thread improves sync performance.
Because sync_thread did not use a blk_plug, bios were not merged
well during a RAID sync.

Testing environment:
SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
Controller.
OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
x86_64 x86_64 x86_64 GNU/Linux.
RAID5: four ST31000524NS disk.

Without blk_plug:recovery speed about 63M/Sec;
Add blk_plug:recovery speed about 120M/Sec.

Using blktrace:
blktrace -d /dev/sdb -w 60 -o -|blkparse -i -

without blk_plug:
Total (8,16):
Reads Queued: 309811, 1239MiB Writes Queued: 0, 0KiB
Read Dispatches: 283583, 1189MiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 273351, 1149MiB Writes Completed: 0, 0KiB
Read Merges: 23533, 94132KiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 0

add blk_plug:
Total (8,16):
Reads Queued: 428697, 1714MiB Writes Queued: 0, 0KiB
Read Dispatches: 3954, 1714MiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 3956, 1715MiB Writes Completed: 0, 0KiB
Read Merges: 424743, 1698MiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 3384

The merge ratio increases markedly.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0c098220 21-May-2012 Yuanhan Liu <yuanhan.liu@linux.intel.com>

md: check the return of mddev_find()

Check the return of mddev_find(), since it may fail when out of
memory or out of usable minor numbers.

The reason I chose -ENODEV instead of -ENOMEM or something else is
md_alloc() function chose that ;)

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 47525e59 21-May-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Set recovery flags on resume

Properly initialize MD recovery flags when resuming device-mapper devices.

When a device-mapper device is suspended, all I/O must stop. This is done by
calling 'md_stop_writes' and 'mddev_suspend'. These calls in-turn manipulate
the recovery flags - including setting 'MD_RECOVERY_FROZEN'. The DM device
may have been suspended while recovery was not yet complete, so the process
needs to pick-up where it left off. Since 'mddev_resume' does not unset
'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
(e.g. It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
setting 'MD_RECOVERY_FROZEN'. Clearing it in 'mddev_resume' would override the
desired behavior.)

Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.

Also clean up where level_store calls mddev_resume() - it currently
duplicates some of the functions of that call. - NB

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a4a6125a 21-May-2012 NeilBrown <neilb@suse.de>

md: allow array to be resized while bitmap is present.

Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.

This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1ec885cd 21-May-2012 NeilBrown <neilb@suse.de>

md/bitmap: move some fields of 'struct bitmap' into a 'storage' substruct.

This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.

Signed-off-by: NeilBrown <neilb@suse.de>


# ef99bf48 21-May-2012 NeilBrown <neilb@suse.de>

md/bitmap: allow a bitmap with no backing storage.

An md bitmap comprises two parts
- internal counting of active writes per 'chunk'.
- external storage of whether there are any active writes on
each chunk

The second requires the first, but the first doesn't require the
second.

Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.

So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.

A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.

We don't actually activate these bitmaps yet - that will come
in a later patch.

Signed-off-by: NeilBrown <neilb@suse.de>


# 6409bb05 21-May-2012 NeilBrown <neilb@suse.de>

md/bitmap: add new 'space' attribute for bitmaps.

If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.

So create an attribute to store this information and set appropriate
defaults.

It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.

Signed-off-by: NeilBrown <neilb@suse.de>


# 4fa2f327 21-May-2012 NeilBrown <neilb@suse.de>

md: move freeing of badblocks.page into md_rdev_clear

This ensures that it is always freed - there were cases where
we failed to free the page.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 545c8795 21-May-2012 NeilBrown <neilb@suse.de>

md: dm-raid should call helper function to clear rdev.

dm-raid currently open-codes the freeing of some members of
an rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.

So rename free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c

Signed-off-by: NeilBrown <neilb@suse.de>


# c804cdec 20-May-2012 NeilBrown <neilb@suse.de>

md: use resync_max_sectors for reshape as well as resync.

Some resync type operations need to act on the address space of the
device, others on the address space of the array.

This only affects RAID10, so it sets resync_max_sectors to the array
size (it defaults to the device size), and that is currently used for
resync only. However reshape of a RAID10 must be done against the
array size, not device size, so change code to use resync_max_sectors
for both the resync and the reshape cases.
This does not affect RAID5 or RAID1, just RAID10.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1fdd6fc9 20-May-2012 NeilBrown <neilb@suse.de>

md: teach sync_page_io about new_data_offset.

Some code in raid1 and raid10 uses sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.

So add that understanding.

Signed-off-by: NeilBrown <neilb@suse.de>


# c6563a8c 20-May-2012 NeilBrown <neilb@suse.de>

md: add possibility to change data-offset for devices.

When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previously check that all 'pad' fields were zero,
we need a new FEATURE flag for this.
Also (belatedly) check that all remaining 'pad' fields are
zero to avoid a repeat of this)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be. At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors. So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2c810cdd 20-May-2012 NeilBrown <neilb@suse.de>

md: allow a reshape operation to be reversed.

Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progresses in the opposite direction.

To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.

However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.

So add a 'reshape_direction' attribute which is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.

This will become more important when we allow the data_offset to
change in a reshape. Then the explicit statement of what direction is
being used will be more useful.

This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.

Signed-off-by: NeilBrown <neilb@suse.de>


# b5e1b8ce 20-May-2012 Shaohua Li <shli@kernel.org>

md: using GFP_NOIO to allocate bio for flush request

A flush request is usually issued in transaction commit code path, so
using GFP_KERNEL to allocate memory for flush request bio falls into
the classic deadlock issue.

This is suitable for any -stable kernel to which it applies as it
avoids a possible deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0d9f4f13 16-May-2012 Jonathan Brassow <jbrassow@redhat.com>

MD: Add del_timer_sync to mddev_suspend (fix nasty panic)

Use del_timer_sync to remove timer before mddev_suspend finishes.

We don't want a timer going off after an mddev_suspend is called. This is
especially true with device-mapper, since it can call the destructor function
immediately following a suspend. This results in the removal (kfree) of the
structures upon which the timer depends - resulting in a very ugly panic.
Therefore, we add a del_timer_sync to mddev_suspend to prevent this.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 30b8aa91 23-Apr-2012 NeilBrown <neilb@suse.de>

md: fix possible corruption of array metadata on shutdown.

commit c744a65c1e2d59acc54333ce8
md: don't set md arrays to readonly on shutdown.

removed the possibility of a 'BUG' when data is written to an array
that has just been switched to read-only, but also introduced the
possibility that the array metadata could be corrupted.

If, when md_notify_reboot gets the mddev lock, the array is
in a state where it is assembled but hasn't been started (as can
happen if the personality module is not available, or in other unusual
situations), then incorrect metadata will be written out making it
impossible to re-assemble the array.

So only call __md_stop_writes() if the array has actually been
activated.

This patch is needed for any stable kernel which has had the above
commit applied.

Cc: stable@vger.kernel.org
Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# ed209584 23-Apr-2012 NeilBrown <neilb@suse.de>

md: don't call ->add_disk unless there is good reason.

Commit 7bfec5f35c68121e7b18

md/raid5: If there is a spare and a want_replacement device, start replacement.

caused md_check_recovery to call ->add_disk much more often.
Instead of only when the array is degraded, it is now called whenever
md_check_recovery finds anything useful to do, which includes
updating the metadata for clean<->dirty transition.
This causes unnecessary work, and causes info messages from ->add_disk
to be reported much too often.

So refine md_check_recovery to only do any actual recovery checking
(including ->add_disk) if MD_RECOVERY_NEEDED is set.

This fix is suitable for 3.3.y:

Cc: stable@vger.kernel.org
Reported-by: Jan Ceuleers <jan.ceuleers@computer.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# ecb178bb 18-Mar-2012 majianpeng <majianpeng@gmail.com>

md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().

If there are no unacked bad blocks, then there is no point searching
for them to acknowledge them.


Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# d0962936 18-Mar-2012 NeilBrown <neilb@suse.de>

md: fix clearing of the 'changed' flags for the bad blocks list.

In super_1_sync (the first hunk) we need to clear 'changed' before
checking read_seqretry(), otherwise we might race with other code
adding a bad block and so won't retry later.

In md_update_sb (the second hunk), in the case where there is no
metadata (neither persistent nor external), we treat any bad blocks as
an error. However we need to clear the 'changed' flag before calling
md_ack_all_badblocks, else it won't do anything.

This patch is suitable for -stable release 3.0 and later.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 57148964 18-Mar-2012 NeilBrown <neilb@suse.de>

md/bitmap: move printing of bitmap status to bitmap.c

The part of /proc/mdstat which describes the bitmap should really
be generated by code in bitmap.c. So move it there.

Signed-off-by: NeilBrown <neilb@suse.de>


# 050b6615 18-Mar-2012 NeilBrown <neilb@suse.de>

md/raid10: handle merge_bvec_fn in member devices.

Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem. There is no locking around calls
to ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>


# dafb20fa 18-Mar-2012 NeilBrown <neilb@suse.de>

md: tidy up rdev_for_each usage.

md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicit list_for_each_entry.

So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.
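
The difference between the plain and the 'safe' iterator can be
sketched in userspace C. This is a minimal illustration only: the
demo_* names and the singly linked list are hypothetical stand-ins,
not the kernel's list_head based macros, and the real macros take an
mddev rather than a list head.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct md_rdev on a singly linked list. */
struct demo_rdev {
	int nr;
	struct demo_rdev *next;
};

/* Plain iterator: the body must not free or unlink the current entry. */
#define demo_rdev_for_each(rdev, head) \
	for ((rdev) = (head); (rdev) != NULL; (rdev) = (rdev)->next)

/* 'Safe' iterator: 'tmp' caches the next entry before the body runs,
 * so the body may free the current entry. */
#define demo_rdev_for_each_safe(rdev, tmp, head) \
	for ((rdev) = (head); \
	     (rdev) != NULL && ((tmp) = (rdev)->next, 1); \
	     (rdev) = (tmp))

/* Prepend a new entry and return the new head. */
static struct demo_rdev *demo_push(struct demo_rdev *head, int nr)
{
	struct demo_rdev *r = malloc(sizeof(*r));

	if (!r)
		abort();
	r->nr = nr;
	r->next = head;
	return r;
}

/* Read-only walk: the plain iterator is enough, no extra variable. */
static int demo_count(struct demo_rdev *head)
{
	struct demo_rdev *rdev;
	int n = 0;

	demo_rdev_for_each(rdev, head)
		n++;
	return n;
}

/* Destructive walk: needs the safe iterator and its extra variable. */
static int demo_free_all(struct demo_rdev *head)
{
	struct demo_rdev *rdev, *tmp;
	int n = 0;

	demo_rdev_for_each_safe(rdev, tmp, head) {
		free(rdev);
		n++;
	}
	return n;
}
```

Only demo_free_all needs the second variable, which is the case the
'_safe' suffix is meant to document.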

Signed-off-by: NeilBrown <neilb@suse.de>


# c744a65c 18-Mar-2012 NeilBrown <neilb@suse.de>

md: don't set md arrays to readonly on shutdown.

It seems that with recent kernels, writeback can still be happening
while shutdown is happening, and consequently data can be written
after the md reboot notifier switches all arrays to read-only.
This causes a BUG.

So don't switch them to read-only - just mark them clean and
set 'safemode' to '2' which means that immediately after any
write the array will be switched back to 'clean'.

This could result in the shutdown happening when array is marked
dirty, thus forcing a resync on reboot. However if you reboot
without performing a "sync" first, you get to keep both halves.

This is suitable for any stable kernel (though there might be some
conflicts with obvious fixes in earlier kernels).

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# db91ff55 06-Feb-2012 NeilBrown <neilb@suse.de>

md: two small fixes to handling interrupt resync.

1/ If a resync is aborted we should record how far we got
(recovery_cp) as the last request that we know has completed
(->curr_resync_completed) rather than the last request that was
submitted (->curr_resync).

2/ When a resync aborts we still want to update the metadata with
any changes, so set MD_CHANGE_DEVS even if we 'skip'.

Signed-off-by: NeilBrown <neilb@suse.de>


# b1bd055d 11-Jan-2012 Martin K. Petersen <martin.petersen@oracle.com>

block: Introduce blk_set_stacking_limits function

Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f2a371c5 08-Jan-2012 NeilBrown <neilb@suse.de>

md: notify the 'degraded' sysfs attribute on failure.

We currently only 'notify' changes to the 'degraded' attribute
when it decreases, not when it increases.

Notifying on failure is a little awkward as it happens in
interrupt context.
So instead, notify when we remove the failed device from the array,
which is very soon afterwards.

Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# ff01bb48 16-Sep-2011 Al Viro <viro@zeniv.linux.org.uk>

fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7bfec5f3 22-Dec-2011 NeilBrown <neilb@suse.de>

md/raid5: If there is a spare and a want_replacement device, start replacement.

When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.

This requires that common md code attempt hot_add even when the array
is not formally degraded.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 2d78f8c4 22-Dec-2011 NeilBrown <neilb@suse.de>

md: create externally visible flags for supporting hot-replace.

hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.

With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition. It can also be used when the original
device is still reliable but is not wanted for some reason.

This will eventually be supported in RAID4/5/6 and RAID10.

This patch adds a super-block flag to distinguish the replacement
device. If an old kernel sees this flag it will reject the device.

It also adds two per-device flags which are viewable and settable via
sysfs.
"want_replacement" can be set to request that a device be replaced.
"replacement" is set to show that this device is replacing another
device.

The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement. We currently don't make links for the
replacement - there doesn't seem to be a need.

Signed-off-by: NeilBrown <neilb@suse.de>


# b8321b68 22-Dec-2011 NeilBrown <neilb@suse.de>

md: change hot_remove_disk to take an rdev rather than a number.

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 476a7abb 22-Dec-2011 NeilBrown <neilb@suse.de>

md: remove test for duplicate device when setting slot number.

When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.

Thus the common test is not needed.

As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.

So remove the common test.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 506c9e44 22-Dec-2011 NeilBrown <neilb@suse.de>

md: allow non-privileged uses to GET_*_INFO about raid arrays.

The info is already available in /proc/mdstat and /sys/block in
an accessible form so there is no point in putting a road-block in
the ioctl for information gathering.

Signed-off-by: NeilBrown <neilb@suse.de>


# 60fc1370 22-Dec-2011 NeilBrown <neilb@suse.de>

md: don't give up looking for spares on first failure-to-add

Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.

Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.

If we abort early these might not happen properly.

So remove the early abort.

In particular if you have an array that is undergoing recovery and
which has extra spares, then the recovery may not restart after a
reboot as the count of 'spares' might end up as zero.
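
A minimal userspace sketch of why the early abort matters. All demo_*
names and fields are hypothetical, and the 'abort_on_failure' flag
exists only so both behaviours can be compared; the real loop works on
rdevs and calls into the personality.

```c
#include <stddef.h>

/* Hypothetical device states for the sketch. */
enum demo_state { DEMO_IN_SYNC, DEMO_RECOVERING, DEMO_SPARE };

struct demo_dev {
	enum demo_state state;
	int accepted_by_personality;	/* would a hot-add succeed? */
	unsigned long long recovery_offset;
};

/*
 * Walk all devices, count those that need recovery, and try to add
 * each spare. With abort_on_failure set, the loop stops at the first
 * spare that is rejected, so later devices are never counted.
 */
static int demo_count_spares(struct demo_dev *devs, size_t n,
			     int abort_on_failure)
{
	int spares = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct demo_dev *d = &devs[i];

		if (d->state == DEMO_RECOVERING) {
			spares++;	/* already mid-recovery */
			continue;
		}
		if (d->state != DEMO_SPARE)
			continue;

		d->recovery_offset = 0;
		if (d->accepted_by_personality) {
			d->state = DEMO_RECOVERING;
			spares++;
		} else if (abort_on_failure) {
			break;		/* the old, false optimisation */
		}
	}
	return spares;
}
```

With a rejected spare listed ahead of a recovering device, the
aborting variant returns zero while the non-aborting one returns one.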

Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: NeilBrown <neilb@suse.de>


# 8bd2f0a0 07-Dec-2011 NeilBrown <neilb@suse.de>

md: ensure new badblocks are handled promptly.

When we mark blocks as bad we need them to be acknowledged by the
metadata handler promptly.

For an in-kernel metadata handler that was already being done. But
for an external metadata handler we need to alert it of the change by
sending a notification through the sysfs file. This adds that
notification.

Signed-off-by: NeilBrown <neilb@suse.de>


# 52c64152 07-Dec-2011 NeilBrown <neilb@suse.de>

md: bad blocks shouldn't cause a Blocked status on a Faulty device.

Once a device is marked Faulty the badblocks - whether acknowledged or
not - become irrelevant. So they shouldn't cause the device to be
marked as Blocked.

Without this patch, a process might write "-blocked" to clear the
Blocked status, but while that will correctly fail the device, it
won't remove the apparent 'blocked' status.

Signed-off-by: NeilBrown <neilb@suse.de>


# af8a2434 07-Dec-2011 NeilBrown <neilb@suse.de>

md: take a reference to mddev during sysfs access.


When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.

The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.

To guard against this we:
- initialise mddev->all_mddevs on last put so the state can be
easily detected.
- in md_attr_show and md_attr_store, check ->all_mddevs under
all_mddevs_lock and mddev_get the mddev if it still appears to
be active.

This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.

rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection. They check that rdev->mddev still points to
mddev after taking mddev_lock. As this is cleared before delayed
removal which can only be requested under the mddev_lock, this
ensures the rdev and mddev are still alive.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1d23f178 07-Dec-2011 NeilBrown <neilb@suse.de>

md: refine interpretation of "hold_active == UNTIL_IOCTL".

We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not. We can only tell from recent history of changes.

In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.

So we always preserve a newly created md device until something
significant happens. This state is stored in 'hold_active'.

The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it. If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.

We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl as a sign that if
it now isn't more active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual so we should look for a more definitive change.

So this patch only changes 'hold_active' from UNTIL_IOCTL to clear when
the array_state is changed via sysfs. Other changes via sysfs
are ignored.

Signed-off-by: NeilBrown <neilb@suse.de>


# 056075c7 03-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

md: Add module.h to all files using it implicitly

A pending cleanup will mean that module.h won't be implicitly
included everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 751e67ca 18-Oct-2011 Chris Dunlop <chris@onthe.net.au>

md.c: trivial comment fix

Trivial comment fix

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: NeilBrown <neilb@suse.de>


# d70ed2e4 17-Oct-2011 Andrei Warkentin <andreiw@vmware.com>

MD: Allow restarting an interrupted incremental recovery.

If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).

Solve this problem by not updating the superblock on the
recovering device until the array is no longer degraded.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# d30519fc 17-Oct-2011 NeilBrown <neilb@suse.de>

md: clear In_sync bit on devices added to an active array.

When we add a device to an active array it can be meaningful to set
the 'insync' flag. This indicates that the device is in-sync with the
array except for locations recorded in the bitmap.
A bitmap-based recovery can then bring it completely in-sync.

Internally we move that flag to 'saved_raid_disk' but forgot to clear
In_sync like we do in add_new_disk.

So clear In_sync after moving its value to saved_raid_disk.

Reported-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 84fc4b56 10-Oct-2011 NeilBrown <neilb@suse.de>

md: rename "mdk_personality" to "md_personality"

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b8bf345 10-Oct-2011 NeilBrown <neilb@suse.de>

md: remove typedefs: mdk_thread_t -> struct md_thread

Signed-off-by: NeilBrown <neilb@suse.de>


# fd01b88c 10-Oct-2011 NeilBrown <neilb@suse.de>

md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>


# 3cb03002 10-Oct-2011 NeilBrown <neilb@suse.de>

md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>


# 36a4e1fe 06-Oct-2011 NeilBrown <neilb@suse.de>

md: remove PRINTK and dprintk debugging and use pr_debug

Being able to dynamically enable these makes them much more useful.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2dba6a91 23-Sep-2011 Daniel P. Berrange <berrange@redhat.com>

md: don't delay reboot by 1 second if no MD devices exist

The md_notify_reboot() method includes a call to mdelay(1000),
to deal with "exotic SCSI devices" which are too volatile on
reboot. The delay is unconditional. Even if the machine does
not have any block devices, let alone MD devices, the kernel
shutdown sequence is slowed down.

1 second does not matter much with physical hardware, but with
certain virtualization use cases any wasted time in the bootup
& shutdown sequence counts for a lot.

* drivers/md/md.c: md_notify_reboot() - only impose a delay if
there was at least one MD device to be stopped during reboot

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 01f96c0a 20-Sep-2011 NeilBrown <neilb@suse.de>

md: Avoid waking up a thread after it has been freed.

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
disappearing either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
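
The pointer-to-pointer pattern described above can be sketched in user
space as follows. This is an illustration under assumptions, not the
kernel implementation: toy_lock/toy_unlock are hypothetical stand-ins
for the spinlock, and toy_thread stands in for the md thread.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the md thread object. */
struct toy_thread {
	int id;
};

/* Stand-in for a spinlock; a real lock would go here. */
static int lock_depth;
static void toy_lock(void)   { lock_depth++; }
static void toy_unlock(void) { lock_depth--; }

/*
 * Takes the address of the caller's thread pointer so that the pointer
 * is always cleared, under the lock, before the thread is freed.
 */
static void toy_unregister_thread(struct toy_thread **threadp)
{
	struct toy_thread *thread;

	toy_lock();
	thread = *threadp;
	*threadp = NULL;
	toy_unlock();

	free(thread);	/* free(NULL) is a no-op */
}

/* Wakes the thread only if the pointer is still set; returns 1 if so. */
static int toy_wakeup_thread(struct toy_thread **threadp)
{
	int woke = 0;

	toy_lock();
	if (*threadp)
		woke = 1;	/* a real version would wake the thread here */
	toy_unlock();
	return woke;
}
```

Because the callee clears the pointer itself, an error path cannot
forget to do so, and a later wakeup sees NULL instead of freed memory.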


# 5a7bbad2 11-Sep-2011 Christoph Hellwig <hch@infradead.org>

block: remove support for bio remapping from ->make_request

There is very little benefit in letting a ->make_request instance
update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same by calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 27a7b260 10-Sep-2011 NeilBrown <neilb@suse.de>

md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.

0.90 metadata uses an unsigned 32bit number to count the number of
kilobytes used from each device.
This should allow up to 4TB per device.
However we multiply this by 2 (to get sectors) before casting to a
larger type, so sizes above 2TB get truncated.

Also we allow rdev->sectors to be larger than 4TB, so it is possible
for the array to be resized larger than the metadata can handle.
So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
use.

Also the sanity check at the end of super_90_load should include level
1 as it uses ->size too. (RAID0 and Linear don't use ->size at all).

Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
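
The truncation described above is ordinary 32-bit wrap-around. A small
sketch, with hypothetical function names, showing the difference between
multiplying before and after widening:

```c
#include <stdint.h>

/*
 * Buggy form: the multiply is done in 32 bits, so the result wraps
 * before it is widened to 64 bits.
 */
static uint64_t kb_to_sectors_truncating(uint32_t size_kb)
{
	return (uint32_t)(size_kb * 2u);
}

/* Fixed form: widen first, then multiply. */
static uint64_t kb_to_sectors_widened(uint32_t size_kb)
{
	return (uint64_t)size_kb * 2;
}
```

For a device of 3 * 2^30 kilobytes the first form yields 2^31 sectors
while the second yields 3 * 2^31; below 2^31 kilobytes both agree.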


# 7da64a0a 30-Aug-2011 NeilBrown <neilb@suse.de>

md: fix clearing of 'blocked' flag in the presence of bad blocks.

When the 'blocked' flag on a device is cleared while there are
unacknowledged bad blocks we must fail the device. This is needed for
backwards compatibility of the interface.

The code currently uses the wrong test for "unacknowledged bad blocks
exist". Change it to the right test.

Signed-off-by: NeilBrown <neilb@suse.de>


# a5bf4df0 24-Aug-2011 Namhyung Kim <namhyung@gmail.com>

md: use REQ_NOIDLE flag in md_super_write()

Queue idling is used for the anticipation of immediate
sequential I/O's but md_super_write() is a kind of one-
shot operation, coupled with md_super_wait(), so the
idling in this case will be just a waste of time.

Specifying REQ_NOIDLE prevents it. Instead of adding
the flag to submit_bio() directly, use pre-defined
macro WRITE_FLUSH_FUA.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# aeb9b211 24-Aug-2011 NeilBrown <neilb@suse.de>

md: ensure changes to 'write-mostly' are reflected in metadata.

The 'write-mostly' flag can be changed through sysfs.
With 0.90 metadata, those changes are reflected in the metadata.
For 1.x metadata, they aren't.

So fix super_1_sync to record 'write-mostly' status.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5ef56c8f 24-Aug-2011 NeilBrown <neilb@suse.de>

md: report failure if a 'set faulty' request doesn't.

Sometimes a device will refuse to be set faulty. e.g. RAID1 will
never let the last working device become faulty.

So check if "md_error()" did manage to set the faulty flag and fail
with EBUSY if it didn't.

Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198
Reported-by: Mike Hommey <mh+reportbug@glandium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
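
A hedged sketch of the check described above. toy_error is a
hypothetical stand-in for md_error() that may refuse to fail a device;
the caller then inspects the flag and reports -EBUSY if nothing changed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical component device; not the kernel's struct md_rdev. */
struct toy_dev {
	bool faulty;
	bool last_working;	/* assumption: such a device refuses to fail */
};

/* Stand-in for md_error(): silently refuses for the last working device. */
static void toy_error(struct toy_dev *dev)
{
	if (!dev->last_working)
		dev->faulty = true;
}

/* Request 'set faulty' and report whether it actually took effect. */
static int toy_set_faulty(struct toy_dev *dev)
{
	toy_error(dev);
	return dev->faulty ? 0 : -EBUSY;
}
```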


# e875ecea 27-Jul-2011 NeilBrown <neilb@suse.de>

md/raid10 record bad blocks as needed during recovery.

When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>


# de393cde 27-Jul-2011 NeilBrown <neilb@suse.de>

md: make it easier to wait for bad blocks to be acknowledged.

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked_wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>


# d7a9d443 27-Jul-2011 NeilBrown <neilb@suse.de>

md: add 'write_error' flag to component devices.

If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>


# d2eb35ac 27-Jul-2011 NeilBrown <neilb@suse.de>

md/raid1: avoid reading from known bad blocks.

Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.

Signed-off-by: NeilBrown <neilb@suse.de>


# 9f2f3830 27-Jul-2011 NeilBrown <neilb@suse.de>

md: Disable bad blocks and v0.90 metadata.

v0.90 metadata cannot record bad blocks, so when loading metadata
for such a device, set shift to -1.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2699b672 27-Jul-2011 NeilBrown <neilb@suse.de>

md: load/store badblock list from v1.x metadata

Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.

We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.

If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.

Signed-off-by: NeilBrown <neilb@suse.de>


# 16c791a5 27-Jul-2011 NeilBrown <neilb@suse.de>

md/bad-block-log: add sysfs interface for accessing bad-block-log.

This can show the log (providing it fits in one page) and
allows bad blocks to be 'acknowledged' meaning that they
have safely been recorded in metadata.

Clearing bad blocks is not allowed via sysfs (except for
code testing). A bad block can only be cleared when
a write to the block succeeds.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>


# 2230dfe4 27-Jul-2011 NeilBrown <neilb@suse.de>

md: beginnings of bad block management.

This is the first step in allowing md to track bad-blocks per-device so
that we can fail individual blocks rather than the whole device.

This patch just adds a data structure for recording bad blocks, with
routines to add, remove, search the list.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>


# a519b26d 27-Jul-2011 NeilBrown <neilb@suse.de>

md: remove suspicious size_of()

When calling bioset_create we pass the size of the front_pad as
sizeof(mddev)
which looks suspicious as mddev is a pointer and so it looks like a
common mistake where
sizeof(*mddev)
was intended.
The size is actually correct as we want to store a pointer in the
front padding of the bios created by the bioset, so make the intent
more explicit by using
sizeof(mddev_t *)

Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
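
The three sizeof forms discussed above can be compared directly. This
sketch uses a hypothetical toy_mddev struct; the point is only that
sizeof applied to a pointer variable and sizeof applied to the pointer
type give the same value, while sizeof(*ptr) gives the struct size.

```c
#include <stddef.h>

/* Hypothetical stand-in for the array structure. */
struct toy_mddev {
	char payload[256];
};

/* sizeof applied to the pointer variable: size of a pointer. */
static size_t pad_from_variable(struct toy_mddev *mddev)
{
	return sizeof(mddev);
}

/* sizeof applied to the pointed-to object: size of the whole struct. */
static size_t pad_from_deref(struct toy_mddev *mddev)
{
	return sizeof(*mddev);	/* operand is not evaluated */
}

/* Explicit pointer type: same value as the first form, clearer intent. */
static size_t pad_explicit(void)
{
	return sizeof(struct toy_mddev *);
}
```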


# 768e587e 26-Jul-2011 Jonathan Brassow <jbrassow@redhat.com>

MD: generate an event when array sync is complete

This patch causes MD to generate an event (for device-mapper) when the
synchronization thread is reaped. This is expected behavior for device-mapper.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 65a06f06 26-Jul-2011 Namhyung Kim <namhyung@gmail.com>

md: get rid of unnecessary casts on page_address()

page_address() returns void pointer, so the casts can be removed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 5389042f 26-Jul-2011 NeilBrown <neilb@suse.de>

md: change management of recovery_disabled.

If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk with
a read error is better than not having an array at all.

Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1. For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.

So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality wants to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.

This will allow RAID10 to get the control that it needs.

Signed-off-by: NeilBrown <neilb@suse.de>
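
One way to picture the 'cookie' scheme described above, as a sketch with
hypothetical types: the array holds a counter, the personality keeps a
private copy, recovery counts as disabled while the two match, and
adding a device bumps the array's counter so they stop matching.

```c
#include <stdbool.h>

struct toy_array { int recovery_disabled; };	/* per-array cookie */
struct toy_conf  { int recovery_disabled; };	/* personality's copy */

/* Start with the two values different, so recovery is allowed. */
static void toy_init(struct toy_array *a, struct toy_conf *c)
{
	a->recovery_disabled = 0;
	c->recovery_disabled = a->recovery_disabled - 1;
}

/* Personality decides to disable recovery: copy the current cookie. */
static void toy_disable_recovery(const struct toy_array *a,
				 struct toy_conf *c)
{
	c->recovery_disabled = a->recovery_disabled;
}

/* A device was added: change the cookie, which re-enables recovery. */
static void toy_device_added(struct toy_array *a)
{
	a->recovery_disabled++;
}

static bool toy_recovery_allowed(const struct toy_array *a,
				 const struct toy_conf *c)
{
	return c->recovery_disabled != a->recovery_disabled;
}
```

Since each personality instance holds its own copy, the same mechanism
could be kept per device where finer-grained control is wanted.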


# a478a069 26-Jul-2011 Namhyung Kim <namhyung@gmail.com>

md: remove ro check in md_check_recovery()

Commit c89a8eee6154 ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 36fad858 26-Jul-2011 Namhyung Kim <namhyung@gmail.com>

md: introduce link/unlink_rdev() helpers

There are places where sysfs links to rdev are handled
in the same way. Add helper functions to consolidate
them.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# f1514638 12-Jul-2011 Kay Sievers <kay.sievers@vrfy.org>

fs: seq_file - add event counter to simplify poll() support

Moving the event counter into the dynamically allocated 'struct seq_file'
allows poll() support without the need to allocate its own tracking
structure.

All current users are switched over to use the new counter.

Requested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4274215d 28-Jun-2011 NeilBrown <neilb@suse.de>

md: avoid endless recovery loop when waiting for failed device to complete.

If a device fails in a way that causes pending requests to take a while
to complete, md will not be able to immediately remove it from the
array in remove_and_add_spares.
It will then incorrectly look like a spare device and md will try to
recover it even though it is failed.
This leads to a recovery process starting and instantly aborting over
and over again.

We should check if the device is faulty before considering it to be a
spare. This will avoid trying to start a recovery that cannot
proceed.

This bug was introduced in 2.6.26 so that patch is suitable for any
kernel since then.

Cc: stable@kernel.org
Reported-by: Jim Paradis <james.paradis@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
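
The extra test described above amounts to one more condition in the
spare-counting predicate. A sketch with a hypothetical toy_rdev type;
the real flags are bits in the rdev state, simplified here to booleans.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical component device with simplified state. */
struct toy_rdev {
	int  raid_disk;	/* slot in the array, negative if none */
	bool in_sync;
	bool faulty;
};

/* A device only counts as a recoverable spare if it is not faulty. */
static bool toy_is_spare(const struct toy_rdev *rdev)
{
	return rdev->raid_disk >= 0 && !rdev->in_sync && !rdev->faulty;
}

static int toy_count_spares(const struct toy_rdev *rdevs, size_t count)
{
	int spares = 0;

	for (size_t i = 0; i < count; i++)
		if (toy_is_spare(&rdevs[i]))
			spares++;
	return spares;
}
```

A failed device that still occupies a slot is then skipped, so no
recovery is started on its behalf.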


# 01393f3d 08-Jun-2011 Namhyung Kim <namhyung@gmail.com>

md: check ->hot_remove_disk when removing disk

Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store()
during disk removal. The linear personality only has ->hot_add_disk and
no ->hot_remove_disk, so removing a disk in the array resulted in the
following kernel bug:

$ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
$ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
PGD c9f5d067 PUD 8575a067 PMD 0
Oops: 0010 [#1] SMP
CPU 2
Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg

Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
RIP: 0010:[<0000000000000000>] [< (null)>] (null)
RSP: 0018:ffff880085757df0 EFLAGS: 00010282
RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
FS: 00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
Stack:
ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
Call Trace:
[<ffffffff8138496a>] ? slot_store+0xaa/0x265
[<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
[<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
[<ffffffff81106b87>] vfs_write+0xb1/0x10d
[<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
[<ffffffff81106cac>] sys_write+0x4d/0x77
[<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP <ffff880085757df0>
CR2: 0000000000000000
---[ end trace ba5fc64319a826fb ]---

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 9864c005 08-Jun-2011 马建朋 <majianpeng@gmail.com>

md: Using poll on /proc/mdstat can monitor the events of adding spare disks

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 076f968b 07-Jun-2011 Jonathan Brassow <jbrassow@redhat.com>

MD: add sync_super to mddev_t struct

Add the 'sync_super' function pointer to MD array structure (struct mddev_s)

If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method will be looked up via 'super_types' as usual.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
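
The "use the hook if set, otherwise fall back to the table" dispatch
described above could look roughly like this. All toy_ names are
hypothetical, and the table is a simplified stand-in for 'super_types'.

```c
#include <stddef.h>

struct toy_mddev;

/* Hypothetical component device; records who synced its superblock. */
struct toy_rdev {
	int synced_by;
};

typedef void (*toy_sync_fn)(struct toy_mddev *, struct toy_rdev *);

struct toy_mddev {
	int         major_version;	/* index into the default table */
	toy_sync_fn sync_super;		/* optional override, may be NULL */
};

static void toy_super_90_sync(struct toy_mddev *m, struct toy_rdev *r)
{
	(void)m;
	r->synced_by = 90;
}

static void toy_super_1_sync(struct toy_mddev *m, struct toy_rdev *r)
{
	(void)m;
	r->synced_by = 1;
}

/* Example of a hook another module might install. */
static void toy_external_sync(struct toy_mddev *m, struct toy_rdev *r)
{
	(void)m;
	r->synced_by = 1000;
}

static const toy_sync_fn toy_super_types[] = {
	toy_super_90_sync,
	toy_super_1_sync,
};

/* Prefer the per-array hook; otherwise look the method up as usual. */
static void toy_sync_super(struct toy_mddev *m, struct toy_rdev *r)
{
	if (m->sync_super) {
		m->sync_super(m, r);
		return;
	}
	toy_super_types[m->major_version](m, r);
}
```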


# 0fd018af 07-Jun-2011 Jonathan Brassow <jbrassow@redhat.com>

MD: move thread wakeups into resume

Move personality and sync/recovery thread starting outside md_run.

Moving the wakeups of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better aligned with device-mapper primitives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# ac42450c 07-Jun-2011 Jonathan Brassow <jbrassow@redhat.com>

MD: possible typo

Make message a bit clearer by s/blocks/k/

I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambiguous, but I think it's better than "blocks"
which normally means 512, but means 1024 in MD.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 68866e42 07-Jun-2011 Jonathan Brassow <jbrassow@f14.redhat.com>

MD: no sync IO while suspended

Disallow resync I/O while the RAID array is suspended.

Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 629acb6a 07-Jun-2011 Jonathan Brassow <jbrassow@f14.redhat.com>

MD: no integrity register if no gendisk

Don't attempt md_integrity_register if there is no gendisk struct available.

When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# b098636c 10-May-2011 NeilBrown <neilb@suse.de>

md: allow resync_start to be set while an array is active.

The sysfs attribute 'resync_start' (known internally as recovery_cp),
records where a resync is up to. A value of 0 means the array is
not known to be in-sync at all. A value of MaxSector means the array
is believed to be fully in-sync.

When the size of member devices of an array (RAID1,RAID4/5/6) is
increased, the array can be increased to match. This process sets
resync_start to the old end-of-device offset so that the new part of
the array gets resynced.

However with RAID1 (and RAID6) a resync is not technically necessary
and may be undesirable. So it would be good if the implied resync
after the array is resized could be avoided.

So: change 'resync_start' so the value can be changed while the array
is active, and as a precaution only allow it to be changed while
resync/recovery is 'frozen'. Changing it once resync has started is
not going to be useful anyway.

This allows the array to be resized without a resync by:
write 'frozen' to 'sync_action'
write new size to 'component_size' (this will set resync_start)
write 'none' to 'resync_start'
write 'idle' to 'sync_action'.

Also slightly improve some tests on recovery_cp when resizing
raid1/raid5. Now that an arbitrary value could be set we should be
more careful in our tests.

Signed-off-by: NeilBrown <neilb@suse.de>


# bedd86b7 10-May-2011 NeilBrown <neilb@suse.de>

md: reject a re-add request that cannot be honoured.

The 'add_new_disk' ioctl can be used to add a device either as a
spare, or as an active disk that just needs to be resynced based on
write-intent-bitmap information (re-add).

Currently if a re-add is requested but fails we add as a spare
instead. This makes it impossible for user-space to check for
failure.

So change to require that a re-add attempt will either succeed or
completely fail. User-space can then decide what to do next.

Signed-off-by: NeilBrown <neilb@suse.de>


# b0140891 10-May-2011 NeilBrown <neilb@suse.de>

md: Fix race when creating a new md device.

There is a race when creating an md device by opening /dev/mdXX.

If two processes do this at much the same time they will follow the
call path
__blkdev_get -> get_gendisk -> kobj_lookup

The first will call
-> md_probe -> md_alloc -> add_disk -> blk_register_region

and the race happens when the second gets to kobj_lookup after
add_disk has called blk_register_region but before it returns to
md_alloc.

In that case the second will not call md_probe (as the probe is already
done) but will get a handle on the gendisk, return to __blkdev_get
which will then call md_open (via the ->open pointer).

As mddev->gendisk hasn't been set yet, md_open will think something is
wrong and return with ERESTARTSYS.

This can loop endlessly while the first thread makes no progress
through add_disk. Nothing is blocking it, but due to scheduler
behaviour it doesn't get a turn.
So this is essentially a live-lock.

We fix this by simply moving the assignment to mddev->gendisk before
the call to add_disk() so md_open doesn't get confused.
Also move blk_queue_flush earlier because add_disk should be as late
as possible.

To make sure that md_open doesn't complete until md_alloc has done all
that is needed, we take mddev->open_mutex during the last part of
md_alloc. md_open will wait for this.

This can cause a lock-up on boot so Cc:ing for stable.
For 2.6.36 and earlier a different patch will be needed as the
'blk_queue_flush' call isn't there.

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Cc: stable@kernel.org


# fee68723 19-Apr-2011 Krzysztof Wojcik <krzysztof.wojcik@intel.com>

md: Cleanup after raid45->raid0 takeover

Problem:
After a raid4->raid0 takeover operation, another takeover operation
(e.g. raid0->raid10) results in a kernel oops.
Root cause:
The 'degraded' variable in the mddev structure is not cleared
on raid45->raid0 takeover.

This patch resets this variable.

Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 97658cdd 18-Apr-2011 NeilBrown <neilb@suse.de>

md: provide generic support for handling unplug callbacks.

When an md device adds a request to a queue, it can call
mddev_check_plugged.
If this succeeds then we know that the md thread will be woken up
shortly, and ->plug_cnt will be non-zero until then, so some
processing can be delayed.

If it fails, then no unplug callback is expected and the make_request
function needs to do whatever is required to make the request happen.

Signed-off-by: NeilBrown <neilb@suse.de>


# 482c0834 18-Apr-2011 NeilBrown <neilb@suse.de>

md - remove old plugging code.

md has some plugging infrastructure for RAID5 to use because the
normal plugging infrastructure required a 'request_queue', and when
called from dm, RAID5 doesn't have one of those available.

This relied on the ->unplug_fn callback which doesn't exist any more.

So remove all of that code, both in md and raid5. Subsequent patches
will restore the plugging functionality.

Signed-off-by: NeilBrown <neilb@suse.de>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 89078d57 28-Mar-2011 Martin K. Petersen <martin.petersen@oracle.com>

md: Fix integrity registration error when no devices are capable

We incorrectly returned -EINVAL when none of the devices in the array
had an integrity profile. This in turn prevented mdadm from starting
the metadevice. Fix this so we only return errors on mismatched
profiles and memory allocation failures.

Reported-by: Giacomo Catenazzi <cate@cateee.net>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a91a2785 17-Mar-2011 Martin K. Petersen <martin.petersen@oracle.com>

block: Require subsystems to explicitly allocate bio_set integrity mempool

MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snizer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 721a9602 09-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: kill off REQ_UNPLUG

With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7eaceacc 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# f0b4f7e2 23-Feb-2011 NeilBrown <neilb@suse.de>

md: Fix - again - partition detection when array becomes active

Revert
b821eaa572fd737faaf6928ba046e571526c36c6
and
f3b99be19ded511a1bf05a148276239d9f13eefa

When I wrote the first of these I had a wrong idea about the
lifetime of 'struct block_device'. It can disappear at any time that
the block device is not open if it falls out of the inode cache.

So relying on the 'size' recorded with it to detect when the
device size has changed and so we need to revalidate, is wrong.

Rather, we really do need the 'changed' attribute stored directly in
the mddev and set/tested as appropriate.

Without this patch, a sequence of:
mknod / open / close / unlink

(which can cause a block_device to be created and then destroyed)
will result in a rescan of the partition table and consequent removal
and addition of partitions.
Several of these in a row can get udev racing to create and unlink and
other code can get confused.

With the patch, the rescan is only performed when needed and so there
are no races.

This is suitable for any stable kernel from 2.6.35.

Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# 8f5f02c4 15-Feb-2011 NeilBrown <neilb@suse.de>

md: correctly handle probe of an 'mdp' device.

'mdp' devices are md devices with preallocated device numbers
for partitions. As such it is possible to mknod and open a partition
before opening the whole device.

This causes md_probe() to be called with a device number of a
partition, which in turn calls mddev_find with such a number.

However mddev_find expects the number of a 'whole device' and
does the wrong thing with partition numbers.

So add code to mddev_find to remove the 'partition' part of
a device number and just work with the 'whole device'.

This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652

Reported-by: hkmaly@bigfoot.com
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org>
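
Removing the 'partition' part of a device number is a bit-mask
operation. The sketch below assumes, purely for illustration, a 20-bit
minor field and 6 low minor bits reserved for partitions; both
constants and the toy_ helpers are assumptions, not taken from the
patch.

```c
#include <stdint.h>

#define TOY_MINOR_BITS 20u	/* assumed width of the minor field */
#define TOY_PART_SHIFT 6u	/* assumed partition bits per device */

/* Pack a major/minor pair into one device number. */
static uint32_t toy_mkdev(uint32_t major, uint32_t minor)
{
	return (major << TOY_MINOR_BITS) | minor;
}

/* Clear the partition bits so the number names the whole device. */
static uint32_t toy_whole_device(uint32_t dev)
{
	return dev & ~((UINT32_C(1) << TOY_PART_SHIFT) - 1u);
}
```

With these assumptions minor 67 (device 1, partition 3) maps to minor
64, and a whole-device number is left unchanged.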


# cbe6ef1d 15-Feb-2011 NeilBrown <neilb@suse.de>

md: don't set_capacity before array is active.

If the desired size of an array is set (via sysfs) before the array is
active (which is the normal sequence), we currently call set_capacity
immediately.
This means that a subsequent 'open' (as can be caused by some
udev-triggered program) will notice the new size and try to probe for
partitions. However as the array isn't quite ready yet the read will
fail. Then when the array is ready, as the size doesn't change again
we don't try to re-probe.

So when setting array size via sysfs, only call set_capacity if the
array is already active.

Signed-off-by: NeilBrown <neilb@suse.de>


# e91ece55 07-Feb-2011 Chris Mason <chris.mason@oracle.com>

md_make_request: don't touch the bio after calling make_request

md_make_request was calling bio_sectors() for part_stat_add
after it was calling the make_request function. This is
bad because the make_request function can free the bio and
because the bi_size field can change around.

The fix here was suggested by Jens Axboe. It saves the
sector count before the make_request call. I hit this
with CONFIG_DEBUG_PAGEALLOC turned on while trying to break
his pretty fusionio card.

Cc: <stable@kernel.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
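
The shape of the fix described above, as a user-space sketch with
hypothetical names: read what you need from the object before handing
it to a callee that may free it, and use only the saved copy afterwards.

```c
#include <stdlib.h>

/* Hypothetical request object; not the kernel's struct bio. */
struct toy_bio {
	unsigned int size_bytes;
};

static unsigned int toy_sectors(const struct toy_bio *bio)
{
	return bio->size_bytes >> 9;	/* 512-byte sectors */
}

/* Stand-in for the personality's make_request: consumes the bio. */
static void toy_make_request(struct toy_bio *bio)
{
	free(bio);
}

static unsigned long toy_sectors_accounted;

static void toy_submit(struct toy_bio *bio)
{
	/* Save the count while the bio is still guaranteed valid. */
	unsigned int sectors = toy_sectors(bio);

	toy_make_request(bio);	/* bio must not be touched after this */

	toy_sectors_accounted += sectors;
}
```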


# c6751b2b 01-Feb-2011 NeilBrown <neilb@suse.de>

md: Don't allow slot_store while resync/recovery is happening.

Activating a spare in an array while resync/recovery is already
happening can lead to that spare being marked in-sync when it isn't
really.
So don't allow the 'slot' to be set (thus activating the device)
while resync/recovery is happening.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7281f812 30-Jan-2011 NeilBrown <neilb@suse.de>

md: don't clear curr_resync_completed at end of resync.

There is no need to set this to zero at this point. It will be
set to zero by remove_and_add_spares or at the start of
md_do_sync at the latest.
And setting it to zero before MD_RECOVERY_RUNNING is cleared can
make a 'zero' appear briefly in the 'sync_completed' sysfs attribute
just as resync is finishing.

So simply remove this setting to zero.


Signed-off-by: NeilBrown <neilb@suse.de>


# a8c42c7f 30-Jan-2011 NeilBrown <neilb@suse.de>

md: Don't use remove_and_add_spares to remove failed devices from a read-only array

remove_and_add_spares is called in two places where the needs really
are very different.
remove_and_add_spares should not be called on an array which is about
to be reshaped as some extra devices might have been manually added
and that would remove them. However if the array is 'read-auto',
that will currently happen, which is bad.

So in the 'ro != 0' case don't call remove_and_add_spares but simply
remove the failed devices as the comment suggests is needed.

Signed-off-by: NeilBrown <neilb@suse.de>


# f21e9ff7 30-Jan-2011 NeilBrown <neilb@suse.de>

md: Remove the AllReserved flag for component devices.

This flag is not needed and is used badly.

Devices that are included in a native-metadata array are reserved
exclusively for that array - and currently have AllReserved set.
They all are bd_claimed for the rdev and so cannot be shared.

Devices that are included in external-metadata arrays can be shared
among multiple arrays - providing there is no overlap.
These are bd_claimed for md in general - not for a particular rdev.

When changing the amount of a device that is used in an array we need
to check for overlap. This currently includes a check on AllReserved,
so even without overlap, sharing with an AllReserved device is not
allowed.
However the bd_claim usage already precludes sharing with these
devices, so the test on AllReserved is not needed. And in fact it is
wrong.

As this is the only use of AllReserved, simply remove all usage and
definition of AllReserved.

Signed-off-by: NeilBrown <neilb@suse.de>


# de171cb9 30-Jan-2011 NeilBrown <neilb@suse.de>

md: revert change to raid_disks on failure.

If we try to update_raid_disks and it fails, we should put
'delta_disks' back to zero. This is important because some code,
such as slot_store, assumes that delta_disks has been validated.

Signed-off-by: NeilBrown <neilb@suse.de>
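
The rollback can be illustrated with a small userspace sketch. Everything here is hypothetical (`struct disks_model`, `check_reshape_model()` and its "at least two disks" rule); it only shows the pattern of restoring 'delta_disks' to zero when validation fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the relevant array fields. */
struct disks_model {
    int raid_disks;
    int delta_disks;
};

/* Hypothetical validator standing in for the personality's check. */
static int check_reshape_model(const struct disks_model *a)
{
    return (a->raid_disks + a->delta_disks >= 2) ? 0 : -EINVAL;
}

/* Try to change the disk count; undo delta_disks if the check fails. */
int update_raid_disks_model(struct disks_model *a, int new_disks)
{
    int rv;

    assert(a != NULL);
    a->delta_disks = new_disks - a->raid_disks;
    rv = check_reshape_model(a);
    if (rv < 0)
        a->delta_disks = 0;  /* leave no unvalidated value behind */
    return rv;
}
```

With this shape, later code that reads delta_disks can rely on it being either zero or a value that passed validation.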


# ada609ee 25-Jan-2011 Tejun Heo <tj@kernel.org>

workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER

WQ_RESCUER is now an internal flag and should only be used in the
workqueue implementation proper. Use WQ_MEM_RECLAIM instead.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>


# 49731baa 14-Jan-2011 Tejun Heo <tj@kernel.org>

block: restore multiple bd_link_disk_holder() support

Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum. dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.

Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking. The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.

While at it, note that this facility should not be used by anyone else
than the current ones. Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# bf2cb0da 13-Jan-2011 NeilBrown <neilb@suse.de>

md: Fix removal of extra drives when converting RAID6 to RAID5

When a RAID6 is converted to a RAID5, the extra drive should
be discarded. However it isn't due to a typo in a comparison.

This bug was introduced in commit e93f68a1fc6 in 2.6.35-rc4
and is suitable for any -stable since then.

As the extra drive is not removed, the 'degraded' counter is wrong and
so the RAID5 will not respond correctly to a subsequent failure.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# ba1b41b6 13-Jan-2011 NeilBrown <neilb@suse.de>

md: range check slot number when manually adding a spare.

When adding a spare to an active array, we should check the slot
number, but allow it to be larger than raid_disks if a reshape
is being prepared.

Apply the same test when adding a device to an
array-under-construction. It already had most of the test in place,
but not quite all.

Signed-off-by: NeilBrown <neilb@suse.de>


# 13ae864b 13-Jan-2011 Rémi Rérolle <rrerolle@lacie.com>

md: fix sync_completed reporting for very large drives (>2TB)

The values exported in the sync_completed file are unsigned long, which
overflows with very large drives, resulting in wrong values reported.

Since sync_completed uses sectors as unit, we'll start getting wrong
values with components larger than 2TB.

This patch simply replaces the use of unsigned long by unsigned long long.

Signed-off-by: Rémi Rérolle <rrerolle@lacie.com>
Signed-off-by: NeilBrown <neilb@suse.de>
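
The overflow can be illustrated with a small sketch. It assumes a 32-bit `unsigned long` (modelled here with `uint32_t`) and 512-byte sectors; the helper names are hypothetical and do not appear in md.c.

```c
#include <assert.h>
#include <stdint.h>

/* Sector count for a device size in bytes, assuming 512-byte sectors. */
uint64_t bytes_to_sectors(uint64_t bytes)
{
    return bytes >> 9;
}

/* Models reporting through a 32-bit unsigned long: high bits are lost. */
uint32_t report_as_ulong32(uint64_t sectors)
{
    return (uint32_t)sectors;
}

/* Models reporting through unsigned long long: the value is preserved. */
unsigned long long report_as_ull(uint64_t sectors)
{
    unsigned long long v = sectors;

    /* unsigned long long is at least 64 bits wide, so nothing is lost. */
    assert(v == sectors);
    return v;
}
```

A 3 TiB component has 3 * 2^31 sectors, which no longer fits in 32 bits; the truncated report wraps around while the 64-bit report stays correct.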


# 23ddff37 13-Jan-2011 NeilBrown <neilb@suse.de>

md: allow suspend_lo and suspend_hi to decrease as well as increase.

The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
to which read/writes are suspended so that the underlying data can be
manipulated without user-space noticing.
Currently the window they describe can only move forwards along the
device. However this is an unnecessary restriction which will cause
problems with planned developments.
So relax this restriction and allow these endpoints to move
arbitrarily.

Signed-off-by: NeilBrown <neilb@suse.de>


# 75d3da43 13-Jan-2011 NeilBrown <neilb@suse.de>

md: Don't let implementation detail of curr_resync leak out through sysfs.

mddev->curr_resync has artificial values of '1' and '2' which are used
by the code which ensures only one resync is happening at a time on
any given device.

These values are internal and should never be exposed to user-space
(except when translated appropriately as in the 'pending' status in
/proc/mdstat).

Unfortunately they are as ->curr_resync is assigned to
->curr_resync_completed and that value is directly visible through
sysfs.

So change the assignments to ->curr_resync_completed to get the same
value from elsewhere in a form that doesn't have the magic '1' or '2'
values.

Signed-off-by: NeilBrown <neilb@suse.de>


# a6ff7e08 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com>

md: separate meta and data devs

Allow the metadata to be on a separate device from the
data.

This doesn't mean the data and metadata will be on separate
physical devices - it simply gives device-mapper and userspace
tools more flexibility.

Signed-off-by: NeilBrown <neilb@suse.de>


# ccebd4c4 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com>

md-new-param-to_sync_page_io

Add new parameter to 'sync_page_io'.

The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>


# 57b2caa3 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com>

md-new-param-to-calc_dev_sboffset

When we allow for separate devices for data and metadata
in a later patch, we will need to be able to calculate
the superblock offset based on more than the bdev.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>


# 7ebc0be7 13-Jan-2011 NeilBrown <neilb@suse.de>

md: Be more careful about clearing flags bit in ->recovery

Setting ->recovery to 0 is generally not a good idea as it could clear
bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN
should only be cleared on explicit request from user-space.

So when we need to clear things, just clear the bits that need
clearing.

As there are a few different places which reap a resync process - and
some do an incomplete job - factor out the code for doing this from
md_check_recovery and call that function instead of open coding part
of it.

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
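
The difference between zeroing the whole word and clearing selected bits can be sketched as follows. The bit values and the function name are hypothetical stand-ins (the real definitions live in the md headers); the point is only that a mask leaves unrelated bits such as FROZEN alone.

```c
#include <assert.h>

/* Hypothetical bit assignments, chosen for this sketch only. */
enum {
    RECOVERY_RUNNING = 1u << 0,
    RECOVERY_SYNC    = 1u << 1,
    RECOVERY_DONE    = 1u << 2,
    RECOVERY_FROZEN  = 1u << 3,
};

/* Clear only the bits that belong to the finished resync. */
unsigned long reap_clear_bits(unsigned long recovery)
{
    const unsigned long done_mask =
        RECOVERY_RUNNING | RECOVERY_SYNC | RECOVERY_DONE;

    recovery &= ~done_mask;
    /* Bits outside the mask, such as FROZEN, survive unchanged. */
    assert((recovery & done_mask) == 0);
    return recovery;
}
```

Had the function returned 0 instead, a FROZEN request made by user-space would be silently dropped along with the resync state.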


# defad61a 13-Jan-2011 NeilBrown <neilb@suse.de>

md: md_stop_writes requires mddev_lock.

As md_stop_writes manipulates the sync_thread and calls md_update_sb,
it needs to be called with mddev_lock held.

In all internal cases it is, but the symbol is exported for dm-raid to
call and in that case the lock won't be held.
So make an exported version which takes the lock, and an internal
version which does not.

Signed-off-by: NeilBrown <neilb@suse.de>


# 0ca69886 13-Jan-2011 NeilBrown <neilb@suse.de>

md: Ensure no IO request to get md device before it is properly initialised.

When an md device is in the process of coming on line it is possible
for an IO request (typically a partition table probe) to get through
before the array is fully initialised, which can cause unexpected
behaviour (e.g. a crash).

So explicitly record when the array is ready for IO and don't allow IO
through until then.

There is no possibility for a similar problem when the array is going
off-line as there must only be one 'open' at that time, and it is busy
off-lining the array and so cannot send IO requests. So no memory
barrier is needed in md_stop()

This has been a bug since commit 409c57f3801 in 2.6.30 which
introduced md_make_request. Before then, each personality would
register its own make_request_fn when it was ready.
This is suitable for any stable kernel from 2.6.30.y onwards.

Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
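
A minimal single-threaded sketch of the "ready" gate follows. The struct, the function names and the -EIO return value are assumptions for illustration; the real code fails the bio rather than returning an error, and it has to consider memory ordering, which this model ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the array state. */
struct ready_model {
    void *personality;  /* set once a RAID level is attached */
    bool  ready;        /* set last, after initialisation has finished */
};

/* Reject IO until the array has explicitly been marked ready. */
int make_request_model(const struct ready_model *m)
{
    if (m == NULL || m->personality == NULL || !m->ready)
        return -EIO;    /* assumed way of signalling the failure */
    return 0;
}

/* Called only after every other field has been set up. */
void mark_ready(struct ready_model *m)
{
    assert(m->personality != NULL);
    m->ready = true;
}
```

An early partition-table probe therefore sees an error instead of reaching a half-initialised personality.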


# 6c987910 13-Jan-2011 NeilBrown <neilb@suse.de>

md: fix regression resulting in delays in clearing bits in a bitmap

commit 589a594be1fb (2.6.37-rc4) fixed a problem where md_thread would
sometimes call the ->run function at a bad time.

If an error is detected during array start up after the md_thread has
been started, the md_thread is killed. This resulted in the ->run
function being called once. However the array may not be in a state
that it is safe to call ->run.

However the fix imposed meant that ->run was not called on a timeout.
This means that when an array goes idle, bitmap bits do not get
cleared promptly. While the array is busy the bits will still be
cleared when appropriate so this is not very serious. There is no
risk to data.

Change the test so that we only avoid calling ->run when the thread
is being stopped. This more explicitly addresses the problem situation.

This is suitable for 2.6.37-stable and any -stable kernel to which
589a594be1fb was applied.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
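
The two rules discussed above can be compared in a small model. The enum and both function names are hypothetical; the sketch only encodes "call ->run on explicit wakeup only" versus "call ->run unless the thread is being stopped".

```c
#include <assert.h>
#include <stdbool.h>

/* Why the worker loop came out of its wait (hypothetical model). */
enum wake_reason {
    WAKE_FLAG_SET,   /* an explicit wakeup was requested */
    WAKE_TIMEOUT,    /* the wait timed out while the array was idle */
    WAKE_STOPPING,   /* the thread is being torn down */
};

static bool valid_reason(enum wake_reason why)
{
    return why == WAKE_FLAG_SET || why == WAKE_TIMEOUT ||
           why == WAKE_STOPPING;
}

/* Earlier rule: only an explicit wakeup leads to a ->run call. */
bool call_run_wakeup_only(enum wake_reason why)
{
    assert(valid_reason(why));
    return why == WAKE_FLAG_SET;
}

/* Relaxed rule: skip ->run only when the thread is being stopped. */
bool call_run_unless_stopping(enum wake_reason why)
{
    assert(valid_reason(why));
    return why != WAKE_STOPPING;
}
```

The timeout case is the one that differs: under the relaxed rule an idle array still gets periodic ->run calls, which is what lets bitmap bits be cleared promptly.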


# bf572541 11-Jan-2011 NeilBrown <neilb@suse.de>

md: fix regression with re-adding devices to arrays with no metadata

Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were
re-added when they shouldn't be but caused a regression in a less
common case that means sometimes devices cannot be re-added when they
should be.

In particular, when re-adding a device to an array without metadata
we should always access the device, but after the above commit we
didn't.

This patch sets the In_sync flag in that case so that the re-add
succeeds.

This patch is suitable for any -stable kernel to which 1a855a0606 was
applied.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# e692cb66 01-Dec-2010 Martin K. Petersen <martin.petersen@oracle.com>

block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead

When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.

There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.

The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.

Reported-by: Ed Lin <ed.lin@promise.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 589a594b 08-Dec-2010 NeilBrown <neilb@suse.de>

md: protect against NULL reference when waiting to start a raid10.

When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.

Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called. This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.

So fix this at both ends:
- md_thread should only call the handler if THREAD_WAKEUP has been
set.
- raid10 should call md_unregister_thread before setting things
to NULL just like all the other raid modules do.

This is applicable to 2.6.35 and later.

Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 1a855a06 08-Dec-2010 NeilBrown <neilb@suse.de>

md: fix bug with re-adding of partially recovered device.

With v0.90 metadata, a hot-spare does not become a full member of the
array until recovery is complete. So if we re-add such a device to
the array, we know that all of it is as up-to-date as the event count
would suggest, and so a bitmap-based recovery is possible.

However with v1.x metadata, the hot-spare immediately becomes a full
member of the array, but it records how much of the device has been
recovered. If the array is stopped and re-assembled recovery starts
from this point.

When such a device is hot-added to an array we currently lose the 'how
much is recovered' information and incorrectly include it as a full
in-sync member (after bitmap-based fixup).
This is wrong and unsafe and could corrupt data.

So be more careful about setting saved_raid_disk - which is what
guides the re-adding of devices back into an array.
The new code matches the code in slot_store which does a similar
thing, which is encouraging.

This is suitable for any -stable kernel.

Reported-by: "Dailey, Nate" <Nate.Dailey@stratus.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# a035fc3e 08-Dec-2010 NeilBrown <neilb@suse.de>

md: fix possible deadlock in handling flush requests.

As recorded in
https://bugzilla.kernel.org/show_bug.cgi?id=24012

it is possible for a flush request through md to hang. This is due to
an interaction between the recursion avoidance in
generic_make_request, the insistence in md of only having one flush
active at a time, and the possibility of dm (or md) submitting two
flush requests to a device from the one generic_make_request.

If a generic_make_request call into dm causes two flush requests to be
queued (as happens if the dm table has two targets - they get one
each), these two will be queued inside generic_make_request.

Assume they are for the same md device.
The first is processed and causes 1 or more flush requests to be sent
to lower devices. These get queued within generic_make_request too.
Then the second flush to the md device gets handled and it blocks
waiting for the first flush to complete. But it won't complete until
the two lower-device requests complete, and they haven't even been
submitted yet as they are on the generic_make_request queue.

The deadlock can be broken by using a separate thread to submit the
requests to lower devices. md has such a thread readily available:
md_wq.

So use it to submit these requests.

Reported-by: Giacomo Catenazzi <cate@cateee.net>
Tested-by: Giacomo Catenazzi <cate@cateee.net>
Signed-off-by: NeilBrown <neilb@suse.de>


# a7a07e69 08-Dec-2010 NeilBrown <neilb@suse.de>

md: move code in to submit_flushes.

submit_flushes is called from exactly one place.
Move the code that is before and after that call into
submit_flushes.

This has no functional change, but will make the next patch
smaller and easier to follow.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b74e12e 08-Dec-2010 NeilBrown <neilb@suse.de>

md: remove handling of flush_pending in md_submit_flush_data

None of the functions called between setting flush_pending to 1, and
atomic_dec_and_test can change flush_pending, nor will anything
running in any other thread (as ->flush_bio is not NULL). So the
atomic_dec_and_test will always succeed.
So remove the atomic_set and the atomic_dec_and_test.

Signed-off-by: NeilBrown <neilb@suse.de>


# be20e6c6 23-Nov-2010 Darrick J. Wong <djwong@us.ibm.com>

md: Call blk_queue_flush() to establish flush/fua support

Before 2.6.37, the md layer had a mechanism for catching I/Os with the
barrier flag set, and translating the barrier into barriers for all
the underlying devices. With 2.6.37, I/O barriers have become plain
old flushes, and the md code was updated to reflect this. However,
one piece was left out -- the md layer does not tell the block layer
that it supports flushes or FUA access at all, which results in md
silently dropping flush requests.

Since the support already seems there, just add this one piece of
bookkeeping.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# c26a44ed 23-Nov-2010 Justin Maggard <jmaggard10@gmail.com>

md: fix return value of rdev_size_change()

When trying to grow an array by enlarging component devices,
rdev_size_store() expects the return value of rdev_size_change() to be
in sectors, but the actual value is returned in KBs.

This functionality was broken by commit
dd8ac336c13fd8afdb082ebacb1cddd5cf727889
so this patch is suitable for any kernel since 2.6.30.

Cc: stable@kernel.org
Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# d4d77629 13-Nov-2010 Tejun Heo <tj@kernel.org>

block: clean up blkdev_get() wrappers and their users

After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.

* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.

The new blkdev_get_*() functions come with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>


# e525fd89 13-Nov-2010 Tejun Heo <tj@kernel.org>

block: make blkdev_get/put() handle exclusive access

Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open is for another
exclusive access. Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if @FMODE_EXCL is specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.

Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>


# e09b457b 13-Nov-2010 Tejun Heo <tj@kernel.org>

block: simplify holder symlink handling

Code to manage symlinks in /sys/block/*/{holders|slaves} is overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.

Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface. This is a step for cleaning up
bd_claim/release. This patch makes dm-table slightly more complex but
it will be simplified again with further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com


# 77304d2a 08-Nov-2010 Mike Snitzer <snitzer@redhat.com>

block: read i_size with i_size_read()

Convert direct reads of an inode's i_size to using i_size_read().

i_size_{read,write} use a seqcount to protect reads from accessing
incomplete writes. Concurrent i_size_write()s require mutual exclusion
to protect the seqcount that is used by i_size_{read,write}. But
i_size_read() callers do not need to use additional locking.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# a167f663 26-Oct-2010 NeilBrown <neilb@suse.de>

md: use separate bio pool for each md device.

bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.

So allocate a local bio pool for each md array and use that rather
than the common pool.

This pool is used both for regular IO and metadata updates.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b193363 26-Oct-2010 NeilBrown <neilb@suse.de>

md: change type of first arg to sync_page_io.

Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.

Signed-off-by: NeilBrown <neilb@suse.de>


# e804ac78 15-Oct-2010 Tejun Heo <tj@kernel.org>

md: fix and update workqueue usage

Workqueue usage in md has two problems.

* Flush can be used during or depended upon by memory reclaim, but md
uses the system workqueue for flush_work which may lead to deadlock.

* md depends on flush_scheduled_work() to achieve exclusion against
completion of removal of previous instances. flush_scheduled_work()
may incur unexpected amount of delay and is scheduled to be removed.

This patch adds two workqueues to md - md_wq and md_misc_wq. The
former is guaranteed to make forward progress under memory pressure
and serves flush_work. The latter serves as the flush domain for
other works.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 4b532c9b 28-Oct-2010 NeilBrown <neilb@suse.de>

md: remove md_mutex locking.

lock_kernel calls were recently pushed down into open/release
functions.
md doesn't need that protection.
Then the BKL calls were changed to md_mutex. We don't need those
either.
So remove it all.

Signed-off-by: NeilBrown <neilb@suse.de>


# d97a41dc 28-Oct-2010 NeilBrown <neilb@suse.de>

md: Fix regression with raid1 arrays without persistent metadata.

A RAID1 which has no persistent metadata, whether internal or
external, will hang on the first write.
This is caused by commit 070dc6dd7103b6b3f7e4d46e754354a5c15f366e
In that case, MD_CHANGE_PENDING never gets cleared.

So during md_update_sb, if the array is neither persistent nor
external, clear MD_CHANGE_PENDING.

This is suitable for 2.6.36-stable.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
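
The rule above can be modelled in a few lines. The flag values, the struct and both functions are hypothetical; the sketch only shows that without persistent or external metadata nothing else would ever clear the pending bit, so the update path has to do it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch. */
enum {
    CHANGE_DEVS    = 1u << 0,
    CHANGE_CLEAN   = 1u << 1,
    CHANGE_PENDING = 1u << 2,
};

struct sb_model {
    bool persistent;      /* native on-disk metadata */
    bool external;        /* metadata managed by user-space */
    unsigned long flags;
};

/* Returns true when no superblock write is needed at all. */
bool update_sb_model(struct sb_model *m)
{
    assert(m != NULL);
    if (!m->persistent && !m->external) {
        /* Nobody else will acknowledge the change, so clear it here. */
        m->flags &= ~(unsigned long)CHANGE_PENDING;
        return true;
    }
    return false;
}

/* A write waits for as long as the pending bit stays set. */
bool write_may_proceed(const struct sb_model *m)
{
    return (m->flags & CHANGE_PENDING) == 0;
}
```

In the model, a metadata-less array with the pending bit set is unblocked by the first call to update_sb_model(), which mirrors the hang described above going away.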


# 2a48fc0a 02-Jun-2010 Arnd Bergmann <arnd@arndb.de>

block: autoconvert trivial BKL users to private mutex

The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.

This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.

file=$1
name=$2
if grep -q lock_kernel ${file} ; then
if grep -q 'include.*linux.mutex.h' ${file} ; then
sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
else
sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
fi
sed -i ${file} \
-e "/^#include.*linux.mutex.h/,$ {
1,/^\(static\|int\|long\)/ {
/^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);

} }" \
-e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
-e '/[ ]*cycle_kernel_lock();/d'
else
sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \
-e '/cycle_kernel_lock()/d'
fi

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# ddcf3522 08-Sep-2010 NeilBrown <neilb@suse.de>

md: fix v1.x metadata update when a disk is missing.

If an array with 1.x metadata is assembled with the last disk missing,
md doesn't properly record the fact that the disk was missing.

This is unlikely to cause a real problem as the event count will be
different to the count on the missing disk so it won't be included in
the array. However it could still cause confusion.

So make sure we clear all the relevant slots, not just the early ones.

Signed-off-by: NeilBrown <neilb@suse.de>


# 126925c0 07-Sep-2010 NeilBrown <neilb@suse.de>

md: call md_update_sb even for 'external' metadata arrays.

Now that we depend on md_update_sb to clear variable bits in
mddev->flags (rather than trying not to set them) it is important to
always call md_update_sb when appropriate.

md_check_recovery has this job but explicitly avoids it for ->external
metadata arrays. This is no longer appropriate, or needed.

However we do want to avoid taking the mddev lock if only
MD_CHANGE_PENDING is set as that is not cleared by md_update_sb for
external-metadata arrays.

Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# e9c7469b 03-Sep-2010 Tejun Heo <tj@kernel.org>

md: implement REQ_FLUSH/FUA support

This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.

* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe reconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 070dc6dd 30-Aug-2010 NeilBrown <neilb@suse.de>

md: resolve confusion of MD_CHANGE_CLEAN

MD_CHANGE_CLEAN is used for two different purposes and this leads to
confusion.
One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
not used for anything else, so have MD_CHANGE_PENDING take over that
purpose fully.

The two purposes are:
1/ tell md_update_sb that an update is needed and that it is just a
clean/dirty transition.
2/ tell user-space that a transition from clean to dirty is pending
(something wants to write), and tell the kernel (by clearing the
flag) that the transition is OK.

The first purpose remains with MD_CHANGE_CLEAN, the second is moved
fully to MD_CHANGE_PENDING.

This means that various places which conditionally set or cleared
MD_CHANGE_CLEAN no longer need to be conditional.

Signed-off-by: NeilBrown <neilb@suse.de>


# bd52b746 30-Aug-2010 Dan Williams <dan.j.williams@intel.com>

md: don't clear MD_CHANGE_CLEAN in md_update_sb() for external arrays

If this bit is cleared in md_update_sb() the kernel will allow writes to the
array if userspace triggers md_allow_write(), e.g. through stripe_cache_size,
when mdmon is not active. When mdmon is active the array transitions to
active-idle bypassing write-pending, setting up a race for mdmon to set the
array clean before a write arrives.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3a3a5ddb 16-Aug-2010 NeilBrown <neilb@suse.de>

Update recovery_offset even when external metadata is used.

The update of ->recovery_offset in sync_sbs is appropriate even when external
metadata is in use. However sync_sbs is only called when native
metadata is used.

So move that update in to the top of md_update_sb (which is the only
caller of sync_sbs) before the test on ->external.

This moves the update out of ->write_lock protection, but those fields
only need ->reconfig_mutex protection which they still have.

Also move the test on ->persistent up to where ->external is set as
for metadata update purposes they are the same.

Clear MD_CHANGE_DEVS and MD_CHANGE_CLEAN as they can only be confusing
if ->external is set or ->persistent isn't.

Finally move the update of ->utime down as it is only relevant (like
the ->events update) for native metadata.

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>


# 6e17b027 07-Aug-2010 NeilBrown <neilb@suse.de>

md: clean up do_md_stop

There is only one error exit from do_md_stop, so make that more
explicit and discard the 'err' variable.
Also drop the 'revalidate' variable by moving the unlock calls around.

Signed-off-by: NeilBrown <neilb@suse.de>


# bb4f1e9d 08-Aug-2010 NeilBrown <neilb@suse.de>

md: fix another deadlock with removing sysfs attributes.

Moving the deletion of sysfs attributes from reconfig_mutex to
open_mutex didn't really help as a process can try to take
open_mutex while holding reconfig_mutex, so the same deadlock can
happen, just requiring one more process to be involved in the chain.

It looks like I cannot easily use locking to wait for the sysfs
deletion to complete, so don't.

The only things that we cannot do while the deletions are still
pending is other things which can change the sysfs namespace: run,
takeover, stop. Each of these can fail with -EBUSY.
So set a flag while doing a sysfs deletion, and fail run, takeover,
stop if that flag is set.
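
A minimal user-space sketch of the flag-instead-of-lock approach follows.
The struct, the field name sysfs_busy and the sketch_* functions are all
hypothetical stand-ins, not the names used in md.c.

```c
#include <errno.h>
#include <stdbool.h>

struct sketch_mddev {
	bool sysfs_busy; /* hypothetical: set while a sysfs deletion is pending */
	bool running;
};

/* Bracket the deferred sysfs deletion with the flag rather than a mutex. */
void sketch_sysfs_delete_begin(struct sketch_mddev *m)
{
	m->sysfs_busy = true;
}

void sketch_sysfs_delete_end(struct sketch_mddev *m)
{
	m->sysfs_busy = false;
}

/* Operations that change the sysfs namespace refuse to wait; they fail. */
int sketch_run(struct sketch_mddev *m)
{
	if (m->sysfs_busy)
		return -EBUSY;
	m->running = true;
	return 0;
}

int sketch_stop(struct sketch_mddev *m)
{
	if (m->sysfs_busy)
		return -EBUSY;
	m->running = false;
	return 0;
}
```

Because nothing ever sleeps waiting for the deletion, no lock ordering is
involved and the caller can simply retry after -EBUSY.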

This is suitable for 2.6.35.x

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 147e0b6a 06-Aug-2010 Dan Williams <dan.j.williams@intel.com>

md: move revalidate_disk() back outside open_mutex

Commit b821eaa5 "md: remove ->changed and related code" moved
revalidate_disk() under open_mutex, and lockdep noticed.

[ INFO: possible circular locking dependency detected ]
2.6.32-mdadm-locking #1
-------------------------------------------------------
mdadm/3640 is trying to acquire lock:
(&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811acecb>] revalidate_disk+0x5b/0x90

but task is already holding lock:
(&mddev->open_mutex){+.+...}, at: [<ffffffffa055e07a>] do_md_stop+0x4a/0x4d0 [md_mod]

which lock already depends on the new lock.

It is suitable for 2.6.35.x

Cc: <stable@kernel.org>
Reported-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 6e9624b8 07-Aug-2010 Arnd Bergmann <arnd@arndb.de>

block: push down BKL into .open and .release

The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.

This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.

The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.

Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7b6d91da 07-Aug-2010 Christoph Hellwig <hch@lst.de>

block: unify flags for struct bio and struct request

Remove the current bio flags and reuse the request flags for the bio, too.
This makes it easier to trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superfluous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 69e51b44 01-Jun-2010 NeilBrown <neilb@suse.de>

md/bitmap: separate out loading a bitmap from initialising the structures.

dm makes this distinction between ->ctr and ->resume, so we need to
too.

Also get the new bitmap_load to clear out the bitmap first, as this is
most consistent with the dm suspend/resume approach

Signed-off-by: NeilBrown <neilb@suse.de>


# b63d7c2e 01-Jun-2010 NeilBrown <neilb@suse.de>

md/bitmap: clean up plugging calls.

1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
arrays with no queue attached.

2/ Don't bother plugging the queue when we set a bit in the bitmap.
The reason for this was to encourage as many bits as possible to
get set before we unplug and write stuff out.
However every personality already plugs the queue after
bitmap_startwrite either directly (raid1/raid10) or by setting
STRIPE_BIT_DELAY which causes the queue to be plugged later
(raid5).

Signed-off-by: NeilBrown <neilb@suse.de>


# 252ac522 01-Jun-2010 NeilBrown <neilb@suse.de>

md/plug: optionally use plugger to unplug an array during resync/recovery.

If an array doesn't have a 'queue' then md_do_sync cannot
unplug it.
In that case it will have a 'plugger', so make that available
to the mddev, and use it to unplug the array if needed.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2ac87401 01-Jun-2010 NeilBrown <neilb@suse.de>

md/raid5: add simple plugging infrastructure.

md/raid5 uses the plugging infrastructure provided by the block layer
and 'struct request_queue'. However when we plug raid5 under dm there
is no request queue so we cannot use that.

So create a similar infrastructure that is much lighter weight and use
it for raid5.

Signed-off-by: NeilBrown <neilb@suse.de>


# 768a418d 25-Jul-2010 NeilBrown <neilb@suse.de>

md: add support for raising dm events.

dm uses scheduled work to raise events to user-space.
So allow md device to have work_structs and schedule them on an error.

Signed-off-by: NeilBrown <neilb@suse.de>


# 390ee602 01-Jun-2010 NeilBrown <neilb@suse.de>

md: export various start/stop interfaces

export entry points for starting and stopping md arrays.
This will be used by a module to make md/raid5 work under
dm.
Also stop calling md_stop_writes from md_stop, as that won't
work well with dm - it will want to call the two separately.

Signed-off-by: NeilBrown <neilb@suse.de>


# e8bb9a83 01-Jun-2010 NeilBrown <neilb@suse.de>

md: split out md_rdev_init

This functionality will be needed separately in a subsequent patch, so
split it into its own exported function.

Signed-off-by: NeilBrown <neilb@suse.de>


# 676e42d8 01-Jun-2010 NeilBrown <neilb@suse.de>

md: be more careful setting MD_CHANGE_CLEAN

When MD_CHANGE_CLEAN is set we might block in md_write_start.
So we should only set it when fairly sure that something will clear
it.

There are two places where it is set so as to encourage a metadata
update to record the progress of resync/recovery. This should only
be done if the internal metadata update mechanisms are in use, which
can be tested by inspecting '->persistent'.

Signed-off-by: NeilBrown <neilb@suse.de>


# 00bcb4ac 01-Jun-2010 NeilBrown <neilb@suse.de>

md: reduce dependence on sysfs.

We will want md devices to live as dm targets where sysfs is not
visible. So allow md to not connect to sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>


# 70fffd0b 16-Jun-2010 NeilBrown <neilb@suse.de>

md: Don't update ->recovery_offset when reshaping an array to fewer devices.

When an array is reshaped to have fewer devices, the reshape proceeds
from the end of the devices to the beginning.

If a device happens to be non-In_sync (which is possible but rare)
we would normally update the ->recovery_offset as the reshape
progresses. However that would be wrong as the recovery_offset records
that the early part of the device is in_sync, while in fact it would
only be the later part that is in_sync, and in any case the offset
number would be measured from the wrong end of the device.

Relatedly, if after a reshape a spare is discovered to not be
recovered all the way to the end, do not allow spare_active
to incorporate it in the array.

This becomes relevant in the following sample scenario:

A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined
operation.
The RAID5->RAID6 conversion will cause a 5th drive to be included as a
spare, then the 5drive -> 6drive reshape will effectively rebuild that
spare as it progresses. The 6th drive is treated as in_sync the whole
time as there is never any case that we might consider reading from
it, but must not because there is no valid data.

If we interrupt this reshape part-way through and reverse it to return
to a 5-drive RAID6 (or even a 4-drive RAID5), we don't want to update
the recovery_offset - as that would be wrong - and we don't want to
include that spare as active in the 5-drive RAID6 when the reversed
reshape completes, as it will still be mostly out-of-sync.

Signed-off-by: NeilBrown <neilb@suse.de>


# e93f68a1 15-Jun-2010 NeilBrown <neilb@suse.de>

md: fix handling of array level takeover that re-arranges devices.

Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over. However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared too high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspended.
It can also update some symlinks in sysfs which previously were not
updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# f3b99be1 23-Jun-2010 NeilBrown <neilb@suse.de>

Restore partition detection of newly created md arrays.

Commit b821eaa572fd737faaf6928ba046e571526c36c6 broke partition
detection for md arrays.

The logic was almost right. However if revalidate_disk is called
when the device is not yet open, bdev->bd_disk won't be set, so the
flush_disk() call will not set bd_invalidated.

So when md_open is called we still need to ensure that
->bd_invalidated gets set. This is easily done with a call to
check_disk_size_change in the place where the offending commit removed
check_disk_change. At the important times, the size will have changed
from 0 to non-zero, so check_disk_size_change will set bd_invalidated.

Tested-by: Duncan <1i5t5.duncan@cox.net>
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3ff195b0 30-Mar-2010 Eric W. Biederman <ebiederm@xmission.com>

sysfs: Implement sysfs tagged directory support.

The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.

What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.

I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.

For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.

To achieve a race free design all tagged directories are created
and managed by sysfs itself.

Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and its operations
- sysfs_exit_ns when an individual tag is no longer valid

- Implement mount_ns() which returns the ns of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a sysfs kobject.

Everything else is left up to sysfs and the driver layer.

For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.

Tags are currently represented as const void * pointers as that is
both generic, provides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.

The work needed in sysfs is more extensive. At each directory
or symlink creation I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.

Currently only directories which hold kobjects, and
symlinks are supported. There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# be6800a7 17-May-2010 NeilBrown <neilb@suse.de>

md: don't insist on valid event count for spare devices.

Devices which know that they are spares do not really need to have
an event count that matches the rest of the array, so there are no
data-in-sync issues. It is enough that the uuid matches.
So remove the requirement that the event count is up-to-date.

We currently still write out an event count on spares, but this
allows us in a year or 3 to stop doing that completely.

Signed-off-by: NeilBrown <neilb@suse.de>


# a8707c08 17-May-2010 NeilBrown <neilb@suse.de>

md: simplify updating of event count to sometimes avoid updating spares.

When updating the event count for a simple clean <-> dirty transition,
we try to avoid updating the spares so they can safely spin-down.
As the event_counts across an array must be +/- 1, this means
decrementing the event_count on a dirty->clean transition.
This is not always safe and we have to avoid the unsafe time.
We currently do this with a misguided idea about it being safe or
not depending on whether the event_count is odd or even. This
approach only works reliably in a few common instances, but easily
falls down.

So instead, simply keep internal state concerning whether it is safe
or not, and always assume it is not safe when an array is first
assembled.
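
The bookkeeping can be sketched in user-space C as below. The struct, the
can_decrease field and the sketch_* helpers are hypothetical names chosen
for illustration; the exact conditions used in md_update_sb may differ.

```c
#include <stdbool.h>

struct sketch_array {
	unsigned long long events;
	bool can_decrease; /* hypothetical: is a decrement currently safe? */
};

/* On assembly nothing is known about the spares, so assume "not safe". */
void sketch_assemble(struct sketch_array *a, unsigned long long events)
{
	a->events = events;
	a->can_decrease = false;
}

/*
 * skip_spares: the update is a simple clean <-> dirty transition that
 *              the spares need not see.
 * to_clean:    the transition is dirty -> clean.
 */
void sketch_update_events(struct sketch_array *a, bool skip_spares, bool to_clean)
{
	if (skip_spares && to_clean && a->can_decrease) {
		/* Undo the previous increment so spares stay within +/- 1. */
		a->events--;
		a->can_decrease = false;
	} else {
		a->events++;
		/* A decrement is only safe right after an increment that skipped spares. */
		a->can_decrease = skip_spares;
	}
}
```

The point of the explicit state is that the decision no longer depends on
whether the event count happens to be odd or even.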

Signed-off-by: NeilBrown <neilb@suse.de>


# 75a73a29 07-May-2010 NeilBrown <neilb@suse.de>

md: restore ability of spare drives to spin down.

Some time ago we stopped the clean/active metadata updates
from being written to a 'spare' device in most cases so that
it could spin down and stay spun down. Device failure/removal
etc are still recorded on spares.

However commit 51d5668cb2e3fd1827a55 broke this 50% of the time,
depending on whether the event count is even or odd.
The change log entry said:

This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain,

however the code makes no attempt to create that alignment, so it
could take arbitrarily long.

So when we find that clean/dirty is not aligned with odd/even,
force a second metadata-update immediately. There are already cases
where a second metadata-update is needed immediately (e.g. when a
device fails during the metadata update). We just piggy-back on that.

Reported-by: Joe Bryant <tenminjoe@yahoo.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# f2859af6 02-May-2010 Dan Williams <dan.j.williams@intel.com>

md: allow integers to be passed to md/level

e.g. allow md to interpret 'echo 4 > md/level' as a request for raid4.
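
One way to accept either form is to try the string as a decimal integer
and match on name or number. The following user-space sketch assumes a
small hypothetical personality table and a hypothetical sketch_find_level
helper; it does not reproduce the actual level_store code, and it does not
strip the trailing newline a sysfs write would normally carry.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct sketch_pers {
	const char *name;
	int level;
};

/* Hypothetical table standing in for the registered personalities. */
static const struct sketch_pers sketch_table[] = {
	{ "raid0", 0 }, { "raid1", 1 }, { "raid4", 4 },
	{ "raid5", 5 }, { "raid6", 6 }, { "raid10", 10 },
};

/* Return the personality name for "raid4" or "4", or NULL if unknown. */
const char *sketch_find_level(const char *req)
{
	char *end;
	long level;
	int numeric;
	size_t i;

	errno = 0;
	level = strtol(req, &end, 10);
	/* Only treat the request as a number if the whole string parsed. */
	numeric = (end != req && *end == '\0' && errno == 0);

	for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++) {
		if (strcmp(sketch_table[i].name, req) == 0)
			return sketch_table[i].name;
		if (numeric && level == sketch_table[i].level)
			return sketch_table[i].name;
	}
	return NULL;
}
```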

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# bb7f8d22 01-May-2010 Dan Williams <dan.j.williams@intel.com>

md: notify mdstat waiters of level change

Level modifications change the output of mdstat. The mdmon manager
thread is interested in these events for external metadata management.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 9e35b99c 05-Apr-2010 NeilBrown <neilb@suse.de>

md: don't unregister the thread in mddev_suspend

This is
- unnecessary because mddev_suspend is always followed by a call to
->stop, and each ->stop unregisters the thread, and
- a problem as it makes it awkward to suspend and then resume a
device as we will want later.

Signed-off-by: NeilBrown <neilb@suse.de>


# fafd7fb0 31-Mar-2010 NeilBrown <neilb@suse.de>

md: factor out init code for an mddev

This is a simple factorisation that makes mddev_find easier to read.


Signed-off-by: NeilBrown <neilb@suse.de>


# 21a52c6d 31-Mar-2010 NeilBrown <neilb@suse.de>

md: pass mddev to make_request functions rather than request_queue

We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
a lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>


# cca9cf90 31-Mar-2010 NeilBrown <neilb@suse.de>

md: call md_stop_writes from md_stop

This moves the call to the other side of set_readonly, but that should
not be an issue.
This encapsulates in 'md_stop' all of the functionality for internally
stopping the array, leaving all the interactions with externalities
(sysfs, request_queue, gendisk) in do_md_stop.

Signed-off-by: NeilBrown <neilb@suse.de>


# a4bd82d0 28-Mar-2010 NeilBrown <neilb@suse.de>

md: split md_set_readonly out of do_md_stop

Using do_md_stop to set an array to read-only is a little confusing.
Now most of the common code has been factored out, split
md_set_readonly off in to a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>


# a047e125 28-Mar-2010 NeilBrown <neilb@suse.de>

md: factor md_stop_writes out of do_md_stop.

Further refactoring of do_md_stop.
This one requires some explanation as it takes code from different
places in do_md_stop, so some re-ordering happens.

We only get into this part of do_md_stop if there are no active opens
of the device, so no writes can be happening and the device must have
been flushed. In md_stop_writes we want to stop any internal sources
of writes - i.e. resync - and flush out the metadata.

The only code that was previously before some of this code is
code to clean up the queue, the mddev, the gendisk, or sysfs, all
of which is probably better after code that makes active changes (i.e.
triggers writes).

Signed-off-by: NeilBrown <neilb@suse.de>


# 6177b472 28-Mar-2010 NeilBrown <neilb@suse.de>

md: start to refactor do_md_stop

do_md_stop is large and clunky, so hard to understand.

This is a first step of refactoring, pulling two simple
sub-functions out.

Signed-off-by: NeilBrown <neilb@suse.de>


# fe60b014 28-Mar-2010 NeilBrown <neilb@suse.de>

md: factor do_md_run to separate accesses to ->gendisk

As part of relaxing the binding between an mddev and gendisk,
we separate do_md_run into two functions.
md_run does all the work internal to md
do_md_run calls md_run and makes any changes to gendisk
that are required.

Signed-off-by: NeilBrown <neilb@suse.de>


# b821eaa5 28-Mar-2010 NeilBrown <neilb@suse.de>

md: remove ->changed and related code.

We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.

Now that we call revalidate_disk directly, which sets bd_invalidated,
that indirection is no longer needed and can be removed.

Signed-off-by: NeilBrown <neilb@suse.de>


# 49ce6cea 28-Mar-2010 NeilBrown <neilb@suse.de>

md: don't reference gendisk in getgeo

Using ->array_sectors rather than get_capacity() is more
direct and is a step towards relaxing the tight connection
between mddev and gendisk.

Signed-off-by: NeilBrown <neilb@suse.de>


# 49077326 24-Mar-2010 NeilBrown <neilb@suse.de>

md: move io accounting out of personalities into md_make_request

While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5cac7861 14-Apr-2010 Maciej Trela <Maciej.Trela@intel.com>

md: notify level changes through sysfs.

Level changes can be very significant, so make sure
to notify them via sysfs.

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 233fca36 14-Apr-2010 NeilBrown <neilb@suse.de>

md: Relax checks on ->max_disks when external metadata handling is used.

When metadata is being managed by user-space, md doesn't know
what the maximum number of devices allowed in an array is
so ->max_disks is 0. In this case we should allow any (+ve)
number of disks.
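
The relaxed check amounts to treating a limit of 0 as "unknown". A
user-space sketch, with a hypothetical helper name and an assumed
inclusive upper bound (the real comparison in md.c may differ at the
boundary):

```c
#include <stdbool.h>

/* max_disks == 0 means the metadata handler did not tell us a limit. */
bool sketch_raid_disks_ok(int raid_disks, int max_disks)
{
	if (raid_disks <= 0)
		return false;	/* only positive counts are ever valid */
	if (max_disks == 0)
		return true;	/* external metadata: no known limit */
	return raid_disks <= max_disks;
}
```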

Signed-off-by: NeilBrown <neilb@suse.de>


# b7103107 14-Apr-2010 Maciej Trela <Maciej.Trela@intel.com>

md: Correctly handle device removal via sysfs

Writing "none" to "../md/dev-xx/slot" removes that device
from being an active part of the array, but it didn't
set ->raid_disk to -1 to record this fact.


Signed-off-by: Maciej Trela <Maciej.Trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 9af204cf 07-Mar-2010 Trela, Maciej <Maciej.Trela@intel.com>

md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover


Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 54071b38 07-Mar-2010 Trela Maciej <Maciej.Trela@intel.com>

md:Add support for Raid0->Raid5 takeover

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# c0cc75f8 21-Mar-2010 NeilBrown <neilb@suse.de>

md: discard StateChanged device flag.

This was needed when sysfs files could only be 'notified'
from process context. Now that we have sysfs_notify_dirent,
we can call it directly from an interrupt.

Signed-off-by: NeilBrown <neilb@suse.de>


# a64c876f 14-Apr-2010 NeilBrown <neilb@suse.de>

md: manage redundancy group in sysfs when changing level.

Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.

This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.

When changing levels, we may need to add or remove attributes. When
changing RAID5 -> RAID6, we both add and remove the same thing. It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.
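
The add-and-remove case can be pictured as cancelling a pending removal
instead of adding the same group again. In this user-space sketch the
struct layouts and the sketch_want_group helper are hypothetical; only the
idea of a ->to_remove pointer comes from the description above.

```c
#include <stddef.h>

struct sketch_group {
	const char *name;
};

struct sketch_mddev {
	/* Group whose removal is deferred until the lock is released. */
	const struct sketch_group *to_remove;
};

/*
 * Called when the new level wants group g to exist.
 * Returns 1 if the caller must create the group now, or 0 if a pending
 * removal of the very same group was cancelled and nothing needs doing.
 */
int sketch_want_group(struct sketch_mddev *m, const struct sketch_group *g)
{
	if (m->to_remove == g) {
		m->to_remove = NULL;	/* add + remove of the same thing: skip both */
		return 0;
	}
	return 1;
}
```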


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# b6eb127d 14-Apr-2010 NeilBrown <neilb@suse.de>

md: remove unneeded sysfs files more promptly

When an array is stopped we need to remove some
sysfs files which are dependent on the type of array.

We need to delay that deletion as deleting them while holding
reconfig_mutex can lead to deadlocks.

We currently delay them until the array is completely destroyed.
However it is possible to deactivate and then reactivate the array.
It is also possible to need to remove sysfs files when changing level,
which can potentially happen several times before an array is
destroyed.

So we need to delete these files more promptly: as soon as
reconfig_mutex is dropped.

We need to ensure this happens before do_md_run can restart the array,
so we use open_mutex for some extra locking. This is not deadlock
prone.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# e2218350 11-May-2010 Dan Williams <dan.j.williams@intel.com>

md: set mddev readonly flag on blkdev BLKROSET ioctl

When the user sets the block device to readwrite then the mddev should
follow suit. Otherwise, the BUG_ON in md_write_start() will be set to
trigger.

The reverse direction, setting mddev->ro to match a set readonly
request, can be ignored because the blkdev level readonly flag precludes
the need to have mddev->ro set correctly. Nevermind the fact that
setting mddev->ro to 1 may fail if the array is in use.

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 1176568d 07-May-2010 NeilBrown <neilb@suse.de>

md: restore ability of spare drives to spin down.

Some time ago we stopped the clean/active metadata updates
from being written to a 'spare' device in most cases so that
it could spin down and stay spun down. Device failure/removal
etc are still recorded on spares.

However commit 51d5668cb2e3fd1827a55 broke this 50% of the time,
depending on whether the event count is even or odd.
The change log entry said:

This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain,

however the code makes no attempt to create that alignment, so it
could take arbitrarily long.

So when we find that clean/dirty is not aligned with odd/even,
force a second metadata-update immediately. There are already cases
where a second metadata-update is needed immediately (e.g. when a
device fails during the metadata update). We just piggy-back on that.

Reported-by: Joe Bryant <tenminjoe@yahoo.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 52cf25d0 18-Jan-2010 Emese Revfy <re.emese@gmail.com>

Driver core: Constify struct sysfs_ops in struct kobj_type

Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime

* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime

* potentially better optimized code as the compiler
can assume that the const data cannot be changed

* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# ef286f6f 08-Feb-2010 NeilBrown <neilb@suse.de>

md: fix some lockdep issues between md and sysfs.

======
This fix is related to
http://bugzilla.kernel.org/show_bug.cgi?id=15142
but does not address that exact issue.
======

sysfs does not like attributes being removed while they are being accessed
(i.e. read or written) and waits for the access to complete.

As accessing some md attributes takes the same lock that is held while
removing those attributes a deadlock can occur.

This patch addresses 3 issues in md that could lead to this deadlock.

Two relate to calling flush_scheduled_work while the lock is held.
This is probably a bad idea in general and as we use schedule_work to
delete various sysfs objects it is particularly bad.

In one case flush_scheduled_work is called from md_alloc (called by
md_probe) called from do_md_run which holds the lock. This call is
only present to ensure that ->gendisk is set. However we can be sure
that gendisk is always set (though possibly we couldn't when that code
was originally written). This is because do_md_run is called in three
different contexts:
1/ from md_ioctl. This requires that md_open has succeeded, and it
fails if ->gendisk is not set.
2/ from writing a sysfs attribute. This can only happen if the
mddev has been registered in sysfs which happens in md_alloc
after ->gendisk has been set.
3/ from autorun_array which is only called by autorun_devices, which
checks for ->gendisk to be set before calling autorun_array.
So the call to md_probe in do_md_run can be removed, and the check on
->gendisk can also go.


In the other case flush_scheduled_work is being called in do_md_stop,
purportedly to wait for all md_delayed_delete calls (which delete the
component rdevs) to complete. However there really isn't any need to
wait for them - they have already been disconnected in all important
ways.

The third issue is that raid5->stop() removes some attribute names
while the lock is held. There is already some infrastructure in place
to delay attribute removal until after the lock is released (using
schedule_work). So extend that infrastructure to remove the
raid5_attrs_group.

This does not address all lockdep issues related to the sysfs
"s_active" lock. The rest can be addressed by splitting that lockdep
context between symlinks and non-symlinks which hopefully will happen.

Signed-off-by: NeilBrown <neilb@suse.de>


# 404e4b43 29-Dec-2009 NeilBrown <neilb@suse.de>

md: allow a resync that is waiting for other resync to complete, to be aborted.

If two arrays share a device, then they will not both resync at the
same time. One will wait for the other to complete.
While waiting, the MD_RECOVERY_INTR flag is not checked so a device
failure, which would make the resync pointless, does not cause the
resync to abort, so the failed device cannot be removed (as it cannot
be removed while a resync is happening).

So add a test for MD_RECOVERY_INTR.

Reported-by: Brett Russ <bruss@netezza.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 7fb9dadc 29-Dec-2009 NeilBrown <neilb@suse.de>

md: remove unnecessary code from do_md_run

Since commit dfc7064500061677720fa26352963c772d3ebe6b,
->hot_remove_disks has not removed non-failed devices from
an array until recovery is no longer possible.
So the code in do_md_run to get around the fact that
md_check_recovery (which calls ->hot_remove_disks) would
remove partially-in-sync devices is no longer needed.

So remove it.

Signed-off-by: NeilBrown <neilb@suse.de>


# a2d79c32 21-Dec-2009 Dan Williams <dan.j.williams@intel.com>

md: make recovery started by do_md_run() visible via sync_action

By default md_do_sync() will perform recovery if no other actions are
specified. However, action_show() relies on MD_RECOVERY_RECOVER to be
set otherwise it returns 'idle'. So, add a missing set
MD_RECOVERY_RECOVER when starting recovery.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0f9552b5 29-Dec-2009 NeilBrown <neilb@suse.de>

md: fix small irregularity with start_ro module parameter

The start_ro module parameter can be used to force arrays to be
started in 'auto-readonly' mode, in which they are read-only until the first
write. This ensures that no resync/recovery happens until something
else writes to the device. This is important for resume-from-disk
off an md array.

However if an array is started 'readonly' (by writing 'readonly' to
the 'array_state' sysfs attribute) we want it to be really 'readonly',
not 'auto-readonly'.

So strengthen the condition to only set auto-readonly if the
array is not already read-only.
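
The strengthened condition can be sketched roughly as below. The
numeric states (0 read-write, 1 read-only, 2 auto-readonly) and the
function name are assumptions made for illustration, not taken from
this log.

```c
/* Illustrative state values; assumed, not quoted from md.c. */
enum ro_state { RO_RW = 0, RO_READONLY = 1, RO_AUTO = 2 };

/*
 * Hypothetical helper: return the state an array should start in.
 * Auto-readonly is applied only when the array is still read-write,
 * so an explicit 'readonly' request is left alone.
 */
int apply_start_ro(int start_readonly, int ro)
{
	if (start_readonly && ro == RO_RW)
		return RO_AUTO;
	return ro;
}
```

With this shape, an array already marked read-only via 'array_state'
keeps that state even when start_ro is set.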

Signed-off-by: NeilBrown <neilb@suse.de>


# cbd19983 29-Dec-2009 NeilBrown <neilb@suse.de>

md: Fix unfortunate interaction with evms

evms configures md arrays by:
open device
send ioctl
close device

for each different ioctl needed.
Since 2.6.29, the device can disappear after the 'close'
unless a significant configuration has happened to the device.
The change made by "SET_ARRAY_INFO" can be too minor to stop the device
from disappearing, but important enough that losing the change is bad.

So: make sure SET_ARRAY_INFO sets mddev->ctime, and keep the device
active as long as ctime is non-zero (it gets zeroed with lots of other
things when the array is stopped).

This is suitable for -stable kernels since 2.6.29.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# 7b75c2f8 14-Dec-2009 Joe Perches <joe@perches.com>

drivers/md/md.c: use %pU to print UUIDs

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e7d2860b 14-Dec-2009 André Goddard Rosa <andre.goddard@gmail.com>

tree-wide: convert open calls to remove spaces to skip_spaces() lib function

Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.

It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
text data bss dec hex filename
64688 584 592 65864 10148 (TOTALS-BEFORE)
64641 584 592 65817 10119 (TOTALS-AFTER)

Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".

Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
drivers/leds/led-class.c
drivers/leds/ledtrig-timer.c
drivers/video/output.c

@@
expression str;
@@

( // ignore skip_spaces cases
while (*str && isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)
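
The redundancy argument can be checked with a small userspace sketch.
The two function names are made up for this example; only isspace()
from <ctype.h> is a real library call.

```c
#include <ctype.h>

/* Original pattern: explicit NUL test before isspace(). */
const char *skip_spaces_verbose(const char *str)
{
	while (*str && isspace((unsigned char)*str))
		str++;
	return str;
}

/*
 * Simplified pattern: isspace('\0') is 0, so the loop stops at the
 * string terminator without the extra test.
 */
const char *skip_spaces_short(const char *str)
{
	while (isspace((unsigned char)*str))
		str++;
	return str;
}
```

Both variants stop at the same character for any NUL-terminated string,
which is why the first condition can be dropped.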

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06e3c817 12-Dec-2009 Dan Williams <dan.j.williams@intel.com>

md: add 'recovery_start' per-device sysfs attribute

Enable external metadata arrays to manage rebuild checkpointing via a
md/dev-XXX/recovery_start attribute which reflects rdev->recovery_offset

Also update resync_start_store to allow 'none' to be written, for
consistency.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 4e59ca7d 12-Dec-2009 Dan Williams <dan.j.williams@intel.com>

md: rcu_read_lock() walk of mddev->disks in md_do_sync()

Other walks of this list are either under rcu_read_lock() or the list
mutation lock (mddev_lock()). This protects against the improbable case of a
disk being removed from the array at the start of md_do_sync().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 93be75ff 13-Dec-2009 NeilBrown <neilb@suse.de>

md: integrate spares into array at earliest opportunity.

As v1.x metadata can record that a member of the array is
not completely recovered, it makes sense to record that a
spare has become a regular member of the array at the earliest
opportunity.
So remove the tests on "recovery_offset > 0" in super_1_sync
as they really aren't needed, and schedule a metadata update
immediately after adding spares to a degraded array.

This means that if a crash happens immediately after a recovery
starts, the new device will be included in the array and recovery will
continue from wherever it was up to. Previously this didn't happen
unless recovery was at least 1/16 of the way through.

Signed-off-by: NeilBrown <neilb@suse.de>


# aa98aa31 13-Dec-2009 Arnd Bergmann <arnd@arndb.de>

md: move compat_ioctl handling into md.c

The RAID ioctls are only implemented in md.c, so the
handling for them should also be moved there from
fs/compat_ioctl.c.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Andre Noll <maan@systemlinux.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 0efb9e61 13-Dec-2009 NeilBrown <neilb@suse.de>

md: add MODULE_DESCRIPTION for all md related modules.

Suggested by Oren Held <orenhe@il.ibm.com>

Signed-off-by: NeilBrown <neilb@suse.de>


# 1e50915f 13-Dec-2009 Robert Becker <Rob.Becker@riverbed.com>

raid: improve MD/raid10 handling of correctable read errors.

We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt to call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts. Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.
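
A minimal sketch of the decay rule described above, assuming whole
hours and a plain integer counter. The helper names and parameters are
hypothetical and are not the code added by this patch.

```c
/*
 * Hypothetical helper: halve the error count once for every full hour
 * that has passed since the last read error.
 */
unsigned int decay_read_errors(unsigned int read_errors,
			       unsigned long hours_since_last)
{
	/* Shifting by the type width or more is undefined in C; clamp. */
	if (hours_since_last >= sizeof(read_errors) * 8)
		return 0;
	return read_errors >> hours_since_last;
}

/* Returns 1 if the decayed count exceeds the configured threshold. */
int exceeds_read_error_limit(unsigned int read_errors,
			     unsigned long hours_since_last,
			     unsigned int max_read_errors)
{
	return decay_read_errors(read_errors, hours_since_last) >
	       max_read_errors;
}
```

For example, 16 recorded errors decay to 4 after two quiet hours, so a
drive with occasional media errors is not failed for old history.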

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 43a70507 13-Dec-2009 NeilBrown <neilb@suse.de>

md: support updating bitmap parameters via sysfs.

A new attribute directory 'bitmap' in 'md' is created which
contains files for configuring the bitmap.
'location' identifies where the bitmap is, either 'none',
or 'file' or 'sector offset from metadata'.
Writing 'location' can create or remove a bitmap.
Adding a 'file' bitmap this way is not yet supported.
'chunksize' and 'time_base' must be set before 'location'
can be set.

'chunksize' can be set before creating a bitmap, but is
currently always over-ridden by the bitmap superblock.

'time_base' and 'backlog' can be updated at any time.


Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>


# 72e02075 13-Dec-2009 NeilBrown <neilb@suse.de>

md: factor out parsing of fixed-point numbers

safe_delay_store can parse fixed point numbers (for fractions
of a second). We will want to do that for another sysfs
file soon, so factor out the code.
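
As an illustration of what such a fixed-point parser does, here is a
userspace sketch. The name parse_scaled, its signature and its error
convention are assumptions for this example; it is not the function
factored out by this patch.

```c
#include <ctype.h>

/*
 * Hypothetical helper: parse "int[.frac]" into value * 10^scale.
 * Digits beyond 'scale' decimal places are ignored (truncated).
 * Returns 0 on success, -1 on malformed input. No overflow check.
 */
int parse_scaled(const char *cp, unsigned long *res, int scale)
{
	unsigned long result = 0;
	long decimals = -1;	/* -1 means no '.' has been seen yet */
	int seen_digit = 0;

	while (isdigit((unsigned char)*cp) || (*cp == '.' && decimals < 0)) {
		if (*cp == '.') {
			decimals = 0;
		} else if (decimals < scale) {
			result = result * 10 + (unsigned long)(*cp - '0');
			if (decimals >= 0)
				decimals++;
			seen_digit = 1;
		}
		cp++;
	}
	if (*cp == '\n')	/* tolerate the newline sysfs writes carry */
		cp++;
	if (*cp != '\0' || !seen_digit)
		return -1;
	if (decimals < 0)
		decimals = 0;
	while (decimals < scale) {	/* pad missing decimal places */
		result *= 10;
		decimals++;
	}
	*res = result;
	return 0;
}
```

With scale 3 the string "1.25" becomes 1250, i.e. seconds expressed in
milliseconds.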

Signed-off-by: NeilBrown <neilb@suse.de>


# 42a04b50 13-Dec-2009 NeilBrown <neilb@suse.de>

md: move offset, daemon_sleep and chunksize out of bitmap structure

... and into bitmap_info. These are all configuration parameters
that need to be set before the bitmap is created.

Signed-off-by: NeilBrown <neilb@suse.de>


# c3d9714e 13-Dec-2009 NeilBrown <neilb@suse.de>

md: collect bitmap-specific fields into one structure.

In preparation for making bitmap fields configurable via sysfs,
start tidying up by making a single structure to contain the
configuration fields.

Signed-off-by: NeilBrown <neilb@suse.de>


# a2826aa9 13-Dec-2009 NeilBrown <neilb@suse.de>

md: support barrier requests on all personalities.

Previously barriers were only supported on RAID1. This is because
other levels require synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeeds, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>


# efa59339 13-Dec-2009 NeilBrown <neilb@suse.de>

md: don't reset curr_resync_completed after an interrupted resync

If a resync/recovery/check/repair is interrupted for some reason, it
can be useful to know exactly where it got up to.
So in that case, do not clear curr_resync_completed.
Initialise it when starting a resync/recovery/... instead.

Signed-off-by: NeilBrown <neilb@suse.de>


# c07b70ad 13-Dec-2009 NeilBrown <neilb@suse.de>

md: adjust resync_min usefully when resync aborts.

When a 'check' or 'repair' finishes we should clear resync_min
so that a future check/repair will cover the whole array (by default).
However if it is interrupted, we should update resync_min to
where we got up to, so that when the check/repair continues it
just does the remainder of the array.

Signed-off-by: NeilBrown <neilb@suse.de>


# aa5cbd10 13-Dec-2009 NeilBrown <neilb@suse.de>

md/bitmap: protect against bitmap removal while being updated.

A write intent bitmap can be removed from an array while the
array is active.
When this happens, all IO is suspended and flushed before the
bitmap is removed.
However it is possible that bitmap_daemon_work is still running to
clear old bits from the bitmap. If it is, it can dereference the
bitmap after it has been freed.

So introduce a new mutex to protect bitmap_daemon_work and get it
before destroying a bitmap.

This is suitable for any current -stable kernel.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# 6d456111 16-Nov-2009 Eric W. Biederman <ebiederm@xmission.com>

sysctl: Drop & in front of every proc_handler.

For consistency drop & in front of every proc_handler. Explicitly
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 0261cd9f 12-Nov-2009 NeilBrown <neilb@suse.de>

md: allow v0.91 metadata to record devices as being active but not in-sync.

This is a combination that didn't really make sense before.
However when a reshape is converting e.g. raid5 -> raid6, the extra
device is not fully in-sync, but is certainly active and contains
important data.
So allow that state to be meaningful and in particular get
the 'recovery_offset' value (which is needed for any non-in-sync
active device) from the reshape_position.

Signed-off-by: NeilBrown <neilb@suse.de>


# 894d2491 05-Nov-2009 Eric W. Biederman <ebiederm@xmission.com>

sysctl drivers: Remove dead binary sysctl support

Now that sys_sysctl is a wrapper around /proc/sys all of
the binary sysctl support elsewhere in the tree is
dead code.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Neil Brown <neilb@suse.de>
Cc: "James E.J. Bottomley" <James.Bottomley@suse.de>
Acked-by: Clemens Ladisch <clemens@ladisch.de> for drivers/char/hpet.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 5e865106 11-Nov-2009 NeilBrown <neilb@suse.de>

md: factor out updating of 'recovery_offset'.

Each device has its own 'recovery_offset' showing how far
recovery has progressed on the device.
As the only real significance of this is the fact that it can
be stored in the metadata and recovered at restart, and as
only 1.x metadata can do this, we were only updating
'recovery_offset' to 'curr_resync_completed' when updating
v1.x metadata.
But this is wrong, and we will shortly make limited use of this
field in v0.90 metadata.

So move the update into common code.

Signed-off-by: NeilBrown <neilb@suse.de>


# 24395a85 05-Nov-2009 NeilBrown <neilb@suse.de>

md: don't clear endpoint for resync when resync is interrupted.

If a 'sync_max' has been set (via sysfs), it is wrong to clear it
until a resync (or reshape or recovery ...) actually reached that
point.
So if a resync is interrupted (e.g. by device failure),
leave 'resync_max' unchanged.

This is particularly important for 'reshape' operations that do not
change the size of the array. For such operations mdadm needs to
monitor the reshape taking rolling backups of the section being
reshaped. If resync_max gets cleared, the reshape can get ahead of
mdadm and then the backups that mdadm creates are useless.

This is suitable for 2.6.31.y stable kernels.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 5e5e3e78 15-Oct-2009 NeilBrown <neilb@suse.de>

md: Fix handling of raid5 array which is being reshaped to fewer devices.

When a raid5 (or raid6) array is being reshaped to have fewer devices,
conf->raid_disks is the latter and hence smaller number of devices.
However sometimes we want to use a number which is the total number of
currently required devices - the larger of the 'old' and 'new' sizes.
Before we implemented reducing the number of devices, this was always
'new' i.e. ->raid_disks.
Now we need max(raid_disks, previous_raid_disks) in those places.

This particularly affects assembling an array that was shutdown while
in the middle of a reshape to fewer devices.

md.c needs a similar fix when interpreting the md metadata.

Signed-off-by: NeilBrown <neilb@suse.de>


# 3fa841d7 23-Sep-2009 NeilBrown <neilb@suse.de>

md: report device as congested when suspended

This should stop writeback from coming while the device is temporarily
suspended.

Signed-off-by: NeilBrown <neilb@suse.de>


# 0da3c619 23-Sep-2009 NeilBrown <neilb@suse.de>

md: Improve name of threads created by md_register_thread

The management threads for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusing.

So change md_register_thread to use the name from the personality
unless an alternate name (like 'resync' or 'reshape') is given.
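
One reading of the naming rule is that an explicit name wins and the
personality name is the fallback. The sketch below shows that reading
in userspace C; the function name and parameters are hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build "<array>_<name>", where <name> is the
 * caller-supplied alternate name if any, else the personality name.
 * Returns what snprintf() returns.
 */
int md_thread_name(char *buf, size_t len, const char *array,
		   const char *alt_name, const char *pers_name)
{
	return snprintf(buf, len, "%s_%s", array,
			alt_name ? alt_name : pers_name);
}
```

Under this reading a raid6 array md0 gets a management thread named
md0_raid6, while its resync thread is md0_resync.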

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a9f326eb 23-Sep-2009 NeilBrown <neilb@suse.de>

md: remove sparse warning "symbol xxx shadows an earlier one"

Rename some variables and remove some duplicate definitions
to avoid these warnings. None of them are actual errors.

Signed-off-by: NeilBrown <neilb@suse.de>


# 83d5cde4 21-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: make block_device_operations const

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80ffb3cc 17-Aug-2009 NeilBrown <neilb@suse.de>

Fix new incorrect error return from do_md_stop.

Recent commit c8c00a6915a2e3d10416e8bdd3138429beb96210
changed the exit paths in do_md_stop and was not quite
careful enough. There is one path where 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report an error: ENXIO.

Signed-off-by: NeilBrown <neilb@suse.de>


# 4d484a4a 12-Aug-2009 NeilBrown <neilb@suse.de>

md: allow upper limit for resync/reshape to be set when array is read-only

Normally we only allow the upper limit for a reshape to be decreased
when the array is not performing a sync/recovery/reshape, otherwise there
could be races. But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.

If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.

So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.

Signed-off-by: NeilBrown <neilb@suse.de>


# 51d5668c 12-Aug-2009 NeilBrown <neilb@suse.de>

md: never advance 'events' counter by more than 1.

When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'. This is to cope with
a system failure between updating the metadata on two different
devices.

However there are currently times when we update the event count by
2. This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
updates to spare devices and so allow those spares to go to sleep.

This is bad for the above reason. So change it to never increase by
two. This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost. The spares will get a few more updates but they will
still be spared (;-) most updates and can still go to sleep.

Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.
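
The assembly-time tolerance can be pictured with a simplified check.
This is only a sketch: the function name is hypothetical and the real
assembly code compares each device against the freshest one rather
than testing arbitrary pairs.

```c
/*
 * Hypothetical check: two members are considered consistent if their
 * event counts differ by at most one.
 */
int events_compatible(unsigned long long a, unsigned long long b)
{
	unsigned long long diff = a > b ? a - b : b - a;

	return diff <= 1;
}
```

A crash between writing event count 10 to one device and 12 to another
would fail this check, which is the case the patch removes by never
stepping by two.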

Signed-off-by: NeilBrown <neilb@suse.de>


# c8c00a69 09-Aug-2009 NeilBrown <neilb@suse.de>

Remove deadlock potential in md_open

A recent commit:
commit 449aad3e25358812c43afc60918c5ad3819488e7

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
takes bd_mutex in the call to revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nested which was introduced
by
commit d63a5a74dee87883fda6b7d170244acaac5b05e8
to avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex into two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL. So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 449aad3e 02-Aug-2009 NeilBrown <neilb@suse.de>

md: Use revalidate_disk to effect changes in size of device.

As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode. So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>


# 70471daf 02-Aug-2009 NeilBrown <neilb@suse.de>

md: Handle growth of v1.x metadata correctly.

The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 3673f305 02-Aug-2009 NeilBrown <neilb@suse.de>

md: avoid array overflow with bad v1.x metadata

We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array. This isn't really safe.
So range-check the value first.
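
The kind of range check meant here can be sketched as follows. The
function name, the parameter names and the 0xffff sentinel are
illustrative assumptions, not a quote of the fix.

```c
#include <stddef.h>
#include <stdint.h>

#define ROLE_UNKNOWN 0xffffu	/* illustrative "no role" sentinel */

/*
 * Hypothetical lookup: only index the role table when the on-disk
 * descriptor number is inside the table; otherwise report no role.
 */
uint16_t role_for_desc(const uint16_t *dev_roles, size_t max_dev,
		       long desc_nr)
{
	if (desc_nr < 0 || (size_t)desc_nr >= max_dev)
		return ROLE_UNKNOWN;
	return dev_roles[desc_nr];
}
```

A corrupted or hostile superblock then yields a harmless sentinel
instead of an out-of-bounds read.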

Signed-off-by: NeilBrown <neilb@suse.de>


# 3a981b03 02-Aug-2009 NeilBrown <neilb@suse.de>

md: when a level change reduces the number of devices, remove the excess.

When an array is changed from RAID6 to RAID5, fewer drives are
needed. So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# ac5e7113 02-Aug-2009 Andre Noll <maan@systemlinux.org>

md: Push down data integrity code to personalities.

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# ad361c98 06-Jul-2009 Joe Perches <joe@perches.com>

Remove multiple KERN_ prefixes from printk formats

Commit 5fd29d6ccbc98884569d6f3105aeca70858b3e0f ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e62e58a5 30-Jun-2009 NeilBrown <neilb@suse.de>

md: use interruptible wait when duration is controlled by userspace.

User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limits, it should use an
interruptible wait so as not to add to the load average, and so as
not to trigger a warning if the wait goes on for too long.

Signed-off-by: NeilBrown <neilb@suse.de>


# 0909dc44 30-Jun-2009 NeilBrown <neilb@suse.de>

md: tidy up error paths in md_alloc

As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea. So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.

Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 1ec22eb2 30-Jun-2009 NeilBrown <neilb@suse.de>

md: fix error path when duplicate name is found on md device creation.

When an md device is created by name (rather than number) we need to
check that the name is not already in use. If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly created mddev.
This patch fixes that.

Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# b8d966ef 30-Jun-2009 NeilBrown <neilb@suse.de>

md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.

If we try to modify one of the md/ sysfs files
suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 0894cc30 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Move check for bitmap presence to personality code.

If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.

Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.

This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.

A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 8190e754 17-Jun-2009 NeilBrown <neilb@suse.de>

md: remove chunksize rounding from common code.

It is easiest to round sizes to multiples of chunk size in
the personality code for those personalities which care.
Those personalities now do the rounding, so we can
remove that function from common code.

Also remove the upper bound on the size of a chunk, and the lower
bound on the size of a device (1 chunk), neither of which really buy
us anything.

Signed-off-by: NeilBrown <neilb@suse.de>


# 1b57f132 17-Jun-2009 NeilBrown <neilb@suse.de>

md: move assignment of ->utime so that it never gets skipped.

Currently the assignment to utime gets skipped for 'external'
metadata. So move it to the top of the function so that it
always gets effected.
This is of largely cosmetic interest. Nothing actually depends
on ->utime being right for external arrays.
"mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
mdadm-3.0, this is not important for external metadata.

Signed-off-by: NeilBrown <neilb@suse.de>


# 8c6ac868 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Push down reconstruction log message to personality code.

Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 50ac168a 17-Jun-2009 NeilBrown <neilb@suse.de>

md: merge reconfig and check_reshape methods.

The difference between these two methods is artificial.
Both check that a pending reshape is valid, and perform any
aspect of it that can be done immediately.
'reconfig' handles chunk size and layout.
'check_reshape' handles raid_disks.

So make them just one method.

Signed-off-by: NeilBrown <neilb@suse.de>


# 597a711b 17-Jun-2009 NeilBrown <neilb@suse.de>

md: remove unnecessary arguments from ->reconfig method.

Passing the new layout and chunksize as args is not necessary as
the mddev has fields for new_chunk and new_layout.

This is preparation for combining the check_reshape and reconfig
methods.

Signed-off-by: NeilBrown <neilb@suse.de>


# 664e7c41 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Convert mddev->new_chunk to sectors.

A straight-forward conversion which gets rid of some
multiplications/divisions/shifts. The patch also introduces a couple
of new ones, most of which are due to conf->chunk_size still being
represented in bytes. This will be cleaned up in subsequent patches.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 9d8f0363 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Make mddev->chunk_size sector-based.

This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics. Since

is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
= is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.
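
The identity above can be checked in user space. This is a minimal sketch, not code from the patch: is_power_of_2() here is a stand-in that mirrors the usual kernel definition, and chunk_pow2_agrees() is a hypothetical helper name.

```c
#include <stdbool.h>

/* User-space stand-in for the kernel's is_power_of_2(). */
static bool is_power_of_2(unsigned long long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical check: a chunk size expressed in 512-byte sectors is a
 * power of two exactly when the same size expressed in bytes is.
 */
static bool chunk_pow2_agrees(unsigned int chunk_sectors)
{
	unsigned long long chunk_bytes = (unsigned long long)chunk_sectors << 9;

	return is_power_of_2(chunk_bytes) == is_power_of_2(chunk_sectors);
}
```

Shifting left by 9 only appends zero bits, so the number of set bits, and hence the power-of-two property, is unchanged.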

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 2ac06c33 16-Jun-2009 raz ben yehuda <raziebe@gmail.com>

md: prepare for non-power-of-two chunk sizes

Remove chunk size check from md as this is now performed in the run
function in each personality.

Replace chunk size power 2 code calculations by a regular division.
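
As an illustration of the second point, the sketch below locates a sector by division and remainder rather than by shift and mask. The struct and function names are hypothetical and do not come from the patch; chunk_sectors is assumed to be non-zero.

```c
#include <stdint.h>

/* Hypothetical result type: where a sector falls relative to chunks. */
struct chunk_pos {
	uint64_t chunk;   /* index of the chunk containing the sector */
	uint32_t offset;  /* sector offset inside that chunk */
};

/*
 * Split a sector number into chunk index and offset using plain
 * division, which works for any non-zero chunk size, not only for
 * powers of two.
 */
static struct chunk_pos locate_sector(uint64_t sector, uint32_t chunk_sectors)
{
	struct chunk_pos pos;

	pos.chunk = sector / chunk_sectors;
	pos.offset = (uint32_t)(sector % chunk_sectors);
	return pos;
}
```

With a power-of-two chunk size the result matches the shift-and-mask form; with a size such as 96 sectors only the division form is correct.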

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>


# b492b852 25-May-2009 NeilBrown <neilb@suse.de>

md: don't use locked_ioctl.

md has no need for the BKL - it does its own locking.
So md_ioctl doesn't need to be a locked_ioctl.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7a91ee1f 25-May-2009 NeilBrown <neilb@suse.de>

md: don't update curr_resync_completed without also updating reshape_position.

In order for the metadata to always be consistent, we mustn't update
curr_resync_completed without also updating reshape_position.

The reshape code updates both at the same time. However since
commit 97e4f42d62badb0f9fbc27c013e89bc1336a03bc
the common md_do_sync will sometimes update curr_resync_completed
but is not in a position to update reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.

Signed-off-by: NeilBrown <neilb@suse.de>


# b6a9ce68 25-May-2009 NeilBrown <neilb@suse.de>

md: export 'frozen' resync state through sysfs

The md resync engine has a 'frozen' state which ensures that
no resync/recovery happens. This is used to avoid races.

Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b69c839 25-May-2009 NeilBrown <neilb@suse.de>

md: improve errno return when setting array_size

Instead of always returning EINVAL if anything goes wrong
when setting the array size, add the option of
E2BIG
if the size requested is too large. This makes it easier
for user-space to be sure what went wrong.
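
The distinction can be sketched as follows. This is a user-space illustration under stated assumptions, not the kernel's store function: check_array_size() and its max_sectors parameter are hypothetical names.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical size check: returns 0 on success, -EINVAL when the input
 * is not a decimal number (optionally followed by a newline), and
 * -E2BIG when the request exceeds the largest size the array can offer.
 */
static int check_array_size(const char *buf, unsigned long long max_sectors,
			    unsigned long long *sectors)
{
	char *end;
	unsigned long long val;

	/* Reject signs, whitespace and empty input up front. */
	if (*buf < '0' || *buf > '9')
		return -EINVAL;

	errno = 0;
	val = strtoull(buf, &end, 10);
	if (errno || (*end != '\0' && *end != '\n'))
		return -EINVAL;

	if (val > max_sectors)
		return -E2BIG;

	*sectors = val;
	return 0;
}
```

A caller can then tell a malformed request apart from a well-formed one that simply asks for too much.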

Signed-off-by: NeilBrown <neilb@suse.de>


# 62e1e389 25-May-2009 NeilBrown <neilb@suse.de>

md: always update level / chunk_size / layout when writing v1.x metadata.

We previously didn't update these fields when writing the metadata
because they could never change. They can now, so we better write
them.
v0.90 metadata always updated these fields.

Signed-off-by: NeilBrown <neilb@suse.de>


# e1defc4f 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com>

block: Do away with the notion of hardsect_size

Until now we have had a 1:1 mapping between storage device physical
block size and the logical block size used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# c4647292 06-May-2009 NeilBrown <neilb@suse.de>

md: remove rd%d links immediately after stopping an array.

md maintains links in sys/mdXX/md/ to identify which device has
which role in the array. e.g.
rd2 -> dev-sda

indicates that the device with role '2' in the array is sda.

These links are only present when the array is active. They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.

So move the removal earlier so they are consistently only present when
the array is active.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5bf29597 06-May-2009 NeilBrown <neilb@suse.de>

md: remove ability to explicitly set an inactive array to 'clean'.

Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.

It is unnecessary because the same can be achieved by writing
'active'. This activates an array, but it still remains 'clean'
until the first write.

It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').

Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races: One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) an active array. Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.

So just disable the use of 'clean' to activate an array.

This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it is suitable for -stable.

Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 110518bc 06-May-2009 Jan Engelhardt <jengelh@medozas.de>

md: constify VFTs


Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# dd71cf6b 06-May-2009 NeilBrown <neilb@suse.de>

md: tidy up status_resync to handle large arrays.

Two problems in status_resync.
1/ It still used Kilobytes as the basic block unit, while most code
now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
the range of "unsigned long".

So
- change "max_blocks" to "max_sectors", and store sector numbers
in there and in 'resync'
- Make 'rt' a 'sector_t' so it can temporarily hold the number of
remaining sectors.
- use sector_div rather than normal division.
- change the magic '100' used to preserve precision to '32'.
+ making it a power of 2 makes division easier
+ it doesn't need to be as large as it was chosen when we averaged
speed over the entire run. Now we average speed over the last 30
seconds or so.
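
A rough user-space sketch of the remaining-time estimate described above follows. The function name and parameters are assumptions chosen for illustration, and plain 64-bit division stands in for sector_div(); the exact expression in the patch may differ.

```c
#include <stdint.h>

/*
 * Hypothetical estimate of seconds remaining in a resync.
 *   max_sectors: total sectors to process
 *   resync:      sectors already processed
 *   db:          sectors completed during the last measurement window
 *   dt:          length of that window in seconds (non-zero)
 * The speed is scaled by 32, a power of two, so the final correction
 * is a cheap right shift by 5.
 */
static uint64_t remaining_seconds(uint64_t max_sectors, uint64_t resync,
				  uint64_t db, uint64_t dt)
{
	uint64_t rt = max_sectors - resync;	/* remaining sectors */

	rt /= db / 32 + 1;	/* the +1 avoids division by zero */
	rt *= dt;
	rt >>= 5;		/* undo the scaling by 32 */
	return rt;
}
```

The result is approximate by design: for 3200000 remaining sectors at 3200 sectors per 10 seconds it gives 9900 rather than the exact 10000.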

Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>


# c03f6a19 16-Apr-2009 NeilBrown <neilb@suse.de>

md: update sync_completed and reshape_position even more often.

There are circumstances when a user-space process might need to
"oversee" a resync/reshape process. For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).

The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
suspend IO to the first section and back it up
set 'sync_max' to the end of the section
wait for 'sync_completed' to reach that point
resume IO on the first section and move on to the next section.

However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.

It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.

One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that). This defeats some of the double buffering.

With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.

To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max. This will usually be a good time for user
space to update sync_max.

If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops. So the update rate will be unreasonably fast only for an
insignificant period of time.
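
The halfway rule can be simulated in user space. The sketch below is an illustration only: count_completed_updates() is a hypothetical name, the position advances in fixed steps, and step is assumed to be non-zero.

```c
#include <stdint.h>

/*
 * Hypothetical simulation: advance the sync position up to sync_max and
 * count how often sync_completed would be refreshed when it is updated
 * each time the position reaches half way between the last recorded
 * value and sync_max.
 */
static unsigned int count_completed_updates(uint64_t sync_max, uint64_t step)
{
	uint64_t completed = 0;
	uint64_t pos = 0;
	unsigned int updates = 0;

	while (pos < sync_max) {
		pos += step;
		if (pos > sync_max)
			pos = sync_max;

		/* Half way (or further) from the last update to sync_max? */
		if (pos > completed &&
		    pos >= completed + (sync_max - completed) / 2) {
			completed = pos;
			updates++;
		}
	}
	return updates;
}
```

With sync_max fixed, the gaps between updates halve each time, so the updates cluster just before sync_max while their total number stays small.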

Signed-off-by: NeilBrown <neilb@suse.de>


# acb180b0 14-Apr-2009 NeilBrown <neilb@suse.de>

md: improve usefulness and accuracy of sysfs file md/sync_completed.

The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.

We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).

So:
- make curr_resync_completed be uptodate a little more often,
particularly when raid5 reshape updates status in the metadata
- report curr_resync_completed in the sysfs file
- allow poll/select to report all updates to md/sync_completed.

This makes sync_completed usable by any external metadata
handler that wants to record this status information in its metadata.

Signed-off-by: NeilBrown <neilb@suse.de>


# 6d56e278 13-Apr-2009 NeilBrown <neilb@suse.de>

md: allow setting newly added device to 'in_sync' via sysfs.

When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.

So add that option.

Signed-off-by: NeilBrown <neilb@suse.de>


# d1a7c503 30-Mar-2009 NeilBrown <neilb@suse.de>

md: don't display meaningless values in sysfs files resync_start and sync_speed

When no resync is happening, both of these files currently have
meaningless values (in slightly different ways).
Change them to "none" in that case.

Signed-off-by: NeilBrown <neilb@suse.de>


# cea9c228 30-Mar-2009 NeilBrown <neilb@suse.de>

md: add explicit method to signal the end of a reshape.

Currently raid5 (the only module that supports restriping)
notices that the reshape has finished by sync_request being
given a large value, and handles any cleanup then.

This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.

The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.

The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.

This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop an array
while a reshape was happening.

Signed-off-by: NeilBrown <neilb@suse.de>


# b522adcd 30-Mar-2009 Dan Williams <dan.j.williams@intel.com>

md: 'array_size' sysfs attribute

Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
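
The two overflow checks can be sketched in user space as below. This is an approximation under stated assumptions, not the kernel function: blocks_to_sectors_checked() is a hypothetical name, strtoull() stands in for the kernel's strict parser, and sector_t is assumed to be 64 bits wide.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t sector_t;	/* assumption: 64-bit sectors */

/*
 * Convert a decimal count of 1K blocks into 512-byte sectors, failing
 * with -EINVAL on malformed input or on either kind of overflow.
 */
static int blocks_to_sectors_checked(const char *buf, sector_t *sectors)
{
	char *end;
	unsigned long long blocks;
	sector_t result;

	if (*buf < '0' || *buf > '9')
		return -EINVAL;

	errno = 0;
	blocks = strtoull(buf, &end, 10);
	if (errno || (*end != '\0' && *end != '\n'))
		return -EINVAL;

	/* Doubling overflows if the top bit is already set. */
	if (blocks & (1ULL << (8 * sizeof(blocks) - 1)))
		return -EINVAL;

	/* Catches truncation when sector_t is narrower than unsigned long long. */
	result = (sector_t)(blocks * 2);
	if (result != blocks * 2)
		return -EINVAL;

	*sectors = result;
	return 0;
}
```

With a 64-bit sector_t the second check never fires; it matters on configurations where sector_t is only 32 bits.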

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 1f403624 30-Mar-2009 Dan Williams <dan.j.williams@intel.com>

md: centralize ->array_sectors modifications

Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# b3546035 30-Mar-2009 NeilBrown <neilb@suse.de>

md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5.

2-drive raid5's aren't very interesting. But if you are converting
a raid1 into a raid5, you will at least temporarily have one. And
that is a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.

layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.

Signed-off-by: NeilBrown <neilb@suse.de>


# 245f46c2 30-Mar-2009 NeilBrown <neilb@suse.de>

md: add ->takeover method to support changing the personality managing an array

Implement this for RAID6 to be able to 'takeover' a RAID5 array. The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.

Signed-off-by: NeilBrown <neilb@suse.de>


# 409c57f3 30-Mar-2009 NeilBrown <neilb@suse.de>

md: enable suspend/resume of md devices.

To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are
already 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>


# e0cf8f04 30-Mar-2009 NeilBrown <neilb@suse.de>

md: md_unregister_thread should cope with being passed NULL

Mostly md_unregister_thread is only called when we know that the
thread is not NULL, but sometimes we need to check first. It is safer
to put the check inside md_unregister_thread itself.

Signed-off-by: NeilBrown <neilb@suse.de>


# 34817e8c 30-Mar-2009 NeilBrown <neilb@suse.de>

md: make sure new_level, new_chunksize, new_layout always have sensible values.

When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.

So fix this up.

A subsequent patch will BUG_ON if these things aren't consistent.


Signed-off-by: NeilBrown <neilb@suse.de>


# dd8ac336 30-Mar-2009 Andre Noll <maan@systemlinux.org>

md: Represent raid device size in sectors.

This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.

All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 58c0fed4 30-Mar-2009 Andre Noll <maan@systemlinux.org>

md: Make mddev->size sector-based.

This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 575a80fa 30-Mar-2009 NeilBrown <neilb@suse.de>

md: be more consistent about setting WriteMostly flag when adding a drive to an array

When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from: the metadata on the disk or the
flags passed through the IOCTL.

For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).

This makes it awkward to clear, and is at best inconsistent.

As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.
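
The change in policy can be summarised with two tiny predicates. Both function names are hypothetical and the booleans stand in for the flag as read from the on-disk metadata and from the ioctl.

```c
#include <stdbool.h>

/* Old behaviour as described above: either source could set the flag. */
static bool write_mostly_old(bool on_disk, bool from_ioctl)
{
	return on_disk || from_ioctl;
}

/* New behaviour: only the value passed through the ioctl counts. */
static bool write_mostly_new(bool on_disk, bool from_ioctl)
{
	(void)on_disk;	/* deliberately ignored */
	return from_ioctl;
}
```

Under the old rule a flag stored on disk could not be cleared through the ioctl; under the new rule it can.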


Signed-off-by: NeilBrown <neilb@suse.de>


# 97e4f42d 30-Mar-2009 NeilBrown <neilb@suse.de>

md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash

Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much of the 'sync' effort after an
unclean shutdown.

One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed. And we need to know what has completed to record
how much is recovered. So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.

When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date. So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number. But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.
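
One way to express "pause 16 times during the resync" is to checkpoint whenever progress since the last checkpoint exceeds one sixteenth of the total. The sketch below assumes that formulation; checkpoint_due() and its parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical test: is a checkpoint due?
 *   curr:            current resync position in sectors
 *   last_checkpoint: position recorded at the previous checkpoint
 *   max_sectors:     total sectors to resync
 * A checkpoint is due once more than max_sectors / 16 sectors have been
 * processed since the last one.
 */
static bool checkpoint_due(uint64_t curr, uint64_t last_checkpoint,
			   uint64_t max_sectors)
{
	return curr > last_checkpoint &&
	       (curr - last_checkpoint) > (max_sectors >> 4);
}
```

Tying the interval to the array size keeps the number of pauses roughly constant regardless of how large the array is.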


Signed-off-by: NeilBrown <neilb@suse.de>


# 43b2e5d8 30-Mar-2009 NeilBrown <neilb@suse.de>

md: move md_k.h from include/linux/raid/ to drivers/md/

It really is nicer to keep related code together..

Signed-off-by: NeilBrown <neilb@suse.de>


# bff61975 30-Mar-2009 NeilBrown <neilb@suse.de>

md: move lots of #include lines out of .h files and into .c

This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.

Signed-off-by: NeilBrown <neilb@suse.de>


# 8b2b5c21 30-Mar-2009 NeilBrown <neilb@suse.de>

md: move LEVEL_* definition from md_k.h to md_u.h

.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.

Lastly move mdp_major in. It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.

Signed-off-by: NeilBrown <neilb@suse.de>


# ef740c37 30-Mar-2009 Christoph Hellwig <hch@lst.de>

md: move headers out of include/linux/raid/

Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3dbd8c2e 30-Mar-2009 Christoph Hellwig <hch@lst.de>

md: stop defining MAJOR_NR

MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3f9d99c1 30-Mar-2009 Martin K. Petersen <martin.petersen@oracle.com>

MD data integrity support

md: Add support for data integrity to MD

If all subdevices support the same protection format the MD device is
flagged as integrity capable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# eea1bf38 30-Mar-2009 NeilBrown <neilb@suse.de>

md: Fix is_mddev_idle test (again).

There are two problems with is_mddev_idle.

1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
rest are 'long'.
So if sync_io were to wrap on a 64bit host, the value of
curr_events would go very negative suddenly, and take a very
long time to return to positive.

So do all calculations as 'int'. That gives us plenty of precision
for what we need.

2/ To initialise rdev->last_events we simply call is_mddev_idle, on
the assumption that it will make sure that last_events is in a
suitable range. It used to do this, but now it does not.
So now we need to be more explicit about initialisation.
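
The benefit of doing the arithmetic at one width can be shown with a small user-space sketch. It is not the kernel code: events_delta() is a hypothetical helper, and it relies on the usual two's-complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/*
 * Difference between two wrapping 32-bit event counters. The
 * subtraction is done in unsigned arithmetic (which wraps without
 * undefined behaviour) and then reinterpreted as signed, so a counter
 * wrap yields a small delta rather than a huge jump.
 */
static int32_t events_delta(uint32_t curr_events, uint32_t last_events)
{
	return (int32_t)(curr_events - last_events);
}
```

If one operand were widened to 64 bits before the subtraction, a wrap of the 32-bit counter would appear as a change of about four billion events.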

Signed-off-by: NeilBrown <neilb@suse.de>


# 5fd3a17e 04-Mar-2009 Dan Williams <dan.j.williams@intel.com>

md: fix deadlock when stopping arrays

Resolve a deadlock when stopping redundant arrays, i.e. ones that
require a call to sysfs_remove_group when shutdown. The deadlock is
summarized below:

Thread1                 Thread2
-------                 -------
read sysfs attribute    stop array
                        take mddev lock
                        sysfs_remove_group
sysfs_get_active
wait for mddev lock
                        wait for active

Sysrq-w:
--------
mdmon S 00000017 2212 4163 1
f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c
c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc
00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd
Call Trace:
[<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58
[<c06ef451>] __mutex_lock_common+0x1d9/0x338
[<c06ef451>] ? __mutex_lock_common+0x1d9/0x338
[<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a
[<c0634553>] ? mddev_lock+0x14/0x16
[<c0634553>] mddev_lock+0x14/0x16
[<c0634eda>] md_attr_show+0x2a/0x49
[<c04e9997>] sysfs_read_file+0x93/0xf9
mdadm D 00000017 2812 4177 1
f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c
c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70
00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000
Call Trace:
[<c06eed1b>] schedule_timeout+0x1b/0x95
[<c06eed1b>] ? schedule_timeout+0x1b/0x95
[<c06eeb97>] ? wait_for_common+0x34/0xdc
[<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145
[<c044fbc2>] ? trace_hardirqs_on+0xb/0xd
[<c06eec03>] wait_for_common+0xa0/0xdc
[<c0428c7c>] ? default_wake_function+0x0/0x12
[<c06eeccc>] wait_for_completion+0x17/0x19
[<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1
[<c04e920e>] sysfs_hash_and_remove+0x42/0x55
[<c04eb4db>] sysfs_remove_group+0x57/0x86
[<c0638086>] do_md_stop+0x13a/0x499

This has been there for a while, but is easier to trigger now that mdmon
is closely watching sysfs.

Cc: <stable@kernel.org>
Reported-by: Jacek Danecki <jacek.danecki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 93dbb393 16-Feb-2009 Jens Axboe <jens.axboe@oracle.com>

block: fix bad definition of BIO_RW_SYNC

We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fec62ef4c3675621b9364a667954d4dd.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# de01dfad 06-Feb-2009 NeilBrown <neilb@suse.de>

md: Ensure an md array never has too many devices.

Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.

We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.

We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.

So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk', and add a new
test in 'analyze_sbs' which is called from 'do_md_run'.

This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.

Cc: stable@kernel.org

Signed-off-by: NeilBrown <neilb@suse.de>


# 4044ba58 08-Jan-2009 NeilBrown <neilb@suse.de>

md: don't retry recovery of raid1 that fails due to error on source drive.

If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.

So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.

Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.

Signed-off-by: NeilBrown <neilb@suse.de>


# efeb53c0 08-Jan-2009 NeilBrown <neilb@suse.de>

md: Allow md devices to be created by name.

Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.

Also, creating md devices by opening the device special file is a bit
awkward.

So this patch provides a new option for creating and naming devices.

Writing a name such as "md_home" to
/sys/module/md_mod/parameters/new_array
will cause an array with that name to be created. It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.

md devices that are created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.

The name of the array must start 'md_' to avoid conflict with
other devices.
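
The naming rule can be sketched as a small validator. This is an illustration under stated assumptions, not the handler added by the patch: check_array_name() is a hypothetical name, and requiring at least one character after the prefix is an assumption of this sketch.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical check for a name written to the new_array parameter.
 * Returns 0 if the name starts with "md_" and has something after the
 * prefix, -EINVAL otherwise.
 */
static int check_array_name(const char *val)
{
	size_t len = strlen(val);

	/* Ignore one trailing newline, as written by echo. */
	if (len > 0 && val[len - 1] == '\n')
		len--;

	if (len <= 3 || strncmp(val, "md_", 3) != 0)
		return -EINVAL;
	return 0;
}
```

Under this sketch "md_home" is accepted, while "home" and a bare "md_" are rejected.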

Signed-off-by: NeilBrown <neilb@suse.de>


# d3374825 08-Jan-2009 NeilBrown <neilb@suse.de>

md: make devices disappear when they are no longer needed.

Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, thus a circular reference.

If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.

So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open

there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.

Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.

So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.

2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.

Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop of the device being opened
and closed, then re-opened due to the 'ADD' event from the first open,
and then closed and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do is
cause an inactive md device to be left in existence (it can easily be
removed).

As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().



Signed-off-by: NeilBrown <neilb@suse.de>


# a21d1504 08-Jan-2009 NeilBrown <neilb@suse.de>

md: centralise all freeing of an 'mddev' in 'md_free'

md_free is the .release handler for the md kobj_type.
So it makes sense to release all the objects referenced by
the mddev in there, rather than just prior to calling kobject_put
for what we think is the last time.

Signed-off-by: NeilBrown <neilb@suse.de>


# 8b765398 08-Jan-2009 NeilBrown <neilb@suse.de>

md: move allocation of ->queue from mddev_find to md_probe

It is more balanced to just do simple initialisation in mddev_find,
which allocates and links a new md device, and leave all the
more sophisticated allocation to md_probe (which calls mddev_find).
md_probe already allocated the gendisk. It should allocate the
queue too.

Signed-off-by: NeilBrown <neilb@suse.de>


# cd2ac932 08-Jan-2009 Cheng Renquan <crquan@gmail.com>

md: need another print_sb for mdp_superblock_1

md_print_devices is called in two code paths: MD_BUG(...), and md_ioctl
with PRINT_RAID_DEBUG. It will dump out information on all md devices
in use.

However, it wrongly processed two types of superblock in one:

The header file <linux/raid/md_p.h> has defined two types of superblock,
struct mdp_superblock_s (typedefed with mdp_super_t) according to md with
metadata 0.90, and struct mdp_superblock_1 according to md with metadata
1.0 and later,

These two types of superblock are very different,

The md_print_devices code processed them both in mdp_super_t, that would
lead to a wrong information dump like:

[ 6742.345877]
[ 6742.345887] md: **********************************
[ 6742.345890] md: * <COMPLETE RAID STATE PRINTOUT> *
[ 6742.345892] md: **********************************
[ 6742.345896] md1: <ram7><ram6><ram5><ram4>
[ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
[ 6742.345909] md: rdev superblock:
[ 6742.345914] md: SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d
[ 6742.345918] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
[ 6742.345922] md: UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
[ 6742.345924] D 0: DISK<N:0,(1,8),R:0,S:6>
[ 6742.345930] D 1: DISK<N:1,(1,10),R:1,S:6>
[ 6742.345933] D 2: DISK<N:2,(1,12),R:2,S:6>
[ 6742.345937] D 3: DISK<N:3,(1,14),R:3,S:6>
[ 6742.345942] md: THIS: DISK<N:3,(1,14),R:3,S:6>
...
[ 6742.346058] md0: <ram3><ram2><ram1><ram0>
[ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
[ 6742.346070] md: rdev superblock:
[ 6742.346073] md: SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c
[ 6742.346077] md: L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
[ 6742.346081] md: UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
[ 6742.346084] D 0: DISK<N:-1,(-1,-1),R:-1,S:-1>
[ 6742.346089] D 1: DISK<N:-1,(-1,-1),R:-1,S:-1>
[ 6742.346092] D 2: DISK<N:-1,(-1,-1),R:-1,S:-1>
[ 6742.346096] D 3: DISK<N:-1,(-1,-1),R:-1,S:-1>
[ 6742.346102] md: THIS: DISK<N:0,(0,0),R:0,S:0>
...
[ 6742.346219] md: **********************************
[ 6742.346221]

Here md1 is metadata 0.90.0, and md0 is metadata 1.2

After some more code to distinguish these two types of superblock, in this patch,

it will generate dump information like:

[ 7906.755790]
[ 7906.755799] md: **********************************
[ 7906.755802] md: * <COMPLETE RAID STATE PRINTOUT> *
[ 7906.755804] md: **********************************
[ 7906.755808] md1: <ram7><ram6><ram5><ram4>
[ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
[ 7906.755821] md: rdev superblock (MJ:0):
[ 7906.755826] md: SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3
[ 7906.755830] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
[ 7906.755834] md: UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
[ 7906.755836] D 0: DISK<N:0,(1,8),R:0,S:6>
[ 7906.755842] D 1: DISK<N:1,(1,10),R:1,S:6>
[ 7906.755845] D 2: DISK<N:2,(1,12),R:2,S:6>
[ 7906.755849] D 3: DISK<N:3,(1,14),R:3,S:6>
[ 7906.755855] md: THIS: DISK<N:3,(1,14),R:3,S:6>
...
[ 7906.755972] md0: <ram3><ram2><ram1><ram0>
[ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
[ 7906.755984] md: rdev superblock (MJ:1):
[ 7906.755989] md: SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd>
[ 7906.755990] md: Name: "DG5:0" CT:1226410480
[ 7906.755998] md: L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
[ 7906.755999] md: Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
[ 7906.756001] md: (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
[ 7906.756003] md: (MaxDev:384)
...
[ 7906.756113] md: **********************************
[ 7906.756116]

this md0 (metadata 1.2) information dumping is exactly according to struct
mdp_superblock_1.

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 159ec1fc 08-Jan-2009 Cheng Renquan <crquan@gmail.com>

md: use list_for_each_entry macro directly

The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.

But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.

In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0c3573f1 08-Jan-2009 NeilBrown <neilb@suse.de>

md: use sysfs_notify_dirent to notify changes to md/sync_action.

There is no compelling need for this, but sysfs_notify_dirent is a
nicer interface and the change is good for consistency.

Signed-off-by: NeilBrown <neilb@suse.de>


# cb3ac42b 05-Nov-2008 NeilBrown <neilb@suse.de>

md: revert the recent addition of a call to the BLKRRPART ioctl.

It turns out that it is only safe to call blkdev_ioctl when the device
is actually open (as ->bd_disk is set to NULL on last close). And it
is quite possible for do_md_stop to be called when the device is not
open. So discard the call to blkdev_ioctl(BLKRRPART) which was
added in
commit 934d9c23b4c7e31840a895ba4b7e88d6413c81f3

It is just as easy to call this ioctl from userspace when needed (on
mdadm -S) so leave it out of the kernel

Signed-off-by: NeilBrown <neilb@suse.de>


# 934d9c23 28-Oct-2008 NeilBrown <neilb@suse.de>

md: destroy partitions and notify udev when md array is stopped.

md arrays are not currently destroyed when they are stopped - they
remain in /sys/block. Last time I tried this I tripped over locking
too much.

A consequence of this is that udev doesn't remove anything from /dev.
This is rather ugly.

As an interim measure until proper device removal can be achieved,
make sure all partitions are removed using the BLKRRPART ioctl, and
send a KOBJ_CHANGE when an md array is stopped.

Signed-off-by: NeilBrown <neilb@suse.de>


# 9a1c3542 22-Feb-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] pass fmode_t to blkdev_put()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a39907fa 02-Mar-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] switch md

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d4430d62 02-Mar-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] beginning of methods conversion

To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.

Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.

New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 92850bbd 20-Oct-2008 NeilBrown <neilb@suse.de>

md: allow extended partitions on md devices.

The new extended partition support provides a much nicer way
to have partitions on md devices than the 'mdp' alternate major.
We cannot really get rid of 'mdp' at this time, but we can
enable extended partitions as that will probably make life
easier for sysadmins.

Signed-off-by: NeilBrown <neilb@suse.de>


# 3c0ee63a 20-Oct-2008 NeilBrown <neilb@suse.de>

md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state

The 'state' file for a device reports, for example, when the device
has failed. Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory. sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.

Signed-off-by: NeilBrown <neilb@suse.de>


# b62b7590 20-Oct-2008 NeilBrown <neilb@suse.de>

md: use sysfs_notify_dirent to notify changes to md/array_state

Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.

Signed-off-by: NeilBrown <neilb@suse.de>


# a65e5d78 09-Jul-2008 Johannes Berg <johannes@sipsolutions.net>

remove CONFIG_KMOD from drivers

Straightforward conversions to CONFIG_MODULES; many drivers
include <linux/kmod.h> conditionally and then don't have any
other conditional code so remove it from those.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: video4linux-list@redhat.com
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-ppp@vger.kernel.org
Cc: dm-devel@redhat.com
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 97ce0a7f 24-Sep-2008 Dan Williams <dan.j.williams@gmail.com>

md: fix input truncation in safe_delay_store()

safe_delay_store() currently truncates the last character of input since
it tells strlcpy that the buffer can only hold 'len' characters, off by
one. sysfs already null terminates the buffer, so just increase the
last argument to strlcpy.
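
The off-by-one described above can be reproduced in userspace. The sketch
below is an illustration only: sketch_strlcpy() is a hypothetical stand-in
written for this example (it is not the kernel's strlcpy), and the two copy
helpers are invented names that mimic the "before" and "after" size
arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for strlcpy(): copies at most size - 1 bytes and
 * always NUL-terminates when size > 0. Returns strlen(src).
 */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);
	if (size > 0) {
		size_t n = (len >= size) ? size - 1 : len;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* "Before": the size argument is 'len', so only len - 1 characters fit. */
void copy_off_by_one(char *dst, const char *src, size_t len)
{
	sketch_strlcpy(dst, src, len);
}

/* "After": the size argument is 'len + 1', leaving room for the NUL. */
void copy_fixed(char *dst, const char *src, size_t len)
{
	sketch_strlcpy(dst, src, len + 1);
}
```

Both helpers assume the destination buffer holds at least len + 1 bytes.
With a five-character input, the first helper keeps four characters and
the second keeps all five.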

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 25570727 14-Oct-2008 Stephen Rothwell <sfr@canb.auug.org.au>

md: build failure due to missing delay.h

Today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'

Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>


# 4bbf3771 12-Oct-2008 NeilBrown <neilb@suse.de>

md: Relax minimum size restrictions on chunk_size.

Currently, the 'chunk_size' of an array must be at least PAGE_SIZE.

This means that moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.

For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.

So remove the test on chunk_size from common code, and add it in just
the places where it is needed: raid10 and raid4/5/6.

Signed-off-by: NeilBrown <neilb@suse.de>


# d710e138 12-Oct-2008 NeilBrown <neilb@suse.de>

md: remove space after function name in declaration and call.

Having
function (args)
instead of
function(args)

makes it harder to search for calls of particular functions.
So remove all those spaces.

Signed-off-by: NeilBrown <neilb@suse.de>


# fb4d8c76 12-Oct-2008 NeilBrown <neilb@suse.de>

md: Remove unnecessary #includes, #defines, and function declarations.

A lot of cruft has gathered over the years. Time to remove it.

Signed-off-by: NeilBrown <neilb@suse.de>


# 80268ee9 12-Oct-2008 NeilBrown <neilb@suse.de>

md: Don't try to set an array to 'read-auto' if it is already in that state.

'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.

Calling do_md_stop to set the array readonly when it is already readonly
returns an error. So make sure not to do that.

Signed-off-by: NeilBrown <neilb@suse.de>


# ea43ddd8 12-Oct-2008 NeilBrown <neilb@suse.de>

md: Allow metadata_version to be updated for externally managed metadata.

For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.

Normally changes to metadata_version are not permitted while the array
is active. Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7d3c6f87 12-Oct-2008 Chris Webb <chris@arachsys.com>

md: Fix rdev_size_store with size == 0


Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.

This fixes a regression introduced by
commit d7027458d68b2f1752a28016dcf2ffd0a7e8f567

Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 074a7aca 25-Aug-2008 Tejun Heo <tj@kernel.org>

block: move stats from disk to part0

Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similarly. It handles part0 stats
automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_flight() is implemented which automatically updates
part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0762b8bd 25-Aug-2008 Tejun Heo <tj@kernel.org>

block: always set bdev->bd_part

Till now, bdev->bd_part is set only if the bdev was for parts other
than part0. This patch makes bdev->bd_part always set so that code
paths don't have to differentiate common handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# ed9e1982 25-Aug-2008 Tejun Heo <tj@kernel.org>

block: implement and use {disk|part}_to_dev()

Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev. To make sure no
user is left behind, rename generic devices fields to __dev.

This is in preparation of unifying partition 0 handling with other
partitions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 9744197c 18-Sep-2008 NeilBrown <neilb@suse.de>

md: Don't wait UNINTERRUPTIBLE for other resync to finish

When two md arrays share some block device (e.g. each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.

This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.

So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.

Signed-off-by: NeilBrown <neilb@suse.de>


# 271f5a9b 31-Aug-2008 NeilBrown <neilb@suse.de>

Remove invalidate_partition call from do_md_stop.

When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).

However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that update requires taking the mddev
lock - which is already held.

This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.

All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev. Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.

So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.

So remove it.

This patch possibly addresses the regression recorded in
http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
http://bugzilla.kernel.org/show_bug.cgi?id=11452

though it isn't yet clear how it ever worked.


Signed-off-by: NeilBrown <neilb@suse.de>


# 56ac36d7 07-Aug-2008 Dan Williams <dan.j.williams@intel.com>

md: cancel check/repair requests when recovery is needed

If a 'repair' is requested when an array is in a position to 'recover' raid1
will perform the repair while md believes a recovery is happening. Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# c89a8eee 04-Aug-2008 NeilBrown <neilb@suse.de>

Allow faulty devices to be removed from a readonly array.

Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device. Then it can be removed
by a request from user space.

The first step is currently not performed for read-only arrays,
so the second step can never succeed.

So allow readonly arrays to remove failed devices (which aren't
blocked).

Signed-off-by: NeilBrown <neilb@suse.de>


# dba034ee 04-Aug-2008 NeilBrown <neilb@suse.de>

Fail safely when trying to grow an array with a write-intent bitmap.

We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.

For now, simply reject any request to change the size of an array
which has a bitmap. mdadm can remove the bitmap and add a new one
after the array has changed size.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b25000b 04-Aug-2008 NeilBrown <neilb@suse.de>

Restore force switch of md array to readonly at reboot time.

A recent patch allowed do_md_stop to know whether it was being called
via an ioctl or not, and thus whether to allow for an extra open file
descriptor when checking if it is in use.
This broke the switch to readonly performed by the shutdown notifier,
which needs to work even when the array is still (apparently) active
(as md doesn't get told when the filesystem becomes readonly).

So restore this feature by pretending that there can be lots of
file descriptors open, but we still want do_md_stop to switch to
readonly.

Signed-off-by: NeilBrown <neilb@suse.de>


# 19052c0e 04-Aug-2008 NeilBrown <neilb@suse.de>

Make writes to md/safe_mode_delay immediately effective.

If we reduce the 'safe_mode_delay', it could still wait for the old
delay to completely expire before doing anything about safe_mode.
Thus the effect of the change is delayed.

To make the effect more immediate, run the timeout function
immediately if the delay was reduced. This may cause it to run
slightly earlier than required, but that is the safer option.

Signed-off-by: NeilBrown <neilb@suse.de>


# e5427135 29-Jul-2008 Dan Williams <dan.j.williams@intel.com>

md: do not count blocked devices as spares

remove_and_add_spares() assumes that failed devices have been hot-removed
from the array. Removal is skipped in the 'blocked' case so do not count a
device in this state as 'spare'.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# d8e64406 23-Jul-2008 Dan Williams <dan.j.williams@intel.com>

md: delay notification of 'active_idle' to the recovery thread

sysfs_notify might sleep, so do not call it from md_safemode_timeout.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 4b80991c 21-Jul-2008 NeilBrown <neilb@suse.de>

md: Protect access to mddev->disks list using RCU

All modifications and most access to the mddev->disks list are made
under the reconfig_mutex lock. However there are three places where
the list is walked without any locking. If a reconfig happens at this
time, havoc (and oops) can ensue.

So use RCU to protect these accesses:
- wrap them in rcu_read_{,un}lock()
- use list_for_each_entry_rcu
- add to the list with list_add_rcu
- delete from the list with list_del_rcu
- delay the 'free' with call_rcu rather than schedule_work

Note that export_rdev did a list_del_init on this list. In almost all
cases the entry was not in the list anymore so it was a no-op and so
safe. It is no longer safe as after list_del_rcu we may not touch
the list_head.
An audit shows that export_rdev is called:
- after unbind_rdev_from_array, in which case the delete has
already been done,
- after bind_rdev_to_array fails, in which case the delete isn't needed.
- before the device has been put on a list at all (e.g. in
add_new_disk where reading the superblock fails).
- and in autorun devices after a failure when the device is on a
different list.

So remove the list_del_init call from export_rdev, and add it back
immediately before the call to export_rdev for that last case.

Note also that ->same_set is sometimes used for lists other than
mddev->list (e.g. candidates). In these cases rcu is not needed.

Signed-off-by: NeilBrown <neilb@suse.de>


# f2ea68cf 21-Jul-2008 NeilBrown <neilb@suse.de>

md: only count actual openers as access which prevent a 'stop'

Open isn't the only thing that increments ->active. e.g. reading
/proc/mdstat will increment it briefly. So to avoid false positives
in testing for concurrent access, introduce a new counter that counts
just the number of times the md device is open.

Signed-off-by: NeilBrown <neilb@suse.de>


# f233ea5c 21-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Make mddev->array_size sector-based.

This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 15f4a5fd 20-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Make super_type->rdev_size_change() take sector-based sizes.

Also, change the type of the size parameter from unsigned long long to
sector_t and rename it to num_sectors.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# d07bd3bc 20-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Fix check for overlapping devices.

The checks in overlaps() expect all parameters either in block-based
or sector-based quantities. However, its single caller passes two
rdev->data_offset arguments as well as two rdev->size arguments, the
former being sector counts while the latter are measured in 1K blocks.

This could cause rdev_size_store() to accept an invalid size from user
space. Fix it by passing only sector-based quantities to overlaps().
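
To see why mixed units defeat the check, consider the following userspace
sketch. It assumes overlaps() is a plain interval-intersection test on
start/length pairs (the log does not show its body), and it uses a local
sector_t typedef and a hypothetical blocks_to_sectors() helper.

```c
#include <assert.h>

typedef unsigned long long sector_t;	/* stand-in for the kernel type */

/* Return 1 if [s1, s1 + l1) and [s2, s2 + l2) intersect, else 0. */
int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
{
	if (s1 + l1 <= s2)
		return 0;
	if (s2 + l2 <= s1)
		return 0;
	return 1;
}

/* Hypothetical helper: convert a size in 1K blocks to 512-byte sectors. */
sector_t blocks_to_sectors(sector_t blocks)
{
	assert(blocks <= ~0ULL / 2);	/* guard against overflow */
	return blocks * 2;
}
```

Take two devices on one disk whose data starts at sectors 0 and 1500, each
1000 blocks (2000 sectors) long. Passing the block counts as lengths
reports no overlap, while passing sector counts correctly reports one.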

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# d7027458 11-Jul-2008 Neil Brown <neilb@suse.de>

md: Tidy up rdev_size_store a bit:

- use strict_strtoull in place of simple_strtoull
- use my_mddev in place of rdev->mddev (they have the same value)
and more significantly,
- don't adjust mddev->size to fit, rather reject changes which make
rdev->size smaller than mddev->size

Adjusting mddev->size is a hangover from bind_rdev_to_array which
does a similar thing. But it really is a better design to insist that
mddev->size is set as required, then the rdev->sizes are set to allow
for that. The previous way invites confusion.

Signed-off-by: NeilBrown <neilb@suse.de>


# 0f420358 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Turn rdev->sb_offset into a sector-based quantity.

Rename it to sb_start to make sure all users have been converted.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# b73df2d3 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Make calc_dev_sboffset() return a sector count.

As BLOCK_SIZE_BITS is 10 and

MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x),

the return value of calc_dev_sboffset() doubles. Fix up all three
callers accordingly.
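
The identity above can be checked numerically. This sketch assumes the
two macros round the size down to a multiple of a reserved area and then
subtract that area, and it assumes a 64K reserved area; neither detail is
stated in this log, so the constants and function names below are
hypothetical.

```c
#include <assert.h>

typedef unsigned long long sector_t;	/* stand-in for the kernel type */

/* Assumed reserved area: 64K, expressed in both units. */
#define SKETCH_RESERVED_BYTES	(64 * 1024)
#define SKETCH_RESERVED_SECTORS	(SKETCH_RESERVED_BYTES / 512)	/* 128 */
#define SKETCH_RESERVED_BLOCKS	(SKETCH_RESERVED_BYTES / 1024)	/* 64 */

/* Round down to the reserved granularity, then subtract it (sectors). */
sector_t sketch_new_size_sectors(sector_t x)
{
	assert(x >= SKETCH_RESERVED_SECTORS);
	return (x & ~(sector_t)(SKETCH_RESERVED_SECTORS - 1)) -
		SKETCH_RESERVED_SECTORS;
}

/* Same computation in 1K blocks. */
sector_t sketch_new_size_blocks(sector_t x)
{
	assert(x >= SKETCH_RESERVED_BLOCKS);
	return (x & ~(sector_t)(SKETCH_RESERVED_BLOCKS - 1)) -
		SKETCH_RESERVED_BLOCKS;
}
```

For a 1000-block device the block form yields 896 and the sector form,
given 2000 sectors, yields 1792, which is exactly double.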

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# e7debaa4 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Replace calc_dev_size() by calc_num_sectors().

Number of sectors is the preferred unit for sizes of raid devices,
so change calc_dev_size() so that it returns this unit instead of
the number of 1K blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# d71f9f88 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Make update_size() take the number of sectors.

Changing the internal representations of sizes of raid devices
from 1K blocks to sector counts (512B units) is desirable because
it allows us to get rid of many divisions/multiplications and unnecessary
casts that are present in the current code.

This patch is a first step in this direction. It replaces the old
1K-based "size" argument of update_size() by "num_sectors" and
fixes up its two callers.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# df5b20cf 11-Jul-2008 Neil Brown <neilb@suse.de>

md: Better control of when do_md_stop is allowed to stop the array.

do_md_stop checks the number of active users before allowing the array
to be stopped.
Two problems:
1/ it assumes the request is coming through an open file descriptor
(via ioctl) so it allows for that. This is not always the case.
2/ it doesn't do the check if the array hasn't been activated.
This is not good for cases when we use an inactive array to hold
some devices in a container.

Signed-off-by: Neil Brown <neilb@suse.de>


# 26ef379f 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: get_disk_info(): Don't convert between signed and unsigned and back.

The current code copies a signed int from user space, converts it to
unsigned and passes the unsigned value to find_rdev_nr() which expects
a signed value. Simply pass the signed value from user space directly.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 80fab1d7 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Simplify restart_array().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# ebc24337 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: alloc_disk_sb(): Return proper error value.

If alloc_page() fails, ENOMEM is a more suitable error value
than EINVAL.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# ce0c8e05 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Simplify sb_equal().

The only caller of sb_equal() tests the return value against
zero, so it's OK to return the negated return value of memcmp().
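
A minimal sketch of the idiom, using a hypothetical regions_equal() helper
rather than the real sb_equal() (whose body is not shown in this log):

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 when the two regions match and 0
 * otherwise, by negating memcmp()'s "0 means equal" result.
 */
int regions_equal(const void *a, const void *b, size_t n)
{
	assert(a != NULL && b != NULL);
	return !memcmp(a, b, n);
}
```

Because the caller only tests the result against zero, collapsing the
comparison into a single negated memcmp() loses no information.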

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 05710466 11-Jul-2008 Andre Noll <maan@systemlinux.org>

md: Simplify uuid_equal().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 35020f1a 23-Mar-2008 Andre Noll <maan@systemlinux.org>

md: sb_equal(): Fix misleading printk.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 7f6ce769 23-Mar-2008 Andre Noll <maan@systemlinux.org>

md: Fix a typo in the comment to cmd_match().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 910d8cb3 25-Mar-2008 Andre Noll <maan@systemlinux.org>

md: Fix typo in array_state comment.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 9687a60c 25-Mar-2008 Andre Noll <maan@systemlinux.org>

md: sync_speed_show(): Trivial cleanups.

- Remove superfluous parentheses.
- Make format string match the type of the variable that is printed.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 13e53df3 25-Mar-2008 Andre Noll <maan@systemlinux.org>

md: do_md_run(): Fix misleading error message.

In case pers->run() succeeds but creating the bitmap fails, we
print an error message stating that pers->run() has failed.

Print this message only if pers->run() really failed.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# 2f9618ce 25-Apr-2008 Andre Noll <maan@systemlinux.org>

md: md_getgeo(): Move comment to proper position.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# bb57fc64 25-Apr-2008 Andre Noll <maan@systemlinux.org>

md: md_ioctl(): Fix misleading indentation.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>


# b5470dc5 27-Jun-2008 Dan Williams <dan.j.williams@intel.com>

md: resolve external metadata handling deadlock in md_allow_write

md_allow_write() marks the metadata dirty while holding mddev->lock and then
waits for the write to complete. For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.

Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN. The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.

md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.

[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 0cd17fec 27-Jun-2008 Chris Webb <chris@arachsys.com>

Support changing rdev size on running arrays.

From: Chris Webb <chris@arachsys.com>

Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.

Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: Neil Brown <neilb@suse.de>


# 52664732 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Make sure all changes to md/dev-XX/state are notified

The important state change happens during an interrupt
in md_error. So just set a flag there and call sysfs_notify
later in process context.

Signed-off-by: Neil Brown <neilb@suse.de>


# a99ac971 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Make sure all changes to md/degraded are notified.

When a device fails, when a spare is activated, when
an array is reshaped, or when an array is started,
the extent to which the array is degraded can change.

Signed-off-by: Neil Brown <neilb@suse.de>


# 72a23c21 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Make sure all changes to md/sync_action are notified.

When the 'resync' thread starts or stops, when we explicitly
set sync_action, or when we determine that there is definitely nothing
to do, we notify sync_action.

To stop "sync_action" from occasionally showing the wrong value,
we introduce a new flag - MD_RECOVERY_RECOVER - to say that a
recovery is probably needed or happening, and we make sure
that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.

Signed-off-by: Neil Brown <neilb@suse.de>


# 0fd62b86 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Make sure all changes to md/array_state are notified.

Changes in md/array_state could be of interest to a monitoring
program. So make sure all changes trigger a notification.

Exceptions:
changing active_idle to active is not reported because it
is frequent and not interesting.
changing active to active_idle is only reported on arrays
with externally managed metadata, as it is not interesting
otherwise.

Signed-off-by: Neil Brown <neilb@suse.de>


# c7d0c941 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Don't reject HOT_REMOVE_DISK request for an array that is not yet started.

There is really no need for this test here, and there are valid
cases for selectively removing devices from an array that
is not actually active.

Signed-off-by: Neil Brown <neilb@suse.de>


# 199050ea 27-Jun-2008 Neil Brown <neilb@notabene.brown>

rationalise return value for ->hot_add_disk method.

For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown <neilb@suse.de>
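
Note: a rough illustration of the convention change described above, written
as a userspace model rather than kernel code. The helper names are
hypothetical, and EINVAL is only a placeholder for whatever error a real
personality would report.

```python
import errno


def legacy_to_errno(legacy_ret):
    """Map the old "1 on success, 0 on failure" result to "0 or -errno"."""
    # EINVAL is an assumed stand-in; the commit does not say which errno is used.
    return 0 if legacy_ret == 1 else -errno.EINVAL


def describe(ret):
    """Model of a call site after the change: success is tested as ret == 0."""
    if ret == 0:
        return "added"
    return "failed (%s)" % errno.errorcode[-ret]
```

The point of the sketch is that every call site can share one test
(ret == 0) once all personalities follow the same convention.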


# 6c2fce2e 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Support adding a spare to a live md array with external metadata.

i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill a vacant slot in an md array.

Signed-off-by: Neil Brown <neilb@suse.de>


# 8ed0a521 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Enable setting of 'offset' and 'size' of a hot-added spare.

offset_store and rdev_size_store allow control of the region of a
device which is to be used in an md/raid array.
They only allow these values to be set when an array is being assembled,
as changing them on an active array could be dangerous.
However when adding a spare device to an array, we might need to
set the offset and size before starting recovery. So allow
these values to be set also if "->raid_disk < 0" which indicates that
the device is still a spare.

Signed-off-by: Neil Brown <neilb@suse.de>


# 1a0fd497 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Don't try to make md arrays dirty if that is not meaningful.

Array personalities such as 'raid0' and 'linear' have no redundancy,
and so marking them as 'clean' or 'dirty' is not meaningful.
So always allow write requests without requiring a superblock update.

Such array types are detected by ->sync_request being NULL. If it is
not possible to send a sync request we don't need a 'dirty' flag because
all a dirty flag does is trigger some sync_requests.

Signed-off-by: Neil Brown <neilb@suse.de>


# f48ed538 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Close race in md_probe

There is a possible race in md_probe. If two threads call md_probe
for the same device, then one could exit (having checked that
->gendisk exists) before the other has called kobject_init_and_add,
thus returning an incomplete kobj which will cause problems when
we try to add children to it.

So extend the range of protection of disks_mutex slightly to
avoid this possibility.

Signed-off-by: Neil Brown <neilb@suse.de>


# 5e96ee65 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Allow setting start point for requested check/repair

This makes it possible to just resync a small part of an array.
e.g. if a drive reports that it has questionable sectors,
a 'repair' of just the region covering those sectors will
cause them to be read and, if there is an error, re-written
with correct data.

Signed-off-by: Neil Brown <neilb@suse.de>


# 9bbbca3a 27-Jun-2008 Neil Brown <neilb@notabene.brown>

Fix error paths if md_probe fails.

md_probe can fail (e.g. alloc_disk could fail) without
returning an error (as it always returns NULL).
So when we call mddev_find immediately afterwards, we need
to check that md_probe actually succeeded. This means checking
that mddev->gendisk is non-NULL.

cc: <stable@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>


# a6d8113a 05-Jun-2008 Dan Williams <dan.j.williams@intel.com>

md: fix uninitialized use of mddev->recovery_wait

If an array was created with --assume-clean we will oops when trying to
set ->resync_max.

Fix this by initializing ->recovery_wait in mddev_find.

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dfc70645 23-May-2008 NeilBrown <neilb@suse.de>

md: restart recovery cleanly after device failure.

When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them which loses important state
information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.

Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90b08710 23-May-2008 Bernd Schubert <bs@q-leap.de>

md: allow parallel resync of md-devices.

In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed. In
these cases there is nothing to be gained by serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.

Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f54b0e9 23-May-2008 Dan Williams <dan.j.williams@intel.com>

md: notify userspace on 'stop' events

This additional notification to 'array_state' is needed to allow the
monitor application to learn about stop events via sysfs. The
sysfs_notify("sync_action") call that comes at the end of do_md_stop()
(via md_new_event) is insufficient since the 'sync_action' attribute has
been removed by this point.

(Seems like a sysfs-notify-on-removal patch is a better fix. Currently
removal updates the event count but does not wake up waiters)

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09a44cc1 23-May-2008 NeilBrown <neilb@suse.de>

md: notify userspace on 'write-pending' changes to array_state

When an array enters write pending, 'array_state' changes, so we must be
sure to sysfs_notify.

Also, when waiting for user-space to acknowledge 'write-pending' by
marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
be cleared as that might not happen. So explicitly test for the bits that
we are really interested in.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6bcfd601 23-May-2008 Christoph Hellwig <hch@lst.de>

md: kill file_path wrapper

Kill the trivial and rather pointless file_path wrapper around d_path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6bfe0b49 30-Apr-2008 Dan Williams <dan.j.williams@intel.com>

md: support blocking writes to an array on device failure

Allows a userspace metadata handler to take action upon detecting a device
failure.

Based on an original patch by Neil Brown.

Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11e2ede0 30-Apr-2008 Dan Williams <dan.j.williams@intel.com>

md: prevent duplicates in bind_rdev_to_array

Found when trying to reassemble an active externally managed array. Without
this check we hit the more noisy "sysfs duplicate" warning in the later call
to kobject_add.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 242b363e 30-Apr-2008 Dan Williams <dan.j.williams@intel.com>

md: remove a stray command from a copy and paste error in resync_start_store

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 648b629e 30-Apr-2008 NeilBrown <neilb@suse.de>

md: fix up switching md arrays between read-only and read-write

When setting an array to 'readonly' or to 'active' via sysfs, we must make the
appropriate set_disk_ro call too.

Also when switching to "read_auto" (which is like readonly, but blocks on the
first write so that metadata can be marked 'dirty') we need to be more careful
about what state we are changing from.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31a59e34 30-Apr-2008 NeilBrown <neilb@suse.de>

md: fix 'safemode' handling for external metadata.

'safemode' relates to marking an array as 'clean' if there has been no write
traffic for a while (a couple of seconds), to reduce the chance of the array
being found dirty on reboot.

->safemode is set to '1' when there have been no writes for a while, and it
gets set to '0' when the superblock is updated with the 'clean' flag set.

This requires a few fixes for 'external' metadata:
- When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
- when we write to an array that has 'safemode' set (there must have been
some delay in updating the metadata), we need to clear safemode.
- Don't try to update external metadata in md_check_recovery for safemode
transitions - it won't work.

Also, don't try to support "immediate safe mode" (safemode==2) for external
metadata, it cannot really work (the safemode timeout can be set very low if
this is really needed).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d897dbf9 30-Apr-2008 NeilBrown <neilb@suse.de>

md: reinitialise more mddev fields in do_md_stop.

I keep finding problems where an mddev gets reused and some fields have a value
from a previous usage that confuses the new usage. So clear all fields that
could possibly need clearing when calling do_md_stop.

Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8377bc80 30-Apr-2008 NeilBrown <neilb@suse.de>

md: skip all metadata update processing when using external metadata.

All the metadata update processing for external metadata is done in user-space
or through the sysfs interfaces, so make "md_update_sb" a no-op in that case.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6a51830e 30-Apr-2008 Dan Williams <dan.j.williams@intel.com>

md: fix use after free when removing rdev via sysfs

rdev->mddev is no longer valid upon return from entry->store() when the
'remove' command is given.

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c7705f344 29-Apr-2008 Denis V. Lunev <den@openvz.org>

drivers: use non-racy method for proc entries creation (2)

Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
are set up before gluing PDE to main tree.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75ad23bc 29-Apr-2008 Nick Piggin <npiggin@suse.de>

block: make queue flags non-atomic

We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 9a7b2b0f 28-Apr-2008 Harvey Harrison <harvey.harrison@gmail.com>

md: fix integer as NULL pointer warnings in md.c

drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer
drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer

Add some braces to match the else-block as well.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fdefa4d8 21-Apr-2008 Nick Andrew <nick@nick-andrew.net>

RAID: remove trailing space from printk line

drivers/md/*.[ch] contains only one more printk line with a trailing space.
Remove it.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>


# 0e82989d 19-Mar-2008 NeilBrown <neilb@suse.de>

md: remove the 'super' sysfs attribute from devices in an 'md' array

Exposing the binary blob which is the md 'super-block' via sysfs doesn't
really fit with the whole sysfs model, and ever since commit
8118a859dc7abd873193986c77a8d9bdb877adc8 ("sysfs: fix off-by-one error
in fill_read_buffer()") it doesn't actually work at all (as the size of
the blob is often one page).

(akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG)

So just remove it altogether. It isn't really useful.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52720ae7 10-Mar-2008 NeilBrown <neilb@suse.de>

md: fix formatting error in /proc/mdstat

If an md array is "auto-read-only", then this appears in /proc/mdstat as

/dev/md0: active(auto-read-only)

whereas if it is truly readonly, it appears as

/dev/md0: active (read-only)

The difference being a space.

One program known to parse this file expects the space and gets badly
confused. It will be fixed, but it would be best if what the kernel generates
is more consistent too.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
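
Note: a parser that tolerates both spellings avoids the confusion described
above. The following is a minimal sketch only. It assumes the line shape
quoted in the commit text (real /proc/mdstat lines may differ), and the
function name and the "read-write" default label are hypothetical.

```python
import re

# Optional whitespace before the parenthesised mode accepts both
# "active(auto-read-only)" and "active (read-only)".
_STATE = re.compile(r"^(?P<dev>\S+):\s*active\s*(?:\((?P<mode>[^)]+)\))?")


def parse_state(line):
    """Return (device, mode) for an 'active' line, or None if it does not match."""
    m = _STATE.match(line.strip())
    if m is None:
        return None
    # No parenthesised mode is treated as plain read-write (an assumption).
    return m.group("dev"), m.group("mode") or "read-write"
```

Both example lines from the commit parse to the same device name, differing
only in the reported mode.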


# 27c529bb 04-Mar-2008 NeilBrown <neilb@suse.de>

md: lock access to rdev attributes properly

When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes. We
currently do that when we change an attribute, but not when we read an
attribute. We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.

So add appropriate locking (mddev_lock) to rdev_attr_show.

rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions. We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain. So take a copy of rdev->mddev for use at the end of the
function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 25156198 04-Mar-2008 NeilBrown <neilb@suse.de>

md: make sure a reshape is started when device switches to read-write

A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only. So whenever we mark it not read-only, it is important to
wake up the resync thread. There is one place we didn't do this.

The problem manifests if the start_ro module parameter is set, and a raid5
array that is in the middle of a reshape (restripe) is started. The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write). So the reshape will not proceed.

On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0fae18f 04-Mar-2008 NeilBrown <neilb@suse.de>

md: clean up irregularity with raid autodetect

When a raid1 array is stopped, all components currently get added to the list
for auto-detection. However we should really only add components that were
found by autodetection in the first place. So add a flag to record that
information, and use it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1801f85 04-Mar-2008 NeilBrown <neilb@suse.de>

md: guard against possible bad array geometry in v1 metadata

Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c32c2f63 14-Feb-2008 Jan Blunck <jblunck@suse.de>

d_path: Make seq_path() use a struct path argument

seq_path() is always called with a dentry and a vfsmount from a struct path.
Make seq_path() take it directly as an argument.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73c34431 06-Feb-2008 NeilBrown <neilb@suse.de>

md: change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.

Finish ITERATE_ to for_each conversion.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d089c6af 06-Feb-2008 NeilBrown <neilb@suse.de>

md: change ITERATE_RDEV to rdev_for_each

As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 29ac4aa3 06-Feb-2008 NeilBrown <neilb@suse.de>

md: change ITERATE_MDDEV to for_each_mddev

As this is more consistent with kernel style.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 20a49ff6 06-Feb-2008 NeilBrown <neilb@suse.de>

md: change a few 'int' to 'size_t' in md

As suggested by Andrew Morton.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 177a99b2 06-Feb-2008 NeilBrown <neilb@suse.de>

md: fix use-after-free bug when dropping an rdev from an md array

Due to possible deadlock issues we need to use scheduled work to kobject_del
an 'rdev' object from a different thread.

A recent change means that kobject_add no longer gets a reference, and
kobject_del doesn't put a reference. Consequently, we need to explicitly hold
a reference to ensure that the last reference isn't dropped before the
scheduled work gets a chance to call kobject_del.

Also, rename delayed_delete to md_delayed_delete so that it is more obvious in
a stack trace which code is to blame.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a17184a9 06-Feb-2008 NeilBrown <neilb@suse.de>

md: allow an md array to appear with 0 drives if it has external metadata

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca388059 06-Feb-2008 NeilBrown <neilb@suse.de>

md: lock address when changing attributes of component devices

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5d79adb 06-Feb-2008 NeilBrown <neilb@suse.de>

md: allow devices to be shared between md arrays

Currently, a given device is "claimed" by a particular array so that it cannot
be used by other arrays.

This is not ideal for DDF and other metadata schemes which have their own
partitioning concept.

So for externally managed metadata, just claim the device for md in general,
require that "offset" and "size" are set properly for each device, and make
sure that if a device is included in different arrays then the active sections
do not overlap.

This involves adding another flag to the rdev which makes it awkward to set
"->flags = 0" to clear certain flags. So now clear flags explicitly by name
when we want to clear things.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
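
Note: the non-overlap requirement for the active sections of a shared device
can be sketched as an interval test. This is an illustration under stated
assumptions, not the kernel implementation: each section is modelled as a
hypothetical (offset, size) pair in sectors, and the helper names are
invented for the example.

```python
def sections_overlap(a, b):
    """True if two (offset, size) sections on the same device share any sector."""
    a_start, a_size = a
    b_start, b_size = b
    # Half-open intervals [start, start + size) overlap iff each starts
    # before the other ends.
    return a_start < b_start + b_size and b_start < a_start + a_size


def can_share(existing, new):
    """True if `new` overlaps none of the sections already in use."""
    return not any(sections_overlap(e, new) for e in existing)
```

A section that begins exactly where another ends is accepted, while a
one-sector intrusion is rejected.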


# 1ec4a939 06-Feb-2008 NeilBrown <neilb@suse.de>

md: set and test the ->persistent flag for md devices more consistently

If you try to start an array for which the number of raid disks is listed as
zero, md will currently try to read metadata off any devices that have been
given. This was done because the value of raid_disks is used to signal
whether array details have been provided by userspace (raid_disks > 0) or must
be read from the devices (raid_disks == 0).

However for an array without persistent metadata (or with externally managed
metadata) this is the wrong thing to do. So we add a test in do_md_run to
give an error if raid_disks is zero for non-persistent arrays.

This requires that mddev->persistent is set correctly at this point, which it
currently isn't for in-kernel autodetected arrays.

So set ->persistent for autodetect arrays, and remove the setting in
super_*_validate which is now redundant.

Also clear ->persistent when stopping an array so it is consistently zero when
starting an array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6207277 06-Feb-2008 NeilBrown <neilb@suse.de>

md: allow a maximum extent to be set for resyncing

This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.

Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
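
Note: the rule for the written limit can be modelled as below. This is a
hedged sketch of the behaviour described in the commit text, not the sysfs
store routine itself; the function names are hypothetical, and
chunk_sectors == 0 is an assumed way of saying that a chunk size is not
meaningful for the array.

```python
def parse_sync_limit(text, chunk_sectors):
    """Parse a sector count written by userspace, enforcing chunk alignment."""
    sectors = int(text)
    if sectors < 0:
        raise ValueError("limit must not be negative")
    # Only enforce alignment when the array has a meaningful chunk size.
    if chunk_sectors and sectors % chunk_sectors:
        raise ValueError("limit must be a multiple of the chunk size")
    return sectors


def should_pause(position, limit):
    """The resync pauses once it has reached the limit."""
    return position >= limit
```

For a chunk size of 128 sectors, "1024" is accepted and "1000" is rejected.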


# c303da6d 06-Feb-2008 NeilBrown <neilb@suse.de>

md: give userspace control over removing failed devices when external metadata in use

When a device fails, we must not allow any further writes to the array until
the device failure has been recorded in array metadata. When metadata is
managed externally, this requires some synchronisation...

Allow/require userspace to explicitly remove failed devices from active
service in the array by writing 'none' to the 'slot' attribute. If this
reduces the number of failed devices to 0, the write block will automatically
be lowered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e691063a 06-Feb-2008 NeilBrown <neilb@suse.de>

md: support 'external' metadata for md arrays

- Add a state flag 'external' to indicate that the metadata is managed
externally (by user-space) so important changes need to be
left for user-space to handle.
Alternates are non-persistent ('none') where there is no stable metadata -
after the array is stopped there is no record of its status - and
internal which can be version 0.90 or version 1.x
These are selected by writing to the 'metadata' attribute.

- move the updating of superblocks (sync_sbs) to after we have checked if
there are any superblocks or not.

- New array state 'write_pending'. This means that the metadata records
the array as 'clean', but a write has been requested, so the metadata has
to be updated to record a 'dirty' array before the write can continue.
This change is reported to md by writing 'active' to the array_state
attribute.

- tidy up marking of sb_dirty:
- don't set sb_dirty when resync finishes as md_check_recovery
calls md_update_sb when the sync thread finishes anyway.
- Don't set sb_dirty in multipath_run as the array might not be dirty.
- don't mark superblock dirty when switching to 'clean' if there
is no internal superblock (if external, userspace can choose to
update the superblock whenever it chooses to).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c10997f6 20-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: convert drivers/* from kobject_unregister() to kobject_put()

There is no need for kobject_unregister() anymore, thanks to Kay's
kobject cleanup changes, so replace all instances of it with
kobject_put().


Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# f9cb074b 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: rename kobject_init_ng() to kobject_init()

Now that the old kobject_init() function is gone, rename
kobject_init_ng() to kobject_init() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# b2d6db58 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: rename kobject_add_ng() to kobject_add()

Now that the old kobject_add() function is gone, rename kobject_add_ng()
to kobject_add() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 649316b2 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: convert drivers/md/md.c to use kobject_init/add_ng()

This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# edfaa7c3 21-May-2007 Kay Sievers <kay.sievers@vrfy.org>

Driver core: convert block from raw kobjects to core devices

This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one
directory. For compatibility /sys/block is created and contains symlinks
to the disks.

/sys/class/block
|-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
|-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1
|-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10
|-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5
|-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6
|-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7
|-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8
|-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9
`-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

/sys/block/
|-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
`-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 3830c62f 17-Dec-2007 Greg Kroah-Hartman <gregkh@suse.de>

Kobject: change drivers/md/md.c to use kobject_init_and_add

Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.

Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 2ad8b1ef 07-Nov-2007 Alan D. Brunelle <Alan.Brunelle@hp.com>

Add UNPLUG traces to all appropriate places

Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# ba25f9dc 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

Use helpers to obtain task pid in printks

The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7f3d291 17-Oct-2007 Iustin Pop <iusty@k1024.org>

md: expose the degraded status of an assembled array through sysfs

The 'degraded' attribute is useful to quickly determine if the array is
degraded, instead of parsing 'mdadm -D' output or relying on the other
techniques (number of working devices against number of defined devices,
etc.). The md code already keeps track of this attribute, so it's useful to
export it.

Signed-off-by: Iustin Pop <iusty@k1024.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b12ab6d 17-Oct-2007 NeilBrown <neilb@suse.de>

md: 'sync_action' in sysfs returns wrong value for readonly arrays

When an array is started read-only, MD_RECOVERY_NEEDED can be set but no
recovery will be running. This causes 'sync_action' to report the wrong
value.

We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a
small gap after requesting a sync action, where 'sync_action' would still
report the old value.

So make sure that for a read-only array, 'sync_action' always returns 'idle'.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d936ec1 17-Oct-2007 Michael J. Evans <mjevans1983@gmail.com>

md: software Raid autodetect dev list not array

In current release kernels the md module (Software RAID) uses a static
array (dev_t[128]) to store partition/device info temporarily for
autostart.

I discovered this (and that the devices are added as disks/partitions are
discovered at boot) while I was debugging why only one of my MD arrays would
come up whole, while all the others were short a disk.

I eventually discovered that it was enumerating through all of 9 of my 11 hds
(2 had only 4 partitions apiece) while the other 9 have 15 partitions (I
wanted 64 per drive...). The last partition of the 8th drive in my 9 drive
raid 5 sets wasn't added, thus making the final md array short both a parity
and data disk, and it was started later, elsewhere.

This patch replaces that static array with a list.
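
As a rough userspace illustration of why a list removes the 128-entry ceiling,
the sketch below queues detected devices on a singly linked list that grows on
demand. All names here (detected_dev_sketch, note_detected_dev_sketch,
take_detected_dev_sketch) are hypothetical and are not taken from the patch;
the kernel code would use its own list helpers and allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued device; 'dev' is a stand-in for the kernel's dev_t. */
struct detected_dev_sketch {
	struct detected_dev_sketch *next;
	unsigned int dev;
};

static struct detected_dev_sketch *detected_head;
static struct detected_dev_sketch **detected_tail = &detected_head;

/* Append a device; returns 0 on success, -1 if allocation fails. */
static int note_detected_dev_sketch(unsigned int dev)
{
	struct detected_dev_sketch *node = malloc(sizeof(*node));

	if (!node)
		return -1;
	node->dev = dev;
	node->next = NULL;
	*detected_tail = node;
	detected_tail = &node->next;
	return 0;
}

/* Remove the oldest device; returns 1 and fills *dev, or 0 if empty. */
static int take_detected_dev_sketch(unsigned int *dev)
{
	struct detected_dev_sketch *node = detected_head;

	assert(dev != NULL);
	if (!node)
		return 0;
	*dev = node->dev;
	detected_head = node->next;
	if (!detected_head)
		detected_tail = &detected_head;
	free(node);
	return 1;
}
```

The only limit on the number of queued devices is available memory, so a late
partition on a heavily partitioned system is not silently dropped.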

[akpm@linux-foundation.org: removed unused var]
Signed-off-by: Michael J. Evans <mjevans1983@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd5d8062 16-Oct-2007 Jens Axboe <jens.axboe@oracle.com>

block: convert blkdev_issue_flush() to use empty barriers

Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 19c38de8 12-Sep-2007 Greg Kroah-Hartman <gregkh@suse.de>

kobjects: fix up improper use of the kobject name field

A number of different drivers incorrectly access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.


# 6712ecf8 26-Sep-2007 NeilBrown <neilb@suse.de>

Drop 'size' argument from bio_endio and bi_end_io

As bi_end_io is only called once when the request is complete,
the 'size' argument is now redundant. Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 165125e1 24-Jul-2007 Jens Axboe <jens.axboe@oracle.com>

[BLOCK] Get rid of request_queue_t typedef

Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweep of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 4ad13663 17-Jul-2007 NeilBrown <neilb@suse.de>

md: change bitmap_unplug and others to void functions

bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.

write_page returns an error which is not consistently checked. It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.

When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.

bitmap_update_sb returns an error that is never checked.

So make these 'void' and be consistent about checking the bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0d76d70 17-Jul-2007 NeilBrown <neilb@suse.de>

md: check that internal bitmap does not overlap other data

We currently completely trust user-space to set up metadata describing a
consistent array. In particular, that the metadata, data, and bitmap do not
overlap.

But userspace can be buggy, and it is better to report an error than corrupt
data. So put in some appropriate checks.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 713f6ab1 17-Jul-2007 NeilBrown <neilb@suse.de>

md: improve the is_mddev_idle test fix

Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing
we want to do with them is a signed comparison, and fix up the comment which
had become quite wrong.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df968c4e 17-Jul-2007 NeilBrown <neilb@suse.de>

md: improve message about invalid superblock during autodetect

People try to use raid auto-detect with version-1 superblocks (which is not
supported) and get confused when they are told they have an invalid
superblock.

So be more explicit, and say that it is not a valid v0.90 superblock.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83144186 17-Jul-2007 Rafael J. Wysocki <rjw@rjwysocki.net>

Freezer: make kernel threads nonfreezable by default

Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves. This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.

It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.

The patch causes all kernel threads to be nonfreezable by default (ie. to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE. It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear. Additionally, it updates documentation to
describe the freezing of tasks more accurately.

[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 685784aa 09-Jul-2007 Dan Williams <dan.j.williams@intel.com>

xor: make 'xor_blocks' a library routine for use with async_tx

The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise. Xor support is
implemented using the raid5 xor routines. For organizational purposes this
routine is moved to a common area.

The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk

Cc: Adrian Bunk <bunk@stusta.de>
Cc: NeilBrown <neilb@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# a778b73f 23-May-2007 NeilBrown <neilb@suse.de>

md: fix bug with linear hot-add and elsewhere

Adding a drive to a linear array seems to have stopped working, due to changes
elsewhere in md, and insufficient ongoing testing...

So the patch to make linear hot-add work in the first place introduced a
subtle bug elsewhere that interacts poorly with older versions of mdadm.

This fixes it all up.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 435b71be 10-May-2007 NeilBrown <neilb@suse.de>

md: improve the is_mddev_idle test

During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they are.
This test is done in is_mddev_idle, and it is somewhat fragile - it
sometimes thinks there is non-sync io when there isn't.

The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io). This has problems because total sectors gets
updated when a request completes, while resync io gets updated when the
request is submitted. The time difference can cause large differences
between the two which do not actually imply non-resync activity. The test
currently allows for some fuzz (+/- 4096) but there are some cases when it
is not enough.

The test currently looks for any (non-fuzz) difference, either positive or
negative. This clearly is not needed. Any non-sync activity will cause
the total sectors to grow faster than the sync_io count (never slower) so
we only need to look for a positive difference.

If we do this then the amount of in-flight sync io will never cause the
appearance of non-sync IO. Once enough non-sync IO to worry about starts
happening, resync will be slowed down and the measurements will thus be
more precise (as there is less in-flight) and control of resync will still
be suitably responsive.
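
A minimal userspace sketch of the one-sided test described above, assuming
per-device counters for total and resync sectors. The struct and function
names are hypothetical, and the 4096-sector fuzz is the figure quoted in this
message; this is not the kernel's actual is_mddev_idle code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fuzz allowance in sectors, as quoted in the message above. */
#define SYNC_FUZZ_SKETCH 4096L

/*
 * Hypothetical per-device counters: total_sectors stands in for the
 * completed-IO statistic, sync_sectors for the submitted resync IO.
 * Signed types are used because only a signed comparison is wanted.
 */
struct dev_io_sketch {
	long total_sectors;
	long sync_sectors;
	long last_events;	/* total - sync at the last non-idle check */
};

/* Returns true if no significant non-resync IO has been seen. */
static bool dev_is_idle_sketch(struct dev_io_sketch *d)
{
	long curr_events;

	assert(d != NULL);
	curr_events = d->total_sectors - d->sync_sectors;

	/* Only a positive jump beyond the fuzz counts as foreign IO. */
	if (curr_events - d->last_events > SYNC_FUZZ_SKETCH) {
		d->last_events = curr_events;
		return false;
	}
	return true;
}
```

In-flight resync IO makes the difference go negative, which this check
ignores, so it can no longer be mistaken for non-sync activity.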

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 44ce6294 09-May-2007 Linus Torvalds <torvalds@woody.linux-foundation.org>

Revert "md: improve partition detection in md array"

This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.

Quoth Neil Brown:

"It causes an oops when auto-detecting raid arrays, and it doesn't
seem easy to fix.

The array may not be 'open' when do_md_run is called, so
bdev->bd_disk might be NULL, so bd_set_size can oops.

This whole approach of opening an md device before it has been
assembled just seems to get more and more painful. I think I'm going
to have to come up with something clever to provide both backward
compatibility with usage expectation, and sane integration into the
rest of the kernel."

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b479c91 09-May-2007 NeilBrown <neilb@suse.de>

md: improve partition detection in md array

md currently uses ->media_changed to make sure rescan_partitions
is called on md arrays after they are assembled.

However that doesn't happen until the array is opened, which is later
than some people would like.

So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.

This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 08a02ecd 09-May-2007 NeilBrown <neilb@suse.de>

md: allow reshape_position for md arrays to be set via sysfs

"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).

When it is set, the number of drives, layout and chunksize can have
two possible values, an old and a new.

So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d167f09 09-May-2007 NeilBrown <neilb@suse.de>

md: stop using csum_partial for checksum calculation in md

If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it. We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.

So replace it with C code to do the same thing. Speed is not crucial here, so
something simple and correct is best.
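
To give a feel for what "simple and correct" plain C can look like, here is a
userspace sketch of an additive checksum over 32-bit words with the carries
folded back in. It is an assumption about the general shape of such code, not
the routine from this patch, and the name simple_csum_sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum 32-bit words into a 64-bit accumulator, then fold the carries
 * back into 32 bits. Illustrative only.
 */
static uint32_t simple_csum_sketch(const uint32_t *words, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	assert(words != NULL || nwords == 0);
	for (i = 0; i < nwords; i++)
		sum += words[i];

	/* End-around carry: add the high half into the low half, twice. */
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	return (uint32_t)((sum & 0xffffffffULL) + (sum >> 32));
}
```

The 64-bit accumulator avoids losing carries while summing, at the cost of a
little speed, which the message says is not crucial here.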

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e11e93fa 09-May-2007 NeilBrown <neilb@suse.de>

md: move test for whether level supports bitmap to correct place

We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.

With the test in the wrong place, a badly created array will confuse md rather
than produce sensible errors.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3f94b40 09-May-2007 Martin Peschke <mp3@de.ibm.com>

md: cleanup: use seq_release_private() where appropriate

We can save some lines of code by using seq_release_private().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 50511da3 09-May-2007 Ahmed S. Darwish <darwish.07@gmail.com>

drivers/md.c: Use ARRAY_SIZE macro when appropriate

Use ARRAY_SIZE macro already defined in kernel.h

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f98393a6 06-May-2007 Peter Zijlstra <a.p.zijlstra@chello.nl>

mm: remove destroy_dirty_buffers from invalidate_bdev()

Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
quilt add $file;
sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5792a285 04-Apr-2007 NeilBrown <neilb@suse.de>

[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs

A device can be removed from an md array via e.g.
echo remove > /sys/block/md3/md/dev-sde/state

This will try to remove the 'dev-sde' subtree which will deadlock
since
commit e7b0d26a86943370c04d6833c6edba2a72a6e240

With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e55e2f5 26-Mar-2007 NeilBrown <neilb@suse.de>

[PATCH] md: convert compile time warnings into runtime warnings

... still not sure why we need this ....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 041ae52e 26-Mar-2007 NeilBrown <neilb@suse.de>

[PATCH] md: clear the congested_fn when stopping a raid5

If this mddev and queue got reused for another array that doesn't register a
congested_fn, this function would get called incorrectly.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b4c4c7b8 28-Feb-2007 NeilBrown <neilb@suse.de>

[PATCH] md: restart a (raid5) reshape that has been aborted due to a read/write error

An error always aborts any resync/recovery/reshape on the understanding that
it will immediately be restarted if that still makes sense. However a reshape
currently doesn't get restarted. With this patch it does.

To avoid restarting when it is not possible to do work, we call into the
personality to check that a reshape is ok, and strengthen raid5_check_reshape
to fail if there are too many failed devices.

We also break some code out into a separate function: remove_and_add_spares as
the indent level for that code was getting crazy.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1b5380c 28-Feb-2007 NeilBrown <neilb@suse.de>

[PATCH] md: clean out unplug and other queue function on md shutdown

The mddev and queue might be used for another array which does not set these,
so they need to be cleared.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7dd5e7c3 28-Feb-2007 NeilBrown <neilb@suse.de>

[PATCH] md: move warning about creating a raid array on partitions of the one device

md tries to warn the user if they e.g. create a raid1 using two partitions of
the same device, as this does not provide true redundancy.

However it also warns if a raid0 is created like this, and there is nothing
wrong with that.

At the place where the warning is currently printed, we don't necessarily know
what level the array will be, so move the warning from the point where the
device is added to the point where the array is started.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b4d4147 14-Feb-2007 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow newly registered
sysctl entries to override existing sysctl entries of the same name, which is
a pain for caching and which the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (except for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancements harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff1d28ef 14-Feb-2007 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] sysctl: md: remove unnecessary insert_at_head flag

The sysctls used by the md driver have unique binary numbers, so remove the
insert_at_head flag as it serves no useful purpose.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa027c2a 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com>

[PATCH] mark struct file_operations const 4

Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

[akpm@sdl.org: dvb fix]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a2275d6 26-Jan-2007 NeilBrown <neilb@suse.de>

[PATCH] md: fix potential memalloc deadlock in md

If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
it is possible for a deadlock to eventuate.

This happens if the array was marked 'clean', and the memalloc triggers a
write-out to the md device.

For the writeout to succeed, the array must be marked 'dirty', and that
requires getting the mddev_lock.

So, before attempting a GFP_KERNEL allocation while holding the lock, make
sure the array is marked 'dirty' (unless it is currently read-only).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1031be7a 26-Jan-2007 NeilBrown <neilb@suse.de>

[PATCH] md: make sure the events count in an md array never returns to zero

Now that we sometimes step the array events count backwards (when
transitioning dirty->clean where nothing else interesting has happened - so
that we don't need to write to spares all the time), it is possible for the
event count to return to zero, which is potentially confusing and triggers an
MD_BUG.

We could possibly remove the MD_BUG, but it is just as easy, and probably safer,
to make sure we never return to zero.
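
One way to picture the guard, as a sketch only: step the count back when that
is allowed and would not land on zero, otherwise step it forward. The function
name and the can_step_back flag are hypothetical and stand in for whatever
conditions the real code checks.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the next events value. Stepping back from 1 would yield 0,
 * so in that case the count moves forward instead.
 */
static uint64_t next_events_sketch(uint64_t events, bool can_step_back)
{
	if (can_step_back && events > 1)
		return events - 1;
	return events + 1;
}
```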

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3f9d7b0d 22-Dec-2006 NeilBrown <neilb@suse.de>

[PATCH] md: fix a few problems with the interface (sysfs and ioctl) to md

While developing more functionality in mdadm I found some bugs in md...

- When we remove a device from an inactive array (write 'remove' to
the 'state' sysfs file - see 'state_store') we should not
update the superblock information - as we may not have
read and processed it all properly yet.

- initialise all raid_disk entries to '-1' else the 'slot' sysfs file
will claim '0' for all devices in an array before the array is
started.

- allow '\n' to be present at the end of words written to
sysfs files

- when we use SET_ARRAY_INFO to set the md metadata version,
set the flag to say that there is persistent metadata.

- allow GET_BITMAP_FILE to be called on an array that hasn't
been started yet.
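
For the item about '\n' at the end of words written to sysfs files, a
comparison that tolerates one trailing newline (as produced by a plain
"echo remove > .../state") could look roughly like the sketch below. The name
word_matches_sketch is hypothetical, and this is an illustration of the idea
rather than the code in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if 'written' equals 'expected', optionally followed by a
 * single trailing newline; return 0 otherwise.
 */
static int word_matches_sketch(const char *written, const char *expected)
{
	assert(written != NULL && expected != NULL);

	/* Walk both strings while they agree. */
	while (*written && *expected && *written == *expected) {
		written++;
		expected++;
	}
	/* Accept one newline after the matched prefix. */
	if (*written == '\n')
		written++;
	/* Both strings must now be fully consumed. */
	return *written == '\0' && *expected == '\0';
}
```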

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 17571284 10-Dec-2006 NeilBrown <neilb@suse.de>

[PATCH] md: assorted md and raid1 one-liners

Fix a few bugs that meant that:
- superblocks weren't always written at exactly the right time (this
could show up if the array was not written to - writing to the array
causes lots of superblock updates and so hides these errors).

- restarting device recovery after a clean shutdown (version-1 metadata
only) didn't work as intended (or at all).

1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
The body of this if takes one of two branches depending on whether
MD_RECOVERY_SYNC is set, so testing it in the clause of the if
is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the need to restart a recovery in the middle (version-1
metadata only) make sure a full recovery (not just as guided by
bitmaps) does get done.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fdee8ae4 10-Dec-2006 Jeff Garzik <jeff@garzik.org>

[PATCH] MD: conditionalize some code

The autorun code is only used if this module is built into the static
kernel image. Adjust #ifdefs accordingly.

Signed-off-by: Jeff Garzik <jeff@garzik.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0d4ca600 10-Dec-2006 NeilBrown <neilb@suse.de>

[PATCH] md: tidy up device-change notification when an md array is stopped

An md array can be stopped leaving all the settings still in place, or it can
be torn down and destroyed. set_capacity and other change notifications only
happen in the latter case, but should happen in both.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c649bb9c 08-Dec-2006 Josef Sipek <jsipek@fsl.cs.sunysb.edu>

[PATCH] struct path: convert md

Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d63a5a74 08-Dec-2006 NeilBrown <neilb@suse.de>

[PATCH] lockdep: avoid lockdep warning in md

md_open takes ->reconfig_mutex which causes lockdep to complain. This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.

I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen. However that causes bigger problems than a deadlock
and should be fixed independently.

So we flag the lock in md_open as a nested lock. This requires defining
mutex_lock_interruptible_nested.

Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2e7b651d 08-Dec-2006 Peter Zijlstra <a.p.zijlstra@chello.nl>

[PATCH] remove the old bd_mutex lockdep annotation

Remove the old complex and crufty bd_mutex annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7dfb7103 06-Dec-2006 Nigel Cunningham <ncunningham@linuxmail.org>

[PATCH] Add include/linux/freezer.h and move definitions from sched.h

Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.

[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4b438a23 08-Nov-2006 Rafael J. Wysocki <rjw@rjwysocki.net>

[PATCH] md: do not freeze md threads for suspend

If there's a swap file on a software RAID, it should be possible to use this
file for saving the swsusp's suspend image. Also, this file should be
available to the memory management subsystem when memory is being freed before
the suspend image is created.

For the above reasons it seems that md_threads should not be frozen during the
suspend and the appended patch makes this happen, but then there is the
question if they don't cause any data to be written to disks after the suspend
image has been created, provided that all filesystems are frozen at that time.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2f471303 08-Nov-2006 NeilBrown <neilb@suse.de>

[PATCH] md: change ONLINE/OFFLINE events to a single CHANGE event

It turns out that CHANGE is preferred to ONLINE/OFFLINE for various reasons
(not least of which being that udev understands it already).

So remove the recently added KOBJ_OFFLINE (no-one is likely to care anyway)
and change the ONLINE to a CHANGE event

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7870db4c 02-Nov-2006 NeilBrown <neilb@suse.de>

[PATCH] md: send online/offline uevents when an md array starts/stops

This allows udev to do something intelligent when an array becomes
available.

Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 01ab5662 28-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: simplify checking of available size when resizing an array

When "mdadm --grow --size=xxx" is used to resize an array (use more or less of
each device), we check the new size against the available space in each
device.

We already have that number recorded in rdev->size, so calculating it is
pointless (and wrong in one obscure case).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2b6e8459 28-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: fix bug where spares don't always get rebuilt properly when they become live

If save_raid_disk is >= 0, then the device could be a device that is already
in sync that is being re-added. So we need to default this value to -1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1c05b4bc 21-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: endian annotation for v1 superblock access

Includes a couple of bugfixes found by sparse.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e24650c2 17-Oct-2006 Akinobu Mita <akinobu.mita@gmail.com>

[PATCH] md: fix /proc/mdstat refcounting

I have seen mdadm oops after successfully unloading md module.

This patch prevents the md module from being unloaded while
mdadm is polling /proc/mdstat.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Akinbou Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5842730d 06-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: fix bug where new drives added to an md array sometimes don't sync properly

This fixes a bug introduced in 2.6.18.

If a drive is added to a raid1 using older tools (mdadm-1.x or raidtools)
then it will be included in the array without any resync happening.

It has been submitted for 2.6.18.1.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 52e5f9d1 03-Oct-2006 Eric Sesterhenn <snakebyte@gmx.de>

BUG_ON cleanup for drivers/md/

This changes two if() BUG(); usages to BUG_ON(); so people
can disable it safely.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 3a0f5bbb 03-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: add error reporting to superblock write failure

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e8703fe1 03-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: remove MAX_MD_DEVS which is an arbitrary limit

Once upon a time we needed a fixed limit on the number of md devices,
probably because we preallocated some array. This need no longer exists, but
we still have an arbitrary limit.

So remove MAX_MD_DEVS and allow as many devices as we can fit into the 'minor'
part of a device number.

Also remove some useless noise at init time (which reports MAX_MD_DEVS) and
remove MD_THREAD_NAME_MAX which hasn't been used for a while.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 61df9d91 03-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: make messages about resync/recovery etc more specific

It is possible to request a 'check' of an md/raid array where the whole array
is read and inconsistencies are reported.

This uses the same mechanisms as 'resync' and so reports in the kernel logs
that a resync is being started. This understandably confuses/worries people.

Also the text in /proc/mdstat suggests a 'resync' is happening when it is just a
check.

This patch changes those messages to be more specific about what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9b1d1dac 03-Oct-2006 Paul Clements <paul.clements@steeleye.com>

[PATCH] md: new sysfs interface for setting bits in the write-intent-bitmap

Add a new sysfs interface that allows the bitmap of an array to be dirtied.
The interface is write-only, and is used as follows:

echo "1000" > /sys/block/md2/md/bitmap

(dirty the bit for chunk 1000 [offset 0] in the in-memory and on-disk
bitmaps of array md2)

echo "1000-2000" > /sys/block/md1/md/bitmap

(dirty the bits for chunks 1000-2000 in md1's bitmap)

This is useful, for example, in cluster environments where you may need to
combine two disjoint bitmaps into one (following a server failure, after a
secondary server has taken over the array). By combining the bitmaps on
the two servers, a full resync can be avoided (This was discussed on the
list back on March 18, 2005, "[PATCH 1/2] md bitmap bug fixes" thread).
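
As an illustration of the input format described above, here is a minimal
userspace sketch that parses either a single chunk number ("1000") or an
inclusive range ("1000-2000"). The helper name parse_chunk_range() and its
error convention are assumptions made for this example; this is not the
kernel's actual store routine.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "N" or "N-M" (decimal chunk numbers) into an
 * inclusive range [*start, *end]. A single trailing newline, as produced
 * by echo, is tolerated. Returns 0 on success, -EINVAL on malformed input
 * or a reversed range.
 */
int parse_chunk_range(const char *s, unsigned long *start, unsigned long *end)
{
	char *p;

	if (*s < '0' || *s > '9')
		return -EINVAL;
	errno = 0;
	*start = strtoul(s, &p, 10);
	if (errno)
		return -EINVAL;
	*end = *start;		/* a single number is a one-chunk range */

	if (*p == '-') {
		const char *q = p + 1;

		if (*q < '0' || *q > '9')
			return -EINVAL;
		*end = strtoul(q, &p, 10);
		if (errno)
			return -EINVAL;
	}
	if (*p == '\n')
		p++;
	if (*p != '\0' || *end < *start)
		return -EINVAL;
	return 0;
}
```

A caller would then dirty every chunk from start to end inclusive.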

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 850b2b42 03-Oct-2006 NeilBrown <neilb@suse.de>

[PATCH] md: replace magic numbers in sb_dirty with well defined bit flags

Instead of magic numbers (0,1,2,3) in sb_dirty, we now have
some flags:
  MD_CHANGE_DEVS
     Some device state has changed requiring superblock update
     on all devices.
  MD_CHANGE_CLEAN
     The array has transitioned from 'clean' to 'dirty' or back,
     requiring a superblock update on active devices, but possibly
     not on spares
  MD_CHANGE_PENDING
     A superblock update is underway.

We wait for an update to complete by waiting for all flags to be clear. A
flag can be set at any time, even during an update, without risk that the
change will be lost.
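
The reason a flag set during an update is not lost can be shown with a small
single-threaded model. This is only a sketch: the bit values, the function
names, and the use of a plain unsigned int (rather than atomic bit operations)
are assumptions made for illustration.

```c
#include <stdbool.h>

/* Bit values are assumptions for this sketch. */
enum {
	MD_CHANGE_DEVS    = 1u << 0,
	MD_CHANGE_CLEAN   = 1u << 1,
	MD_CHANGE_PENDING = 1u << 2,
};

/*
 * Begin an update: consume the change requests seen so far and record
 * that an update is under way. Returns the requests being serviced.
 */
unsigned int sb_update_begin(unsigned int *flags)
{
	unsigned int todo = *flags & (MD_CHANGE_DEVS | MD_CHANGE_CLEAN);

	*flags &= ~todo;
	*flags |= MD_CHANGE_PENDING;
	return todo;
}

/*
 * Finish an update. If a change flag was set while the write was in
 * flight, PENDING stays set and the caller must write again.
 */
bool sb_update_end(unsigned int *flags)
{
	if (*flags & (MD_CHANGE_DEVS | MD_CHANGE_CLEAN))
		return true;
	*flags &= ~MD_CHANGE_PENDING;
	return false;
}

/* Waiters consider the update complete only when every flag is clear. */
bool sb_update_done(unsigned int flags)
{
	return flags == 0;
}
```

Because a late request leaves a bit set, sb_update_done() stays false until a
further write has covered it.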

Stop exporting md_update_sb - it isn't needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fbedac04 03-Oct-2006 Adrian Bunk <bunk@stusta.de>

[PATCH] md: the scheduled removal of the START_ARRAY ioctl for md

This patch contains the scheduled removal of the START_ARRAY ioctl for md.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 84692195 27-Aug-2006 NeilBrown <neilb@suse.de>

[PATCH] md: avoid backward event updates in md superblock when degraded.

If we
- shut down a clean array,
- restart with one (or more) drive(s) missing
- make some changes
- pause, so that the array gets marked 'clean',
the event count on the superblock of included drives
will be the same as that of the removed drives.
So adding the removed drive back in will cause it
to be included with no resync.

To avoid this, we only update the eventcount backwards when the array
is not degraded. In this case there can (should) be no non-connected
drives that we can get confused with, and this is the particular case
where updating-backwards is valuable.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d0a0a5ee 10-Jul-2006 Andrew Morton <akpm@osdl.org>

[PATCH] md: fix oops in error-handling

During early MD setup (superblock reading), we don't have a personality yet.
But the error-handling code tries to dereference mddev->pers. Fix.

Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 67463acb 10-Jul-2006 NeilBrown <neilb@suse.de>

[PATCH] md: require CAP_SYS_ADMIN for (re-)configuring md devices via sysfs

The ioctl requires CAP_SYS_ADMIN, so sysfs should too. Note that we don't
require CAP_SYS_ADMIN for reading attributes even though the ioctl does.
There is no reason to limit the read access, and much of the information is
already available via /proc/mdstat

Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 80ca3a44 10-Jul-2006 NeilBrown <neilb@suse.de>

[PATCH] md: unify usage of symbolic names for perms

In some places we use numbers (0660), in others names (S_IRUGO). Change all
numbers to be names, and change 0655 to be what it should be.

Also make some formatting more consistent.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ff4e8d9a 10-Jul-2006 NeilBrown <neilb@suse.de>

[PATCH] md: fix resync speed calculation for restarted resyncs

We introduced 'io_sectors' recently so we could count the sectors that cause
IO during resync separately from sectors which didn't cause IO - there can be a
difference if a bitmap is being used to accelerate resync.

However when a speed is reported, we find the number of sectors processed
recently by subtracting an oldish io_sectors count from a current
'curr_resync' count. This is wrong because curr_resync counts all sectors,
not just io sectors.

So, add a field to mddev to store the current io_sectors separately from
curr_resync, and use that in the calculations.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0b8c9de0 10-Jul-2006 NeilBrown <neilb@suse.de>

[PATCH] md: delay starting md threads until array is completely setup

When an array is started we start one or two threads (two if there is a
reshape or recovery that needs to be completed).

We currently start these *before* the array is completely set up and in
particular before queue->queuedata is set. If the thread actually starts
very quickly on another CPU, we can end up dereferencing queue->queuedata
and oops.

This patch also makes sure we don't try to start a recovery if a reshape is
being restarted.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 31b65a0d 10-Jul-2006 NeilBrown <neilb@suse.de>

[PATCH] md: set desc_nr correctly for version-1 superblocks

This has to be done in ->load_super, not ->validate_super

Without this, hot-adding devices to an array doesn't always
work right - though there is a work around in mdadm-2.5.2 to
make this less of an issue.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 663d440e 03-Jul-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] lockdep: annotate blkdev nesting

Teach special (recursive) locking code to the lock validator.

Effects on non-lockdep kernels:

- the introduction of the following function variants:

extern struct block_device *open_partition_by_devnum(dev_t, unsigned);

extern int blkdev_put_partition(struct block_device *);

static int
blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags);

which on non-lockdep are the same as open_by_devnum(), blkdev_put()
and blkdev_get().

- a subclass parameter to do_open(). [unused on non-lockdep]

- a subclass parameter to __blkdev_put(), which is a new internal
function for the main blkdev_put*() functions. [parameter unused
on non-lockdep kernels, except for two sanity check WARN_ON()s]

these functions carry no semantical difference - they only express
object dependencies towards the lockdep subsystem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# ce7b0f46 20-Jun-2005 Greg Kroah-Hartman <gregkh@suse.de>

[PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed

And remove the now unneeded number field.
Also fixes all drivers that set these fields.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# ff23eca3 20-Jun-2005 Greg Kroah-Hartman <gregkh@suse.de>

[PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree

Also fixes up all files that #include it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 8ab5e4c1 20-Jun-2005 Greg Kroah-Hartman <gregkh@suse.de>

[PATCH] devfs: Remove devfs_remove() function from the kernel tree

Removes the devfs_remove() function and all callers of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 1a715c5c 20-Jun-2005 Greg Kroah-Hartman <gregkh@suse.de>

[PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree

Removes the devfs_mk_bdev() function and all callers of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 95dc112a 20-Jun-2005 Greg Kroah-Hartman <gregkh@suse.de>

[PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree

Removes the devfs_mk_dir() function and all callers of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 05381954 26-Jun-2006 Adrian Bunk <bunk@stusta.de>

[PATCH] drivers/md/md.c: make code static

Make needlessly global code static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f655675b 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Allow the write_mostly flag to be set via sysfs

It appears in /sys/block/mdX/md/dev-YYY/state
and can be set or cleared by writing 'writemostly' or '-writemostly'
respectively.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a94213b1 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Allow resync_start to be set and queried via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d4dbd025 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Allow raid 'layout' to be read and set via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 45dc2de1 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Allow rdev state to be set via sysfs

The md/dev-XXX/state file can now be written:

"faulty" simulates an error on the device
"remove" removes the device from the array (if it is not busy)

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9e653b63 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Set/get state of array via sysfs

This allows the state of an md/array to be directly controlled via sysfs and
adds the ability to stop an array without tearing it down.

Array states/settings:

  clear
     No devices, no size, no level
     Equivalent to STOP_ARRAY ioctl
  inactive
     May have some settings, but array is not active
       all IO results in error
     When written, doesn't tear down array, but just stops it
  suspended (not supported yet)
     All IO requests will block. The array can be reconfigured.
     Writing this, if accepted, will block until array is quiescent
  readonly
     no resync can happen. no superblocks get written.
     write requests fail
  read-auto
     like readonly, but behaves like 'clean' on a write request.

  clean - no pending writes, but otherwise active.
     When written to inactive array, starts without resync
     If a write request arrives then
       if metadata is known, mark 'dirty' and switch to 'active'.
       if not known, block and switch to write-pending
     If written to an active array that has pending writes, then fails.
  active
     fully active: IO and resync can be happening.
     When written to inactive array, starts with resync

  write-pending (not supported yet)
     clean, but writes are blocked waiting for 'active' to be written.

  active-idle
     like active, but no writes have been seen for a while (100msec).
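
One plausible way to map the words listed above onto an enumeration is a
name table plus a lookup that tolerates the trailing newline a sysfs write
usually carries. The identifiers below (ST_*, state_names, match_state) are
assumptions made for this sketch and need not match the names used in md.c.

```c
#include <string.h>

/* Order mirrors the list above; ST_BAD_WORD marks an unknown word. */
enum array_state {
	ST_CLEAR, ST_INACTIVE, ST_SUSPENDED, ST_READONLY, ST_READ_AUTO,
	ST_CLEAN, ST_ACTIVE, ST_WRITE_PENDING, ST_ACTIVE_IDLE, ST_BAD_WORD
};

static const char *const state_names[] = {
	"clear", "inactive", "suspended", "readonly", "read-auto",
	"clean", "active", "write-pending", "active-idle"
};

/* Return the state named by 'word', ignoring one trailing newline. */
enum array_state match_state(const char *word)
{
	size_t len = strlen(word);
	int i;

	if (len > 0 && word[len - 1] == '\n')
		len--;
	for (i = 0; i < ST_BAD_WORD; i++)
		if (strlen(state_names[i]) == len &&
		    strncmp(word, state_names[i], len) == 0)
			return (enum array_state)i;
	return ST_BAD_WORD;
}
```

Comparing lengths first keeps "active" from matching "active-idle".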

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 42543769 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Don't write dirty/clean update to spares - leave them alone

- record the 'event' count on each individual device (they
  might sometimes be slightly different now)
- add a new value for 'sb_dirty': '3' means that the super
  block only needs to be updated to record a clean<->dirty
  transition.
- Prefer odd event numbers for dirty states and even numbers
  for clean states
- Using all the above, don't update the superblock on
  a spare device if the update is just doing a clean-dirty
  transition. To accommodate this, a transition from
  dirty back to clean might now decrement the events counter
  if nothing else has changed.

The net effect of this is that spare drives will not see any IO requests
during normal running of the array, so they can go to sleep if that is what
they want to do.
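
The parity rule in the list above can be modelled in a few lines. This is a
simplified sketch under stated assumptions, not the kernel's update logic:
next_events() and its two boolean parameters are hypothetical, and the
"bump twice to fix parity" step is one reading of "prefer odd for dirty,
even for clean".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: compute the next event count.
 *   clean_dirty_only: nothing changed except the clean<->dirty state
 *   to_clean:         the array is becoming clean (else dirty)
 */
uint64_t next_events(uint64_t events, bool clean_dirty_only, bool to_clean)
{
	/* dirty -> clean with nothing else changed: step back to even */
	if (clean_dirty_only && to_clean && (events & 1))
		return events - 1;

	events++;
	/* keep parity in line with state: even for clean, odd for dirty */
	if (((events & 1) == 0) != to_clean)
		events++;
	return events;
}
```

With this model a clean/dirty/clean cycle on an otherwise unchanged array
ends on the count it started with, so a spare whose superblock was not
rewritten still carries a matching count.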

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 07d84d10 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Allow re-add to work on array without bitmaps

When an array has a bitmap, a device can be removed and re-added and only
blocks changed since the removal (as recorded in the bitmap) will be resynced.

It should be possible to do a similar thing to arrays without bitmaps. i.e.
if a device is removed and re-added and *no* changes have been made in the
interim, then the add should not require a resync.

This patch allows that option. This means that when assembling an array one
device at a time (e.g. during device discovery) the array can be enabled
read-only as soon as enough devices are available, but extra devices can still
be added without causing a resync.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# acc55e22 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md/bitmap: tidy up i_writecount handling in md/bitmap

md/bitmap modifies i_writecount of a bitmap file to make sure that no-one else
writes to it. The reverting of the change is sometimes done twice, and there
is one error path where it is omitted.

This patch tidies that up.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d7375ab3 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md/bitmap: fix online removal of file-backed bitmaps

When "mdadm --grow /dev/mdX --bitmap=none" is used to remove a file-backed
bitmap, the bitmap was disconnected from the array, but the file wasn't closed
(until the array was stopped).

The file also wasn't closed if adding the bitmap file failed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5e56341d 26-Jun-2006 Adrian Bunk <bunk@stusta.de>

[PATCH] md: make md_print_devices() static

This patch makes the needlessly global md_print_devices() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7c7546cc 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow a linear array to have drives added while active

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5fd6c1dc 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow checkpoint of recovery with version-1 superblock

For a while we have had checkpointing of resync. The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.

Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a8a55c38 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: remove nuisance message at shutdown

At shutdown, we switch all arrays to read-only, which creates a message for
every instantiated array, even those which aren't actually active.

So remove the message for non-active arrays.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 16f17b39 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: increase the delay before marking metadata clean, and make it configurable

When an md array has been idle (no writes) for 20msecs it is marked as 'clean'.
This delay turns out to be too short for some real workloads. So increase it
to 200msec (the time to update the metadata should be a tiny fraction of that)
and make it sysfs-configurable.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9443a1d1 26-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] md: remove useless ioctl warning

This warning was slightly useful back in 2.2 days, but is more an annoyance
now. It makes it awkward to add new ioctls (not that we are likely to do that
in the current climate, but it is possible).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c331eb04 30-May-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Fix badness in sysfs_notify caused by md_new_event

From: NeilBrown <neilb@suse.de>

If an error is reported by a drive in a RAID array (which is done via
bi_end_io - in interrupt context), we call md_error and md_new_event which
calls sysfs_notify. However sysfs_notify grabs a mutex and so cannot be
called in interrupt context.

This patch just creates a variant of md_new_event which avoids the sysfs
call, and uses that. A better fix for later is to arrange for the event to
be called from user-context.

Note: avoiding the sysfs call isn't a problem as an error will not, by
itself, modify the sync_action attribute. (We do still need to
wake_up(&md_event_waiters) as an error by itself will modify /proc/mdstat).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c71d4887 25-May-2006 Neil Brown <neilb@suse.de>

[PATCH] Unlock md devices when stopping them on reboot.

otherwise we get nasty messages about locks not being released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2adc7d47 20-May-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Fix inverted test for 'repair' directive.

We should be able to write 'repair' to /sys/block/mdX/md/sync_action,
however due to an inverted test, that always gives EINVAL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5dc5cf7d 20-Apr-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] md: locking fix

- fix mddev_lock() usage bugs in md_attr_show() and md_attr_store().
[they did not anticipate the possibility of getting a signal]

- remove mddev_lock_uninterruptible() [unused]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4508a7a7 19-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] sysfs: Allow sysfs attribute files to be pollable

It works like this:
  Open the file
  Read all the contents.
  Call poll requesting POLLERR or POLLPRI (so select/exceptfds works)
  When poll returns,
     close the file and go to top of loop.
   or lseek to start of file and go back to the 'read'.

Events are signaled by an object manager calling
sysfs_notify(kobj, dir, attr);

If the dir is non-NULL, it is used to find a subdirectory which
contains the attribute (presumably created by sysfs_create_group).

This has a cost of one int per attribute, one wait_queuehead per kobject,
one int per open file.

The name "sysfs_notify" may be confused with the inotify
functionality. Maybe it would be nice to support inotify for sysfs
attributes as well?

This patch also uses sysfs_notify to allow /sys/block/md*/md/sync_action
to be pollable
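
From userspace, the read/poll/lseek loop described above might look roughly
like the following. This is a minimal sketch: read_attr() and wait_attr() are
hypothetical helper names, and error handling is reduced to return codes.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Rewind and read the attribute into buf (NUL-terminated).
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_attr(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0 || lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}

/*
 * Block until the attribute is signalled or timeout_ms elapses
 * (-1 waits forever). Returns the result of poll().
 */
int wait_attr(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLERR | POLLPRI };

	return poll(&pfd, 1, timeout_ms);
}
```

A monitor would open the attribute once, then alternate read_attr() and
wait_attr(fd, -1). The contents have to be read before each poll, as in the
steps above.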

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 926ce2d8 31-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Remove some code that can sleep from under a spinlock

And remove the comments that were put in place of a fix too....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# df5b89b3 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Convert reconfig_sem to reconfig_mutex

... being careful that mutex_trylock is inverted wrt down_trylock

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 48c9c27b 27-Mar-2006 Arjan van de Ven <arjan@infradead.org>

[PATCH] sem2mutex: drivers/md

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8ddeeae5 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Fix md grow/size code to correctly find the maximum available space

An md array can be asked to change the amount of each device that it is using,
and in particular can be asked to use the maximum available space. This
currently only works if the first device is not larger than the rest, because
'size' gets changed and so 'fit' becomes wrong. So check if a 'fit' is
required early and don't corrupt it.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e464eafd 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Support suspending of IO to regions of an md array

This allows user-space to access data safely. This is needed for raid5
reshape as user-space needs to take a backup of the first few stripes before
allowing reshape to commence.

It will also be useful in cluster-aware raid1 configurations so that all
cluster members can leave a section of the array untouched while a
resync/recovery happens.

A 'start' and 'end' of the suspended range are written to 2 sysfs attributes.
Note that only one range can be suspended at a time.
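
A check against the single suspended range could be sketched as below. The
half-open interpretation [suspend_lo, suspend_hi), the treatment of an empty
range as "nothing suspended", and the function name are all assumptions made
for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: does a request covering nr_sectors starting at
 * 'sector' overlap the suspended range [suspend_lo, suspend_hi)?
 * An empty range (lo >= hi) suspends nothing.
 */
bool io_is_suspended(uint64_t sector, uint64_t nr_sectors,
		     uint64_t suspend_lo, uint64_t suspend_hi)
{
	return suspend_lo < suspend_hi &&
	       sector < suspend_hi &&
	       sector + nr_sectors > suspend_lo;
}
```

A request for which this returns true would wait until user-space moves or
clears the range.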

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 16484bf5 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Make 'reshape' a possible sync_action action

This allows reshape to be triggered via sysfs (which is the only way to start
it happening).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 63c70c4f 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Split reshape handler in check_reshape and start_reshape

check_reshape checks validity and does things that can be done instantly -
like adding devices to raid1. start_reshape initiates a restriping process to
convert the whole array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f6705578 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Checkpoint and allow restart of raid5 reshape

We allow the superblock to record an 'old' and a 'new' geometry, and a
position where any conversion is up to. The geometry allows for changing
chunksize, layout and level as well as number of devices.

When using a version-0.90 superblock, we convert the version to 0.91 while the
conversion is happening so that an old kernel will refuse to assemble the
array. For version-1, we use a feature bit for the same effect.

When starting an array we check for an incomplete reshape and restart the
reshape process if needed. If the reshape stopped at an awkward time (like
when updating the first stripe) we refuse to assemble the array, and let
user-space worry about it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 29269553 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Final stages of raid5 expand code

This patch adds raid5_reshape and end_reshape which will start and finish the
reshape processes.

raid5_reshape is only enabled if CONFIG_MD_RAID5_RESHAPE is set, to discourage
accidental use.

Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry.

And make sure that you have backups, just in case.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ccfcc3c1 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Core of raid5 resize process

This patch provides the core of the resize/expand process.

sync_request notices if a 'reshape' is happening and acts accordingly.

It allocates new stripe_heads for the next chunk-wide-stripe in the target
geometry, marking them STRIPE_EXPANDING.

Then it finds which stripe heads in the old geometry can provide data needed
by these and marks them STRIPE_EXPAND_SOURCE. This causes stripe_handle to
read all blocks on those stripes.

Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are
needed are copied into the corresponding STRIPE_EXPANDING stripe_head. Once a
STRIPE_EXPANDING stripe_head is full, it is marked STRIPE_EXPAND_READY and then
is written out and released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ad01c9e3 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Allow stripes to be expanded in preparation for expanding an array

Before a RAID-5 can be expanded, we need to be able to expand the stripe-cache
data structure.

This requires allocating new stripes in a new kmem_cache. If this succeeds,
we copy cache pages over and release the old stripes and kmem_cache.

We then allocate new pages. If that fails, we leave the stripe cache at its
new size. It isn't worth the effort to shrink it back again.

Unfortunately this means we need two kmem_cache names as, for a short
period of time, we have two kmem_caches. So they are raid5/%s and
raid5/%s-alt

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4588b42e 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Update status_resync to handle LARGE devices

status_resync - used by /proc/mdstat to report the status of a resync - assumes
that device sizes will always fit into an 'unsigned long'. This is no longer
the case...

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1be7892f 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Fix the 'failed' count for version-0 superblocks

We are counting failed devices twice, once for the device that is failed, and
once for the hole that has been left in the array. Remove the former so
'failed' matches 'missing'. Storing these counts in the superblock is a bit
silly anyway....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c5a10f62 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Add '4' to the list of levels for which bitmaps are supported

I really should make this a function of the personality....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 89e5c8b5 27-Mar-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Make sure QUEUE_FLAG_CLUSTER is set properly for md.

This flag should be set for a virtual device iff it is set for all underlying
devices.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5463c790 27-Mar-2006 Jun'ichi Nomura <j-nomura@ce.jp.nec.com>

[PATCH] dm/md dependency tree in sysfs: md to use bd_claim_by_disk

Use bd_claim_by_disk.

Following symlinks are created if md0 is built from sda and sdb
/sys/block/md0/slaves/sda --> /sys/block/sda
/sys/block/md0/slaves/sdb --> /sys/block/sdb
/sys/block/sda/holders/md0 --> /sys/block/md0
/sys/block/sdb/holders/md0 --> /sys/block/md0

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1312f40e 12-Mar-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] regularize blk_cleanup_queue() use

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8ed75463 03-Feb-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Make sure rdev->size gets set for version-1 superblocks

Sometimes it doesn't so make the code more like the version-0 code which
works.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 29fc7e3e 03-Feb-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Assorted little md fixes

- version-1 superblock
+ The default_bitmap_offset is in sectors, not bytes.
+ the 'size' field in the superblock is in sectors, not KB
- raid0_run should return a negative number on error, not '1'
- raid10_read_balance should not return a valid 'disk' number if
->rdev turned out to be NULL
- kmem_cache_destroy doesn't like being passed a NULL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 284ae7ca 03-Feb-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Handle overflow of mdu_array_info_t->size better

mdu_array_info_t->size is 'int', which isn't big enough for the size (in KB) of
each component in some arrays.

So rather than a random overflow, set size to -1 when it cannot be set
correctly.

To update an aspect of an array, userspace will sometimes:
  get_array_info
  change one field
  set_array_info

In this case, we don't want the '-1' in 'size' to change the size, or look like
a size change at all. So test for that in update_array_info.
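
As an illustration only, the clamping and the round-trip check described above
can be sketched in userspace Python. The helper names and the INT_MAX constant
are assumptions made for this sketch; they are not taken from md.c.

```python
INT_MAX = 2**31 - 1  # largest value a 32-bit signed 'int' can hold


def size_for_array_info(component_kb):
    """Value to report in a 32-bit 'size' field, or -1 if it cannot fit."""
    if component_kb > INT_MAX:
        return -1
    return component_kb


def is_size_change(requested, current_kb):
    """A round-tripped -1 means 'leave the size alone', not a resize."""
    if requested == -1:
        return False
    return requested != current_kb
```

With these helpers, a get/modify/set cycle on a very large array hands back -1
and is not mistaken for a resize request.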

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 978f946b 02-Feb-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Don't remove bitmap from md array when switching to read-only

While a read-only array does not really need a bitmap, we should
not remove the bitmap when switching an array to read-only because
a/ There is no code to re-add the bitmap when switching to read-write,
b/ There is insufficient locking - the bitmap could be accessed while it is
being removed.

Cc: Reuben Farrelly <reuben-lkml@reub.net>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f0ca340c 02-Feb-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Make sure array geometry changes persist with version-1 superblocks

super_1_sync only updates fields in the superblock that might have changed.

'raid_disks' and 'size' could have changed, but this information doesn't get
updated.... until this patch.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6d89332b 02-Feb-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Fix device-size updates in md

As 'array_size' is a 'sector_t', it may overflow inappropriately when shifted
10 bits. So we should cast it to a loff_t first.

There are two places with this problem, but the second (in update_raid_disks)
isn't needed so just remove it:
The only personality that handles ->reshape currently is raid1,
and it doesn't change the size of the array.
When added for raid5/6, reshape again won't change the size of the array,
at least not straight away.
This code might be needed for reshaping 'linear' but linear->shape,
if implemented, should probably do the i_size_write itself.
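
The overflow itself can be shown with a small arithmetic sketch. It assumes a
configuration where sector_t is 32 bits wide and models the cast to loff_t as
a 64-bit shift; the function name is hypothetical.

```python
def shift_kb_to_bytes(array_size_kb, width_bits):
    """Shift a KB count left by 10 bits in an unsigned integer of the given width."""
    mask = (1 << width_bits) - 1
    return (array_size_kb << 10) & mask


eight_gib_kb = 8 * 1024**2                        # 8 GiB expressed in KB
truncated = shift_kb_to_bytes(eight_gib_kb, 32)   # shift done in a 32-bit type
widened = shift_kb_to_bytes(eight_gib_kb, 64)     # value widened to 64 bits first
```

Here the 32-bit shift loses the high bits entirely, while the widened shift
keeps the full byte count.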

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 17115e03 16-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: Clear clevel whenever level is set.

The 'level' of an md array can be set as either a number or a string. When
one is set, the other must be marked 'undefined'. This wasn't being done
in one place: where new arrays are created.

Result: if md1 is a raid1, it is stopped and a raid5 is created there, it
might still appear to be a raid1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1edf80d3 12-Jan-2006 Neil Brown <neilb@suse.de>

[PATCH] md: remove slashes from disk names when creating dev names in sysfs

e.g. The sx8 driver uses names like sx8/0.

This would make a md component dev name like

/sys/block/md0/md/dev-sx8/0

which is not allowed. So we change the '/' to '!' just like
fs/partitions/check.c(register_disk) does.
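
A minimal sketch of the substitution, written in userspace Python purely for
illustration (the function name is an assumption, and the real change lives in
kernel C code):

```python
def sysfs_component_name(disk_name):
    """Build a 'dev-<name>' sysfs entry name, replacing '/' with '!'."""
    return "dev-" + disk_name.replace("/", "!")
```

For the sx8 example above this yields dev-sx8!0, which is a single path
component.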

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1b1dcc1b 09-Jan-2006 Jes Sorensen <jes@sgi.com>

[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem

This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.

Modified-by: Ingo Molnar <mingo@elte.hu>

(finished the conversion)

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# a885c8c4 08-Jan-2006 Christoph Hellwig <hch@lst.de>

[PATCH] Add block_device_operations.getgeo block device method

HDIO_GETGEO is implemented in most block drivers, and all of them have to
duplicate the code to copy the structure to userspace, as well as getting
the start sector. This patch moves that to common code [1] and adds a
->getgeo method to fill out the raw kernel hd_geometry structure. For many
drivers this means ->ioctl can go away now.

[1] the s390 block drivers are odd in this respect. xpram sets ->start
to 4 always which seems more than odd, and the dasd driver shifts
the start offset around, probably because of its non-standard
sector size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@suse.de>
Cc: <mike.miller@hp.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo Giarrusso <blaisorblade@yahoo.it>
Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 88202a0c 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow sync-speed to be controlled per-device

Also export current (average) speed and status in sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6d7ff738 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: support adding new devices to md arrays via sysfs

Writing major:minor to md/new_dev will bind that device to the array.
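
From userspace this might look like the following sketch. The helper names are
hypothetical, and the md sysfs directory (for example /sys/block/md0/md) is
passed in by the caller rather than assumed.

```python
import os


def format_new_dev(major, minor):
    """Format device numbers as the 'major:minor' string written to new_dev."""
    return f"{major}:{minor}"


def new_dev_string_for(dev_path):
    """Look up the device numbers of a block device node such as /dev/sdb."""
    st = os.stat(dev_path)
    return format_new_dev(os.major(st.st_rdev), os.minor(st.st_rdev))


def add_device(md_sysfs_dir, dev_path):
    """Write 'major:minor' for dev_path into <md_sysfs_dir>/new_dev."""
    with open(os.path.join(md_sysfs_dir, "new_dev"), "w") as f:
        f.write(new_dev_string_for(dev_path))
```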

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 83303b61 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow available size of component devices to be set via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6961ece4 06-Jan-2006 Andrew Morton <akpm@osdl.org>

[PATCH] md-export-rdev-data_offset-via-sysfs-fix

drivers/md/md.c: In function `offset_show':
drivers/md/md.c:1670: warning: long long unsigned int format, different type arg (arg 3)

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 93c8cad0 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: export rdev->data_offset via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 014236d2 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: expose device slot information via sysfs

With this, the role that a device has in an array can be viewed and set.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2bf071bf 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: keep better track of dev/array size when assembling md arrays

Move the checks - that dev size is never less than array size - into
bind_rdev_to_array to make sure it always happens properly (there is one place
where currently it doesn't).

Also reject any superblock which claims an array size smaller than the device
in question can hold.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# da943b99 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow md/raid_disks to be settable

If array is active, try to reshape, else just set the value.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4dbcdc75 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: count corrected read errors per drive

Store this total in superblock (As appropriate), and make it available to
userspace via sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d9d166c2 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow array level to be set textually via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8bb93aac 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: expose md metadata format in sysfs

Allow it to be set to a particular version, or 'none'.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a35b0d69 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow md array component size to be accessed and set via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3b34380a 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow chunk_size to be settable through sysfs

... only before array is started of course.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 03c902e1 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: fix rdev->pending counts in raid1

When we do a user-requested check/repair, we lose count of the outstanding
requests...

Also make sure that when anything is written to md/sync_action, the
RECOVERY_NEEDED flag is set and the thread is woken up so any changes take
effect.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 07dbd377 06-Jan-2006 Adrian Bunk <bunk@stusta.de>

[PATCH] drivers/md/md.c: make md_new_event() static

Make the needlessly global function md_new_event() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2989ddbd 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: make a couple of names in md.c static

.. because they aren't used outside md.c

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bce74dac 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: helper function to match commands written to sysfs files

Commands written to sysfs files may, or may not, be \n terminated. We want to
accept either case. For this we use cmd_match.
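
The intended behaviour can be sketched in Python as follows. This illustrates
the matching rule only, under the assumption that at most one trailing newline
is tolerated; it is not the kernel's cmd_match implementation.

```python
def command_matches(written, expected):
    """True if 'written' equals 'expected', optionally followed by one newline."""
    if written.endswith("\n"):
        written = written[:-1]
    return written == expected
```

So both `echo check` (which appends a newline) and `echo -n check` would be
accepted for the command "check".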

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2604b703 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: remove personality numbering from md

md supports multiple different RAID levels, each being implemented by a
'personality' (which is often in a separate module).

These personalities have fairly artificial 'numbers'. The numbers
are used to:
1- provide an index into an array where the various personalities
are recorded
2- identify the module (via an alias) which implements a particular
personality.

Neither of these uses really justify the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely). Module identification can be done using an alias
based on level rather than 'personality' number.

The current 'raid5' modules support two levels (4 and 5) but only one
personality. This slight awkwardness (which was handled in the mapping from
level to personality) can be better handled by allowing raid5 to register 2
personalities.

With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.

This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a8745db2 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: convert recently exported symbol to GPL

...because that seems to be the preferred practice these days.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9ffae0cf 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: convert md to use kzalloc throughout

Replace multiple kmalloc/memset pairs with kzalloc calls.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2d1f3b5d 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: clean up 'page' related names in md

Substitute:

page_cache_get -> get_page
page_cache_release -> put_page
PAGE_CACHE_SHIFT -> PAGE_SHIFT
PAGE_CACHE_SIZE -> PAGE_SIZE
PAGE_CACHE_MASK -> PAGE_MASK
__free_page -> put_page

because we aren't using the page cache, we are just using pages.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d7603b7e 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: make /proc/mdstat pollable

With this patch it is possible to poll /proc/mdstat to detect arrays appearing
or disappearing, to detect failures, recovery starting, recovery completing,
and devices being added and removed.

It is similar to the poll-ability of /proc/mounts, though different in that:

We always report that the file is readable (because face it, it is, even if
only for EOF).

We report POLLPRI when there is a change so that select() can detect
it as an exceptional event. Not only are these exceptional events, but
that is the mechanism that the current 'mdadm' uses to watch for events
(It also polls after a timeout).
(We also report POLLERR like /proc/mounts).

Finally, we only reset the per-file event counter when the start of the file
is read, rather than when poll() returns an event. This is more robust as it
means that an fd will continue to report activity to poll/select until the
program clearly responds to that activity.

md_new_event takes an 'mddev' which isn't currently used, but it will be soon.
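
A monitoring program could use this roughly as sketched below. The function
names and the default timeout are assumptions for illustration; only the
POLLPRI/POLLERR behaviour and the "re-read from the start" rule come from the
description above.

```python
import select


def describe_events(revents):
    """Name the poll flags in revents that matter for /proc/mdstat."""
    names = []
    if revents & select.POLLPRI:
        names.append("POLLPRI")   # md reported a change
    if revents & select.POLLERR:
        names.append("POLLERR")   # also raised on a change, as for /proc/mounts
    return names


def wait_for_md_event(path="/proc/mdstat", timeout_ms=5000):
    """Return the new file contents after an event, or None on timeout."""
    with open(path) as f:
        poller = select.poll()
        poller.register(f.fileno(), select.POLLPRI | select.POLLERR)
        if not poller.poll(timeout_ms):
            return None
        f.seek(0)          # reading from the start acknowledges the event
        return f.read()
```

Registering only POLLPRI and POLLERR matters here, because the file always
reports itself as readable.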

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ddaf22ab 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: attempt to auto-correct read errors in raid1

On a read-error we suspend the array, then synchronously read the block from
other devices until we find one where we can read it. Then we try writing the
good data back everywhere and make sure it works. If any write or subsequent
read fails, only then do we fail the device out of the array.

To be able to suspend the array, we need to also keep track of how many
requests are queued for handling by raid1d.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6cce3b23 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: write intent bitmap support for raid10

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b15c2e57 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: move bitmap_create to after md array has been initialised

This is important because bitmap_create uses
mddev->resync_max_sectors
and that doesn't have a valid value until after the array
has been initialised (with pers->run()).
[It doesn't make a difference for current personalities that
support bitmaps, but will make a difference for raid10]

This has the added advantage of meaning we can move the thread->timeout
manipulation inside the bitmap.c code instead of sprinkling identical code
throughout all personalities.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6ff8d8ec 06-Jan-2006 NeilBrown <neilb@suse.de>

[PATCH] md: allow dirty raid[456] arrays to be started at boot

See patch to md.txt for more details

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bcb97940 19-Dec-2005 Neil Brown <neilb@suse.de>

[PATCH] md: Change case of raid level reported in sys/mdX/md/level

I had thought that keeping the reported raid level clearly different
from the module name was a good idea, but I've changed my mind.

'raid5' is better and probably less confusing than 'RAID-5'.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b2a2703c 28-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: set default_bitmap_offset properly in set_array_info

If an array is created using set_array_info, default_bitmap_offset isn't set
properly meaning that an internal bitmap cannot be hot-added until the array
is stopped and re-assembled.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c0e48521 18-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: fix is_mddev_idle calculation now that disk/sector accounting happens when request completes

md needs to monitor the rate of requests to its devices when doing
resync/recovery so that it can back-off when there is non-resync IO. It
does this by comparing resync IO, which it counts, with total IO which is
taken from disk_stats.

disk_stats were recently changed to account sectors when a request
completes instead of when it is queued. This upsets md's calculations.

We could do the sync_io accounting at the end of requests too, but that has
problems. If an underlying device is an md array, the accounting will
still be done when the request is submitted. This could be changed for
some raid levels, but it cannot be changed for raid0 or linear without
substantial code changes.

So instead, we increase the error that is_mddev_idle allows, up to the
maximum amount of resync IO that can be in flight at any time. The
calculation is currently fragile as each personality has different limits for
in-flight resync. This should be fixed up.

For now, this simple patch fixes the problem.

Increasing the error margin decreases the sensitivity to non-resync IO. To
partially compensate for this, the time to wait when non-resync IO is
detected is increased so that less steady IO is required to keep the resync
at bay.
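
The idea of an error margin can be sketched as below. This is a simplified
model with hypothetical names and parameters, not the actual is_mddev_idle
code: it subtracts md's own resync count from the disk totals and only reports
activity when the difference has moved by more than the allowed margin.

```python
def device_is_idle(total_sectors, sync_sectors, last_events, margin):
    """Return (idle, new_last_events) for one component device.

    total_sectors: sectors accounted in the disk statistics (at completion)
    sync_sectors:  sectors md itself issued for resync (counted at submission)
    last_events:   non-resync count remembered from the previous check
    margin:        tolerated error, e.g. the resync IO that may be in flight
    """
    curr_events = total_sectors - sync_sectors
    if abs(curr_events - last_events) > margin:
        return False, curr_events   # real non-resync IO was seen
    return True, last_events        # within the margin: treat as idle
```

Because resync IO is counted when submitted but the totals only move on
completion, the difference can briefly run negative; the margin absorbs that.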

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 93588e22 15-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: make md threads interruptible again

Despite the fact that md threads don't need to be signalled, and won't
respond to signals anyway, we need to have an 'interruptible' wait, else
they stay in 'D' state and add to the load average.

(akpm: the signal_pending() test is unneeded - we'll fix that up in the next
round. For now, leave it there because that's how the code used to be).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e8a00334 15-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: mark START_ARRAY deprecated with a date

This was marked deprecated "after 2.6" back in the 2.5 days. But now it
seems there isn't going to be any "after 2.6", and we deprecate by date
now. So set a date.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bb636547 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: document sysfs usage of md, and make a couple of small refinements

Document in Documentation/md.txt the files that now appear in sysfs, and make
a couple of small refinements to exactly when 'level' and 'raid_disks' are
empty, to make it match the documentation.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7eec314d 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: improve 'scan_mode' and rename it to 'sync_action'

The current sync_action for an array can be one of

idle - nothing happening
resync - redundancy being recalculated
recover - missing device being recovered to spare
check - user initiated check of redundancy
repair - like resync but user-initiated and ignores
bitmap optimisation.

Each of these strings can also be written to the 'sync_action' file to cause
that action to happen (if appropriate).
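
A userspace sketch of reading and writing this file follows. The helper names
are hypothetical, and the md sysfs directory (for example /sys/block/md0/md)
is supplied by the caller; the list of actions is the one given above.

```python
import os

SYNC_ACTIONS = ("idle", "resync", "recover", "check", "repair")


def read_sync_action(md_dir):
    """Return the current action string from <md_dir>/sync_action."""
    with open(os.path.join(md_dir, "sync_action")) as f:
        return f.read().strip()


def request_sync_action(md_dir, action):
    """Write one of the known action strings to <md_dir>/sync_action."""
    if action not in SYNC_ACTIONS:
        raise ValueError(f"unknown sync action: {action!r}")
    with open(os.path.join(md_dir, "sync_action"), "w") as f:
        f.write(action + "\n")
```

Whether the kernel acts on a request still depends on the array state, as the
"if appropriate" above indicates.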

While 'sync' is not technically correct, as a recovery is *not* a 'sync', I
think it is the most serviceable word here. Also 'action' is a stronger word
than 'mode'.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 787453c2 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: complete conversion of md to use kthreads

There are a few loose ends following the conversion of md to use kthreads:

- Some fields in mdk_thread_t that aren't needed (kthreads does its own
completion and manages its own name).

- thread->run is now never NULL, so no need to check

- Some tests for signal_pending that aren't needed (As we don't use signals
to stop threads any more)

- Some flush_signals are not needed

- Some waits are interruptible and don't need to be.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fd9d49ca 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: ignore auto-readonly flag for arrays where it isn't meaningful

The 'auto-readonly' flag (which suppresses resync and superblock updates until
the first write) is not meaningful for personalities that don't support resync
or superblock writes (raid0, linear, etc).

So clear the setting early to avoid it confusing anything - e.g. appearing in
/proc/mdstat

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8e1b39d6 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: only try to print recovery/resync status for personalities that support recovery

The introduction of 'resync=PENDING' (for read-only devices) caused that
message to appear for non-syncable arrays like raid0 and linear. Simplest
thing is to not try to print any resync info unless the personality clearly
supports it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 411036fa 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: split off some md attributes in sysfs to a separate group

Some, but not all, md arrays support data redundancy and hence support checking
and restoring that redundancy (resync, rebuild).

Some attributes apply specifically to functions involving this redundancy, and
so should only appear for md arrays for which they are meaningful. i.e. they
should not appear for raid0, linear, multipath, faulty.

This patch separates these into a distinct group and creates the group only if
the personality supports sync_request.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 96de1e66 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: fix some locking and module refcounting issues with md's use of sysfs

1/ I really should be using the __ATTR macros for defining attributes, so
that the .owner field gets set properly, otherwise modules can be removed
while sysfs files are open. This also involves some name changes of _show
routines.

2/ Always lock the mddev (against reconfiguration) for all sysfs attribute
access. This easily avoids certain races and is completely consistent with
other interfaces (ioctl and /proc/mdstat both always lock against
reconfiguration).

3/ raid5 attributes must check that the 'conf' structure actually exists
(the array could have been stopped while an attribute file was open).

4/ A missing 'kfree' from when the raid5_conf_t was converted to have a
kobject embedded, and then converted back again.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f637b9f9 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: make sure /block link in /sys/.../md/ goes to correct devices

If a block_device is a partition, then its kobject is
bdev->bd_part->kobj
otherwise (if it is a full device), the kobject is
bdev->bd_disk->kobj

As md wants back-links to the correct object (whether partition or not), we
need to respect this difference... (Thus current code shows a link to the
whole device, whether we are using a partition or not, which is wrong).

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f91de92e 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: allow md arrays to be started read-only (module parameter).

When an md array is started, the superblock will be written, and resync may
commence. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.

So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro

When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.

The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 19133a42 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: Remove attempt to use dynamic names in sysfs for component devices on an MD array.

With a version-0.90 superblock, component devices on an md device do not have
any stable name related to the array (version-1 assigns a fixed index when
a device is added to an array, and this remains despite any hot-swap).

The initial code for making these devices appear in sysfs used dynamic
names, which would change whenever a hot-spare was swapped for a failed or
missing device. This turns out not to be practical in sysfs for a number
of reasons.

This patch changes the naming of component devices to be based on the
result of 'bdevname'. This is stable and should be unique.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a9701a30 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: support BIO_RW_BARRIER for md/raid1

We can only accept BARRIER requests if all slaves handle
barriers, and that can, of course, change with time....

So we keep track of whether the whole array seems safe for barriers,
and also whether each individual rdev handles barriers.

We initially assume barriers are OK.

When writing the superblock we try a barrier, and if that fails, we flag
things for no-barriers. This will usually clear the flags fairly quickly.

If writing the superblock finds that BIO_RW_BARRIER is -ENOTSUPP, we need to
resubmit, so introduce function "md_super_wait" which waits for requests to
finish, and retries ENOTSUPP requests without the barrier flag.

When writing the real raid1, write requests which were BIO_RW_BARRIER but
which aren't supported need to be retried. So raid1d is enhanced to do this,
and when any bio write completes (i.e. no retry needed) we remove it from the
r1bio, so that devices needing retry are easy to find.

We should hardly ever get -ENOTSUPP errors when writing data to the raid.
It should only happen if:
1/ the device used to support BARRIER, but now doesn't. Few devices
change like this, though raid1 can!
or
2/ the array has no persistent superblock, so there was no opportunity to
pre-test for barriers when writing the superblock.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bd926c63 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: make md on-disk bitmaps not host-endian

Current bitmaps use set_bit et al. and so are host-endian, which means
not-portable. Oops.

Define a new version number (4) for which bitmaps are little-endian.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b2d444d7 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: convert 'faulty' and 'in_sync' fields to bits in 'flags' field

This has the advantage of removing the confusion caused by 'rdev_t' and
'mddev_t' both having 'in_sync' fields.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ba22dcbf 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: improvements to raid5 handling of read errors

Two refinements to the 'attempt-overwrite-on-read-error' mechanism.
1/ If the array is read-only, don't attempt an over-write.
2/ If there are more than max_nr_stripes read errors on a device with
no success, fail the drive. This will make sure a dead
drive will be eventually kicked even when we aren't trying
to rewrite (which would normally kick a dead drive more quickly).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 007583c9 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: change raid5 sysfs attribute to not create a new directory

There isn't really a need for raid5 attributes to be in a subdirectory,
so this patch moves them from
/sys/block/mdX/md/raid5/attribute
to
/sys/block/mdX/md/attribute

This suggests that all md personalities should co-operate about
namespace usage, but that shouldn't be a problem.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 31399d9e 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: minor MD fixes

1/ Reduce stack usage, because 'gcc' apparently doesn't overlay
different variables that are in separate scopes...

2/ Use test_bit instead of ( .. & 1<< ..) which in this case is buggy.

Thanks to Andrew Morton

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9c791977 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: fix ref-counting problems with kobjects in md

Thanks Greg.

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9d88883e 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: teach raid5 the difference between 'check' and 'repair'.

With this, raid5 can be asked to check parity without repairing it. It also
keeps a count of the number of incorrect parity blocks found (mismatches) and
reports them through sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 24dd469d 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: allow a manual resync with md

You can trigger a 'check' with
echo check > /sys/block/mdX/md/scan_mode
or a check-and-repair pass with
echo repair > /sys/block/mdX/md/scan_mode

and read the current state from the same file.

Note: personalities need to know the difference between 'check' and 'repair',
but don't yet. Until they do, 'check' will be the same as 'repair' and will
just do a normal resync pass.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
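
The echo commands in 24dd469d have a straightforward C equivalent. The following is a hedged sketch: md_request_scan is a hypothetical helper, and the attribute name scan_mode is taken from the commit message above and may differ in other kernel versions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "check" or "repair" to an attribute file
 * such as /sys/block/mdX/md/scan_mode, followed by a newline as echo does.
 * Returns 0 on success and -1 on an invalid mode or an I/O error. */
int md_request_scan(const char *attr_path, const char *mode)
{
	FILE *f;
	int ok;

	/* Only the two modes named in the commit message are accepted. */
	if (strcmp(mode, "check") != 0 && strcmp(mode, "repair") != 0)
		return -1;

	f = fopen(attr_path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%s\n", mode) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass the full attribute path for the array in question, for example the scan_mode file under that array's md directory.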


# 86e6ffdd 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: extend md sysfs support to component devices.

Each device in an md array now has a corresponding
/sys/block/mdX/md/devNN/
directory which can contain attributes. Currently there is only 'state' which
summarises the state, and 'super' which has a copy of the superblock, and
'block' which is a symlink to the block device.

Also, /sys/block/mdX/md/rdNN represents slot 'NN' in the array, and is a
symlink to the relevant 'devNN'. Obviously spare devices do not have a slot
in the array, and so don't have such a symlink.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# eae1701f 08-Nov-2005 NeilBrown <neilb@suse.de>

[PATCH] md: initial sysfs support for md

Start using kobjects in mddevs, and provide a couple of simple attributes
(level and disks). Attributes live in
/sys/block/mdX/md/attr-name

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a362357b 01-Nov-2005 Jens Axboe <axboe@suse.de>

[BLOCK] Unify the separate read/write io stat fields into arrays

Instead of having ->read_sectors and ->write_sectors, combine the two
into ->sectors[2] and similar for the other fields. This saves a branch
in several places in the io path, since we don't have to care what the
actual io direction is. On my x86-64 box, that's 200 bytes less text in
just the core (not counting the various drivers).

Signed-off-by: Jens Axboe <axboe@suse.de>
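
The idea behind a362357b can be sketched outside the kernel. This is an illustration under stated assumptions, not the block layer's code: struct io_stats, enum io_dir and account_io are hypothetical names, and only the ->sectors[2] layout comes from the commit message.

```c
#include <assert.h>

/* Hypothetical direction index: 0 for reads, 1 for writes. */
enum io_dir { IO_READ = 0, IO_WRITE = 1 };

/* Hypothetical stand-in for the per-disk statistics described above. */
struct io_stats {
	unsigned long ios[2];      /* completed requests, indexed by direction */
	unsigned long sectors[2];  /* sectors transferred, indexed by direction */
};

/* One indexed update replaces an if/else on the I/O direction. */
void account_io(struct io_stats *st, enum io_dir dir, unsigned long nr_sectors)
{
	assert(dir == IO_READ || dir == IO_WRITE);
	st->ios[dir]++;
	st->sectors[dir] += nr_sectors;
}
```

With separate read_sectors and write_sectors fields, the same function would need a branch on the direction for every counter it updates.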


# 8712e553 26-Oct-2005 NeilBrown <neilb@suse.de>

[PATCH] md: make sure mdthreads will always respond to kthread_stop

There are still a couple of cases where an md thread (the resync/recovery
thread) is not interruptible since the change to use kthreads. In all places
where it tests "signal_pending", it should also test kthread_should_stop,
as with this patch.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6985c43f 19-Oct-2005 NeilBrown <neilb@suse.de>

[PATCH] Three one-liners in md.c

The main problem fixed is that in certain situations stopping md arrays may
take longer than you expect, or may require multiple attempts. This would
only happen when resync/recovery is happening.

This patch fixes three vaguely related bugs.

1/ The recent change to use kthreads got the setting of the
process name wrong. This fixes it.
2/ The recent change to use kthreads lost the ability for
md threads to be signalled with SIGKILL. This restores that.
3/ There is a long standing bug in that if:
- An array needs recovery (onto a hot-spare) and
- The recovery is being blocked because some other array being
recovered shares a physical device and
- The recovery thread is killed with SIGKILL
Then the recovery will appear to have completed with no IO being
done, which can cause data corruption.
This patch makes sure that incomplete recovery will be treated as
incomplete.

Note that any kernel affected by bug 2 will not suffer the problem of bug
3, as the signal can never be delivered. Thus the current 2.6.14-rc
kernels are not susceptible to data corruption. Note also that if arrays
are shutdown (with "mdadm -S" or "raidstop") then the problem doesn't
occur. It only happens if a SIGKILL is independently delivered as done by
'init' when shutting down.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 338cec32 10-Sep-2005 Adrian Bunk <bunk@stusta.de>

[PATCH] merge some from Rusty's trivial patches

This patch contains the most trivial from Rusty's trivial patches:
- spelling fixes
- remove duplicate includes

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 61181565 09-Sep-2005 NeilBrown <neilb@suse.de>

[PATCH] md: really get sb_size setting right in all cases

There was another case where sb_size wasn't being set, so instead do the
sensible thing and set it when filling in the content of a superblock. That
ensures that whenever we write a superblock, the sb_size MUST be set.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 188c18fd 09-Sep-2005 NeilBrown <neilb@suse.de>

[PATCH] md: make sure the new 'sb_size' is set properly when a device is added without pre-existing superblock.

There are two ways to add devices to an md/raid array.

It can have a superblock written to it, and then be given to the md driver,
which will read the superblock (the new way)

or

md can be told (through SET_ARRAY_INFO) the shape of the array, and
then told about individual drives, and md will create the required
superblock (the old way).

The newly introduced sb_size was only set for drives being added the
new way, not the old way. Oops :-(

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b325a32e 09-Sep-2005 NeilBrown <neilb@suse.de>

[PATCH] md: report spare drives in /proc/mdstat

Just like failed drives have (F), so spare drives now have (S).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1cd6bf19 09-Sep-2005 NeilBrown <neilb@suse.de>

[PATCH] md: add information about superblock version to /proc/mdstat

Leave it unchanged if the original (0.90) is used, in case it might be a
compatibility problem.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 720a3dc3 09-Sep-2005 NeilBrown <neilb@suse.de>

[PATCH] md: use queue_hardsect_size instead of block_size for md superblock size calc.

Doh. I want the physical hard-sector-size, not the current block size...

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 53e87fbb 09-Sep-2005 NeilBrown <neilb@suse.de>

[PATCH] md: choose better default offset for bitmap.

On reflection, a better default location for hot-adding bitmaps with version-1
superblocks is immediately after the superblock. There might not be much room
there, but there is usually at least 3k, and that is a good start.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a6fb0934 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: use kthread infrastructure in md

Switch MD to use the kthread infrastructure, to simplify the code and get rid
of tasklist_lock abuse in md_unregister_thread.

Also don't flush signals in md_thread, as the called thread will always do
that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 934ce7c8 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: write-intent bitmap support for raid6

This is a direct port of the raid5 patch.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 72626685 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: add write-intent-bitmap support to raid5

Most awkward part of this is delaying write requests until bitmap updates have
been flushed.

To achieve this, we have a sequence number (seq_flush) which is incremented
each time the raid5 is unplugged.

If the raid thread notices that this has changed, it flushes bitmap changes,
and assigns the value of seq_flush to seq_write.

When a write request arrives, it is given the number from seq_write, and that
write request may not complete until seq_flush is larger than the saved seq
number.

We have a new queue for storing stripes which are waiting for a bitmap flush
and an extra flag for stripes to record if the write was 'degraded' and so
should not clear a bit in the bitmap.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0002b271 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: limit size of sb read/written to appropriate amount

version-1 superblocks are not (normally) 4K long, and can be of variable size.
Writing the full 4K can cause corruption (but only in non-default
configurations).

With this patch the super-block-flavour can choose a size to read, and set a
size to write based on what it finds.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 71c0805c 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: allow md to load a superblock with feature-bit '1' set

As this is used to flag an internal bitmap.

Also, introduce symbolic names for feature bits.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7b1e35f6 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: allow hot-adding devices to arrays with non-persistent superblocks.

It is possible (and occasionally useful) to have a raid1 without persistent
superblocks. The code in add_new_disk for adding a device to such an array
always tries to read a superblock.

This will obviously fail.

So do the appropriate test and call md_import_device with
appropriate args.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8ddf9efe 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: support write-mostly device in raid1

This allows a device in a raid1 to be marked as "write mostly". Read requests
will only be sent if there is no other option.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 36fa3063 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: allow hot-add and hot-remove of md intent logging bitmaps

Both file-bitmaps and superblock bitmaps are supported.

If you add a bitmap file on the array device, you lose.

This introduces a 'default_bitmap_offset' field in mddev, as the ioctl used
for adding a superblock bitmap doesn't have room for giving an offset. Later,
this value will be settable via sysfs.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1923b99a 09-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: don't allow new md/bitmap file to be set if one already exists

... otherwise we lose a reference and can never free the file.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 87162a28 09-Sep-2005 Al Viro <viro@ZenIV.linux.org.uk>

[PATCH] trivial __user annotations (md)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 657390d2 26-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: clear the 'recovery' flags when starting an md array.

It's possible for this to still have flags in it after a previous instance
has been stopped, and that confuses the new array using the same mddev.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 72008652 26-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: create a MODULE_ALIAS for md corresponding to its block major number.

I just discovered this is needed for module auto-loading.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 005eca5e 22-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: make sure resync gets started when array starts.

We weren't actually waking up the md thread after setting
MD_RECOVERY_NEEDED when assembling an array, so it is possible to lose a
race and not actually start resync.

So add a call to md_wakeup_thread, and while we are at it, remove all the
"if (mddev->thread)" guards as md_wakeup_thread does its own checking.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9223214e 18-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: make sure mddev->bitmap_offset gets cleared between array instantiations.

... otherwise we might try to load a bitmap from an array which hasn't one.

The bug is that if you create an array with an internal bitmap, shut it down,
and then create an array with the same md device, the md driver will assume it
should have a bitmap too. As the array can be created with a different md
device, it is mostly an inconvenience. I'm pretty sure there is no risk of
data corruption.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6b8b3e8a 04-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: make sure md bitmap updates are flushed when array is stopped.

The recent change to never ignore the bitmap revealed that the bitmap isn't
being flushed properly when an array is stopped.

We call bitmap_daemon_work three times as there is a three-stage pipeline for
flushing updates to the bitmap file.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# aa1595e9 04-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: make 'md' an alias for 'md-mod'

Until the bitmap code was added,

modprobe md

would load the md module. But now the md module is called 'md-mod', so we
really need an alias for backwards compatibility.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# efd8be2a 04-Aug-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: remove a stray debugging printk.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 77933d72 27-Jul-2005 Jesper Juhl <juhl@dif.dk>

[PATCH] clean up inline static vs static inline

`gcc -W' likes to complain if the static keyword is not at the beginning of
the declaration. This patch fixes all remaining occurrences of "inline
static" up with "static inline" in the entire kernel tree (140 occurrences in
47 files).

While making this change I came across a few lines with trailing whitespace
that I also fixed up, I have also added or removed a blank line or two here
and there, but there are no functional changes in the patch.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f8b58edf 27-Jun-2005 Neil Brown <neilb@cse.unsw.edu.au>

[PATCH] md: bio leak fix

Insert a missing bio_put when writing the md superblock.

Without this we have a steady growth in the "bio" slab.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3e1d1d28 25-Jun-2005 Christoph Lameter <christoph@lameter.com>

[PATCH] Cleanup patch for process freezing

1. Establish a simple API for process freezing defined in linux/include/sched.h:

frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now

2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h

3. Fix numerous locations where try_to_freeze is manually done by a driver

4. Remove the argument that is no longer necessary from two function calls.

5. Some whitespace cleanup

6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).

This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 990a8baf 21-Jun-2005 Jesper Juhl <juhl-lkml@dif.dk>

[PATCH] md: remove unneeded NULL checks before kfree

This patch removes some unneeded checks of pointers being NULL before
calling kfree() on them. kfree() handles NULL pointers just fine, checking
first is pointless.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 39730960 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] Two small fixes for md version-1 superblocks.

1/ Must typecast int to (sector_t) before inverting or we
might not invert enough bits.

2/ When "bitmap_offset" was added to mdp_superblock_1, we didn't increase
the count of words-used (96 to 100).

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
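
Point 1/ of 39730960 is easiest to see with a 64-bit offset and a 32-bit mask. The sketch below is an illustration, not the code from the patch: it assumes a 64-bit sector_t, a 32-bit unsigned int, and an unsigned mask; round_down_narrow and round_down_wide are hypothetical names.

```c
#include <stdint.h>

typedef uint64_t sector_t;   /* assumption: a 64-bit sector type */

/* Mask inverted in 32 bits: ~mask is 0xFFFFFFF8, which zero-extends to
 * 64 bits and therefore also clears the upper half of the offset. */
sector_t round_down_narrow(sector_t offset)
{
	uint32_t mask = 4 * 2 - 1;

	return offset & ~mask;
}

/* Mask widened to sector_t before inverting: all 64 bits take part,
 * so only the low three bits of the offset are cleared. */
sector_t round_down_wide(sector_t offset)
{
	uint32_t mask = 4 * 2 - 1;

	return offset & ~(sector_t)mask;
}
```

For offsets below 4G sectors the two functions agree, which is why a bug of this kind can go unnoticed on small devices.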


# 7bfa19f2 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: allow md to update multiple superblocks in parallel.

Currently, md updates all superblocks (one on each device) in series. It
waits for one write to complete before starting the next. This isn't a big
problem as superblock updates don't happen that often.

However it is neater to do it in parallel, and if the drives in the array have
gone to "sleep" after a period of idleness, then waking them in parallel is
faster (and someone else should be worrying about power drain).

Further, we will need parallel superblock updates for a future patch which
keeps the intent-logging bitmap near the superblock.

Also remove the silly code that retried superblock updates 100 times. This
simply never made sense.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a654b9d8 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: allow md intent bitmap to be stored near the superblock.

This provides an alternative to storing the bitmap in a separate file. The
bitmap can be stored at a given offset from the superblock. Obviously the
creator of the array must make sure this doesn't intersect with data....
Placing it after the superblock is good for version-0.90 superblocks.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3d310eb7 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: fix deadlock due to md thread processing delayed requests.

Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.

The current code schedules this up and queues the write request for later
handling by the md_thread.

However some personalities (Raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.

So this patch changes things so the process submitting the request waits
for the superblock to be written and then submits the request itself.

This fixes a recently-created deadlock in raid5/raid6

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 41158c7e 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: optimise reconstruction when re-adding a recently failed drive.

When an array is degraded, bits in the intent-bitmap are never cleared. So if
a recently failed drive is re-added, we only need to reconstruct the blocks
that are still reflected in the bitmap.

This patch adds support for this re-adding.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5f40402d 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: call bitmap_daemon_work regularly

bitmap_daemon_work clears bits in the bitmap for blocks that haven't been
written to for a while. It needs to be called regularly to make sure the
bitmap doesn't end up full of ones .... but it wasn't.

So call it from the increasingly-inaptly-named md_check_recovery

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 78d742d8 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: a couple of tidyups relating to the bitmap file.

1/ When init from disk, it is a BUG if there is nowhere
to init from,
2/ use seq_path to print path in /proc/mdstat

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 32a7627c 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: optimised resync using Bitmap based intent logging

With this patch, the intent to write to some block in the array can be logged
to a bitmap file. Each bit represents some number of sectors and is set
before any update happens, and only cleared when all writes relating to all
sectors are complete.

After an unclean shutdown, information in this bitmap can be used to optimise
resync - only sectors which could be out-of-sync need to be updated.

Also if a drive is removed and then added back into an array, the recovery can
make use of the bitmap to optimise reconstruction. This is not implemented in
this patch.

Currently the bitmap is stored in a file which must (obviously) be stored on a
separate device.

The patch only provides infrastructure. It does not update any personalities
to use bitmap intent logging.

Md arrays can still be used with no bitmap file. This patch has minimal
impact on such arrays.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
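
The intent-logging rule described in 32a7627c (set the bit before any update, clear it only when all writes to the covered sectors are complete) can be modelled in a few lines. This is a toy sketch under stated assumptions, not the kernel's bitmap code: struct intent_log, the helper names, CHUNK_SECTORS and NR_CHUNKS are all hypothetical, and the bit is cleared immediately here, whereas a real implementation may clear it later.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SECTORS 1024u   /* assumption: sectors covered by one bit */
#define NR_CHUNKS     64u     /* assumption: size of the toy array */

/* Hypothetical toy model of a write-intent log. */
struct intent_log {
	uint64_t bits;                 /* one dirty bit per chunk (the on-disk part) */
	unsigned pending[NR_CHUNKS];   /* in-flight writes per chunk (in memory) */
};

/* Call before issuing a write: mark the chunk dirty first. */
void log_start_write(struct intent_log *log, uint64_t sector)
{
	uint64_t chunk = sector / CHUNK_SECTORS;

	assert(chunk < NR_CHUNKS);
	log->bits |= UINT64_C(1) << chunk;
	log->pending[chunk]++;
}

/* Call when a write completes: clear the bit once nothing is in flight. */
void log_end_write(struct intent_log *log, uint64_t sector)
{
	uint64_t chunk = sector / CHUNK_SECTORS;

	assert(chunk < NR_CHUNKS && log->pending[chunk] > 0);
	if (--log->pending[chunk] == 0)
		log->bits &= ~(UINT64_C(1) << chunk);
}

/* After an unclean shutdown only chunks whose bit is set need resync. */
int chunk_needs_resync(const struct intent_log *log, uint64_t chunk)
{
	assert(chunk < NR_CHUNKS);
	return (int)((log->bits >> chunk) & 1);
}
```

If the system stops between log_start_write and the matching log_end_write, the chunk's bit is still set, so a resync that visits only set bits covers every region that could be out of sync.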


# 57afd89f 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: improve the interface to sync_request

1/ change the return value (which is number-of-sectors synced)
from 'int' to 'sector_t'.
The number of sectors is usually easily small enough to fit
in an int, but if resync needs to abort, it may want to return
the total number of remaining sectors, which could be large.
Also errors cannot be returned as negative numbers now, so use
0 instead
2/ Add a 'skipped' return parameter to allow the array to report
that it skipped the sectors. This allows md to take this into account
in the speed calculations.
Currently there is no important skipping, but the bitmap-based-resync
that is coming will use this.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 06d91a5f 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: improve locking on 'safemode' and move superblock writes

When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which could cause
problems later.

With this patch, the superblock write is always done by the helper thread, and
write requests are delayed until that write completes.

Also, the locking around marking the array dirty and writing the superblock is
improved to avoid possible races.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fca4d848 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: merge md_enter_safemode into md_check_recovery

md_enter_safemode checks if it is time to mark the md superblock as 'clean'.
i.e. if all writes have completed and a suitable delay has passed.

This is currently called from md_handle_safemode which in-turn is called
(almost) every time md_check_recovery is called, and from the end of
md_do_sync which causes the mddev->thread to run, which will always call
md_check_recovery as well.

So it doesn't need to be a separate function and fits quite well into
md_check_recovery.

The "almost" is because multipathd calls md_check_recovery but not
md_handle_safemode. This is OK because the code from md_enter_safemode is a
no-op if mddev->safemode == 0, which it always is for a multipathd (providing
we don't allow it to be set to 2 on a signal...)

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c361777f 21-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: make sure recovery happens when add_new_disk is used for hot_add

Currently if add_new_disk is used to hot-add a drive to a degraded array,
recovery doesn't start ... because we didn't tell it to.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 75c96f85 05-May-2005 Adrian Bunk <bunk@stusta.de>

[PATCH] make some things static

This patch makes some needlessly global identifiers static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Arjan van de Ven <arjanv@infradead.org>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a757e64c 16-Apr-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: remove a number of misleading calls to MD_BUG

The conditions that cause these calls to MD_BUG are not kernel bugs, just
oddities in what userspace is asking for.

Also convert analyze_sbs to return void, as the value it returned was
always 0.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d28446fe 16-Apr-2005 NeilBrown <neilb@cse.unsw.edu.au>

[PATCH] md: close a small race in md thread deregistration

There is a tiny race when de-registering an MD thread, in that the thread
could disappear before it is sent a SIGKILL, causing send_sig to have
problems.

This is most easily closed by holding tasklist_lock between enabling the
thread to exit (setting ->run to NULL) and telling it to exit.

(akpm: ick. Needs to use kthread API and stop using signals)

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# baaa2c51 16-Apr-2005 Neil Brown <neilb@cse.unsw.edu.au>

[PATCH] Avoid deadlock in sync_page_io by using GFP_NOIO

..as sync_page_io can be called on the write-out path.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!