History log of /linux-master/drivers/md/md.h
Revision Date Author Comments
# 396799eb 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: remove mddev->queue

Just use the request_queue from the gendisk pointer in the relatively
few places that sill need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de


# e305fce1 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add queue limit helpers

Add a few helpers that wrap the block queue limits API for use in MD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-5-hch@lst.de


# 176df894 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add a mddev_is_dm helper

Add a helper to check for a DM-mapped MD device instead of using
the obfuscated ->gendisk or ->queue NULL checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de


# 28be4fd3 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add a mddev_add_trace_msg helper

Add a small wrapper around blk_add_trace_msg that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de


# c396b90e 03-Mar-2024 Christoph Hellwig <hch@lst.de>

md: add a mddev_trace_remap helper

Add a helper to trace bio remapping that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de


# 41425f96 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com


# 5625ff8b 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid: add a new helper prepare_suspend() in md_personality

There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com


# 503f9d43 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md: add a new helper reshape_interrupted()

The helper will be used for dm-raid456 later to detect the case that
reshape can't make progress.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-5-yukuai1@huaweicloud.com


# 314e9af0 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md: export helper md_is_rdwr()

There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com


# 7a2347e2 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md: export helpers to stop sync_thread

Add new helpers:

void md_idle_sync_thread(struct mddev *mddev);
void md_frozen_sync_thread(struct mddev *mddev);
void md_unfrozen_sync_thread(struct mddev *mddev);

The helpers will be used in dm-raid in later patches to fix regressions
and prevent calling md_reap_sync_thread() directly.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com


# 2c27d09d 29-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md/raid1: record nonrot rdevs while adding/removing rdevs to conf

For raid1, each read will iterate all the rdevs from conf and check if
any rdev is non-rotational, then choose rdev with minimal IO inflight
if so, or rdev with closest distance otherwise.

Disk nonrot info can be changed through sysfs entry:

/sys/block/[disk_name]/queue/rotational

However, consider that this should only be used for testing, and user
really shouldn't do this in real life. Record the number of non-rotational
disks in conf, to avoid checking each rdev in IO fast path and simplify
read_balance() a little bit.

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com


# 3a0f007b 29-Feb-2024 Yu Kuai <yukuai3@huawei.com>

md: add a new helper rdev_has_badblock()

The current api is_badblock() must pass in 'first_bad' and
'bad_sectors', however, many caller just want to know if there are
badblocks or not, and these caller must define two local variable that
will never be used.

Add a new helper rdev_has_badblock() that will only return if there are
badblocks or not, remove unnecessary local variables and replace
is_badblock() with the new helper in many places.

There are no functional changes, and the new helper will also be used
later to refactor read_balance().

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com


# a28d893e 23-Jan-2024 Christian Brauner <brauner@kernel.org>

md: port block device access to file

Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-4-adbd023e19cc@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# c891f1fd 25-Nov-2023 Yu Kuai <yukuai3@huawei.com>

md: remove flag RemoveSynchronized

rcu is not used correctly here, because synchronize_rcu() is called
before replacing old value, for example:

remove_and_add_spares // other path
synchronize_rcu
// called before replacing old value
set_bit(RemoveSynchronized)
rcu_read_lock()
rdev = conf->mirros[].rdev
pers->hot_remove_disk
conf->mirros[].rdev = NULL;
if (!test_bit(RemoveSynchronized))
synchronize_rcu
/*
* won't be called, and won't wait
* for concurrent readers to be done.
*/
// access rdev after remove_and_add_spares()
rcu_read_unlock()

Fortunately, there is a separate rcu protection to prevent such rdev
to be freed:

md_kick_rdev_from_array //other path
rcu_read_lock()
rdev = conf->mirros[].rdev
list_del_rcu(&rdev->same_set)

rcu_read_unlock()
/*
* rdev can be removed from conf, but
* rdev won't be freed.
*/
synchronize_rcu()
free rdev

Hence remove this useless flag and prepare to remove rcu protection to
access rdev from 'conf'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com


# 78b7b13f 16-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: cleanup pers->prepare_suspend()

pers->prepare_suspend() is not used anymore and can be removed.

Reverts following three commit:

- commit 431e61257d63 ("md: export md_is_rdwr() and is_md_suspended()")
- commit 3e00777d5157 ("md: add a new api prepare_suspend() in
md_personality")
- commit 868bba54a3bc ("md/raid5: fix a deadlock in the case that reshape
is interrupted")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231016100240.540474-1-yukuai1@huaweicloud.com


# 2b16a525 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: rename __mddev_suspend/resume() back to mddev_suspend/resume()

Now that the old apis are removed, __mddev_suspend/resume() can be
renamed to their original names.

This is done by:

sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch]
sed -i "s/__mddev_resume/mddev_resume/g" *.[ch]

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com


# 4717c028 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: remove old apis to suspend the array

Now that mddev_suspend() and mddev_resume() is not used anywhere, remove
them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-19-yukuai1@huaweicloud.com


# b4128c00 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: cleanup mddev_create/destroy_serial_pool()

Now that except for stopping the array, all the callers already suspend
the array, there is no need to suspend anymore, hence remove the second
parameter.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com


# f45461e2 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: add new helpers to suspend/resume and lock/unlock array

The new helpers suspend the array first and then lock the array,

Prepare to refactor from:

mddev_lock/lock_nointr
mddev_suspend
...
mddev_resuem
mddev_lock

With:

mddev_suspend_and_lock/lock_nointr
...
mddev_unlock_and_resume

After all the use cases is refactored, mddev_suspend/resume() will be
removed.

And mddev_suspend_and_lock() will also replace mddev_lock() for the case
that the array will be reconfigured, in order to synchronize with io to
prevent problems in many corner cases.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-6-yukuai1@huaweicloud.com


# 714d2015 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: add new helpers to suspend/resume array

Advantages for new apis:
- reconfig_mutex is not required;
- the weird logical that suspend array hold 'reconfig_mutex' for
mddev_check_recovery() to update superblock is not needed;
- the specail handling, 'pers->prepare_suspend', for raid456 is not
needed;
- It's safe to be called at any time once mddev is allocated, and it's
designed to be used from slow path where array configuration is changed;
- the new helpers is designed to be called before mddev_lock(), hence
it support to be interrupted by user as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com


# b8494823 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: initialize 'writes_pending' while allocating mddev

Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleing mddev, instead of pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' to be
initialized multiple times, this usage is safe but a litter weird.

On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in common layer, for example:

array_state_store
set_in_sync
if (!mddev->in_sync) -> in_sync is used for all levels
// access writes_pending

There might be some implicit dependency that I don't recognized to make
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.

By the way, it make sense to initialize 'writes_pending' in common layer
because there are already three levels use it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com


# d58eff83 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: initialize 'active_io' while allocating mddev

'active_io' is used for mddev_suspend() and it's initialized in
md_run(), this restrict that 'reconfig_mutex' must be held and
"mddev->pers" must be set before calling mddev_suspend().

Initialize 'active_io' early so that mddev_suspend() is safe to call
once mddev is allocated, this will be helpful to refactor
mddev_suspend() in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com


# ac619781 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: use separate work_struct for md_start_sync()

It's a little weird to borrow 'del_work' for md_start_sync(), declare
a new work_struct 'sync_work' for md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com


# 9f0f5a30 27-Sep-2023 Jan Kara <jack@suse.cz>

md: Convert to bdev_open_by_dev()

Convert md to use bdev_open_by_dev() and pass the handle around. We also
don't need the 'Holder' flag anymore so remove it.

CC: linux-raid@vger.kernel.org
CC: Song Liu <song@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-11-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 99892147 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: fix warning for holder mismatch from export_rdev()

Commit a1d767191096 ("md: use mddev->external to select holder in
export_rdev()") fix the problem that 'claim_rdev' is used for
blkdev_get_by_dev() while 'rdev' is used for blkdev_put().

However, if mddev->external is changed from 0 to 1, then 'rdev' is used
for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). And
this problem can be reporduced reliably by following:

New file: mdadm/tests/23rdev-lifetime

devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2

clean_up_test() {
pill -9 $pid
echo clear > /sys/block/md0/md/array_state
}

trap 'clean_up_test' EXIT

add_by_sysfs() {
while true; do
echo $devt > /sys/block/md0/md/new_dev
done
}

remove_by_sysfs(){
while true; do
echo remove > /sys/block/md0/md/dev-${devname}/state
done
}

echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"

add_by_sysfs &
pid="$pid $!"

remove_by_sysfs &
pid="$pid $!"

sleep $runtime
exit 0

Test cmd:

./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime

Test result:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330
Modules linked in: multipath md_mod loop
CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50
RIP: 0010:blkdev_put+0x27c/0x330
Call Trace:
<TASK>
export_rdev.isra.23+0x50/0xa0 [md_mod]
mddev_unlock+0x19d/0x300 [md_mod]
rdev_attr_store+0xec/0x190 [md_mod]
sysfs_kf_write+0x52/0x70
kernfs_fop_write_iter+0x19a/0x2a0
vfs_write+0x3b5/0x770
ksys_write+0x74/0x150
__x64_sys_write+0x22/0x30
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix the problem by recording if 'rdev' is used as holder.

Fixes: a1d767191096 ("md: use mddev->external to select holder in export_rdev()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825025532.1523008-3-yukuai1@huaweicloud.com


# 7eb8ff02 03-Aug-2023 Li Lingfeng <lilingfeng3@huawei.com>

md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread

Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
action_store"") removed the scenario of calling md_unregister_thread()
without holding mddev->reconfig_mutex, so add a lock holding check before
acquiring mddev->sync_thread by passing mdev to md_unregister_thread().

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>


# c687297b 21-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: also clone new io if io accounting is disabled

Currently, 'active_io' is grabbed before make_reqeust() is called, and
it's dropped immediately make_reqeust() returns. Hence 'active_io'
actually means io is dispatching, not io is inflight.

For raid0 and raid456 that io accounting is enabled, 'active_io' will
also be grabbed when bio is cloned for io accounting, and this 'active_io'
is dropped until io is done.

Always clone new bio so that 'active_io' will mean that io is inflight,
raid1 and raid10 will switch to use this method in later patches.

Now that bio will be cloned even if io accounting is disabled, also
rename related structure from '*_acct_*' to '*_clone_*'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com


# c567c86b 21-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: move initialization and destruction of 'io_acct_set' to md.c

'io_acct_set' is only used for raid0 and raid456, prepare to use it for
raid1 and raid10, so that io accounting from different levels can be
consistent.

By the way, follow up patches will also use this io clone mechanism to
make sure 'active_io' represents in flight io, not io that is dispatching,
so that mddev_suspend will wait for io to be done as designed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com


# 130443d6 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: refactor idle/frozen_sync_thread() to fix deadlock

Our test found a following deadlock in raid10:

1) Issue a normal write, and such write failed:

raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry

// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

2) Trigger a recovery:

raid10_sync_request
raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

action_store
mddev_lock
md_unregister_thread
kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence cyclic dependency exist, in order to fix the problem, we must
break one of them. Dependency 1 and 2 can't be broken because they are
foundation design. Dependency 4 may be possible if it can be guaranteed
that no io can be inflight, however, this requires a new mechanism which
seems complex. Dependency 3 is a good choice, because idle/frozen only
requires sync thread to finish, which can be done asynchronously that is
already implemented, and 'reconfig_mutex' is not needed anymore.

This patch switch 'idle' and 'frozen' to wait sync thread to be done
asynchronously, and this patch also add a sequence counter to record how
many times sync thread is done, so that 'idle' won't keep waiting on new
started sync thread.

Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
deadlock can be fixed by this patch as well.

[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com


# 6f56f0c4 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md: add a mutex to synchronize idle and frozen in action_store()

Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop sync thread, however, this will
cause deadlock (explained in the next patch). In order to fix the
problem, following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() does.

Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems, for example,
frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which
might starve in progress frozen. A mutex is added to synchronize idle
and frozen from action_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com


# 4934b640 21-Jun-2023 Yu Kuai <yukuai3@huawei.com>

md: fix 'delete_mutex' deadlock

Commit 3ce94ce5d05a ("md: fix duplicate filename for rdev") introduce a
new lock 'delete_mutex', and trigger a new deadlock:

t1: remove rdev t2: sysfs writer

rdev_attr_store rdev_attr_store
mddev_lock
state_store
md_kick_rdev_from_array
lock delete_mutex
list_add mddev->deleting
unlock delete_mutex
mddev_unlock
mddev_lock
...
lock delete_mutex
kobject_del
// wait for sysfs writers to be done
mddev_unlock
lock delete_mutex
// wait for delete_mutex, deadlock

'delete_mutex' is used to protect the list 'mddev->deleting', turns out
that this list can be protected by 'reconfig_mutex' directly, and this
lock can be removed.

Fix this problem by removing the lock, and use 'reconfig_mutex' to
protect the list. mddev_unlock() will move this list to a local list to
be handled after 'reconfig_mutex' is dropped.

Fixes: 3ce94ce5d05a ("md: fix duplicate filename for rdev")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com


# a022325a 29-May-2023 Yu Kuai <yukuai3@huawei.com>

md/md-bitmap: add a new helper to unplug bitmap asynchrously

If bitmap is enabled, bitmap must update before submitting write io, this
is why unplug callback must move these io to 'conf->pending_io_list' if
'current->bio_list' is not empty, which will suffer performance
degradation.

A new helper md_bitmap_unplug_async() is introduced to submit bitmap io
in a kworker, so that submit bitmap io in raid10_unplug() doesn't require
that 'current->bio_list' is empty.

This patch prepare to limit the number of plugged bio.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com


# 44693154 22-May-2023 Yu Kuai <yukuai3@huawei.com>

md: protect md_thread with rcu

Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:

1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.

Fix this problem by protecting all md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com


# 3ce94ce5 22-May-2023 Yu Kuai <yukuai3@huawei.com>

md: fix duplicate filename for rdev

Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev, however, this
introduces a window that rdev can be added again while the deletion is
not done yet, and sysfs will complain about duplicate filename.

Follow up patches try to fix this problem by flushing workqueue, however,
flush_rdev_wq() is just dead code, the progress in
md_kick_rdev_from_array():

1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);

So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
never pass, in the meantime, if work is queued, then rdev can never be
found in the list.

flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
this approach is not good:
- the workqueue is global, this synchronization for all raid disks is
not necessary.
- flush_workqueue can't be called under 'reconfig_mutex', there is still
a small window between flush_workqueue() and mddev_lock() that other
contexts can queue new work, hence the problem is not solved completely.

sysfs already has apis to support delete itself through writer, and
these apis, specifically sysfs_break/unbreak_active_protection(), is used
to support deleting rdev synchronously. Therefore, the above commit can be
reverted, and sysfs duplicate filename can be avoided.

A new mdadm regression test is proposed as well([1]).

[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/

Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com


# 3e00777d 11-May-2023 Yu Kuai <yukuai3@huawei.com>

md: add a new api prepare_suspend() in md_personality

There are no functional changes, the new api will be used later to do
special handling for raid456 in md_suspend().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-5-yukuai1@huaweicloud.com


# 431e6125 11-May-2023 Yu Kuai <yukuai3@huawei.com>

md: export md_is_rdwr() and is_md_suspended()

The two apis will be used later to fix a deadlock in raid456, there are
no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-4-yukuai1@huaweicloud.com


# c31fea2f 06-Mar-2023 Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

md: add error_handlers for raid0 and linear

After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10")
MD_BROKEN must be set if array is failed because state_store() checks it.
If it is set then -EBUSY is returned to userspace.

For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
mdadm is unable to trigger clean-up actions. It is a regression.

This patch adds appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for this device.

Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com


# 76fed014 02-Feb-2023 Xiao Ni <xni@redhat.com>

md: account io_acct_set usage with active_io

io_acct_set was enabled for raid0/raid5 io accounting. bios that contain
md_io_acct are allocated in the i/o path. There isn't a good method to
monitor if these bios are all finished and freed. In the takeover process,
io_acct_set (which is used for bios with md_io_acct) need to be freed.
However, if some bios finish after io_acct_set is freed, it may trigger
the following panic:

[ 6973.767999] RIP: 0010:mempool_free+0x52/0x80
[ 6973.786098] Call Trace:
[ 6973.786549] md_end_io_acct+0x31/0x40
[ 6973.787227] blk_update_request+0x224/0x380
[ 6973.787994] blk_mq_end_request+0x1a/0x130
[ 6973.788739] blk_complete_reqs+0x35/0x50
[ 6973.789456] __do_softirq+0xd7/0x2c8
[ 6973.790114] ? sort_range+0x20/0x20
[ 6973.790763] run_ksoftirqd+0x2a/0x40
[ 6973.791400] smpboot_thread_fn+0xb5/0x150
[ 6973.792114] kthread+0x10b/0x130
[ 6973.792724] ? set_kthread_struct+0x50/0x50
[ 6973.793491] ret_from_fork+0x1f/0x40

Fix this by increasing and decreasing active_io for each bio with
md_io_acct so that mddev_suspend() will wait until all bios from
io_acct_set finish before freeing io_acct_set.

Reported-by: Fine Fan <ffan@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# 72adae23 30-Jan-2023 Xiao Ni <xni@redhat.com>

md: Change active_io to percpu

Now the type of active_io is atomic. It's used to count how many ios are
in the submitting process and it's added and decreased very time. But it
only needs to check if it's zero when suspending the raid. So we can
switch atomic to percpu to improve the performance.

After switching active_io to percpu type, we use the state of active_io
to judge if the raid device is suspended. And we don't need to wake up
->sb_wait in md_handle_request anymore. It's done in the callback function
which is registered when initing active_io. The argument mddev->suspended
is only used to count how many users are trying to set raid to suspend
state.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# d57d9d69 29-Nov-2022 Christoph Hellwig <hch@lst.de>

md: mark md_kick_rdev_from_array static

md_kick_rdev_from_array is only used in md.c, so unexport it and mark
the symbol static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>


# 34cb92c0 23-Jul-2022 Christoph Hellwig <hch@lst.de>

md: return the allocated devices from md_alloc

Two callers of md_alloc want to use the newly allocated devices, so
return it instead of letting them find it cumbersomely after the
allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 12a6caf2 19-Jul-2022 Christoph Hellwig <hch@lst.de>

md: only delete entries from all_mddevs when the disk is freed

This ensures device names don't get prematurely reused. Instead add a
deleted flag to skip already deleted devices in mddev_get and other
places that only want to see live mddevs.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 05ce7fb9 31-May-2022 Chris Webb <chris@arachsys.com>

md: Explicitly create command-line configured devices

Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.

Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.

Signed-off-by: Chris Webb <chris@arachsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# eac58d08 08-Jun-2022 Logan Gunthorpe <logang@deltatee.com>

md: Use enum for overloaded magic numbers used by mddev->curr_resync

Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.

The only functional change is a couple places use the wrong comparison
operator that implied 3 is another special value. They are all
fixed to imply that 3 or greater is an active resync.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4ce4c73f 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

md/core: Combine two sync_page_io() arguments

Improve uniformity in the kernel of handling of request operation and
flags by passing these as a single argument.

Cc: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d0a18034 06-Jun-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

Revert "md: don't unregister sync_thread with reconfig_mutex held"

The 07reshape5intr test is broke because of below path.

md_reap_sync_thread
-> mddev_unlock
-> md_unregister_thread(&mddev->sync_thread)

And md_check_recovery is triggered by,

mddev_unlock -> md_wakeup_thread(mddev->thread)

then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
reshape_position can't be updated accordingly.

Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 8b48ec23 12-Feb-2021 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: don't unregister sync_thread with reconfig_mutex held

Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.

And it could cause deadlock problem for raid5 as follows:

1. process A tried to reap sync thread with reconfig_mutex held after echo
idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <song@kernel.org>


# 9631abdb 22-Mar-2022 Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

md: Set MD_BROKEN for RAID1 and RAID10

There is no direct mechanism to determine raid failure outside
personality. It is done by checking rdev->flags after executing
md_error(). If "faulty" flag is not set then -EBUSY is returned to
userspace. -EBUSY means that array will be failed after drive removal.

Mdadm has special routine to handle the array failure and it is executed
if -EBUSY is returned by md.

There are at least two known reasons to not consider this mechanism
as correct:
1. drive can be removed even if array will be failed[1].
2. -EBUSY seems to be wrong status. Array is not busy, but removal
process cannot proceed safe.

-EBUSY expectation cannot be removed without breaking compatibility
with userspace. In this patch first issue is resolved by adding support
for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
next commit.

The idea is to set the MD_BROKEN if we are sure that raid is in failed
state now. This is done in each error_handler(). In md_error() MD_BROKEN
flag is checked. If is set, then -EBUSY is returned to userspace.

As in previous commit, it causes that #mdadm --set-faulty is able to
fail array. Previously proposed workaround is valid if optional
functionality[1] is disabled.

[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
RAID1/RAID10.")

Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>


# 10fa225c 09-Feb-2022 Christoph Hellwig <hch@lst.de>

scsi: md: Remove WRITE_SAME support

There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.

Link: https://lore.kernel.org/r/20220209082828.2629273-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


# 0c031fd3 10-Dec-2021 Xiao Ni <xni@redhat.com>

md: Move alloc/free acct bioset in to personality

bioset acct is only needed for raid0 and raid5. Therefore, md_run only
allocates it for raid0 and raid5. However, this does not cover
personality takeover, which may cause uninitialized bioset. For example,
the following repro steps:

mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1
mdadm --wait /dev/md0
mkfs.xfs /dev/md0
mdadm /dev/md0 --grow -l5
mount /dev/md0 /mnt

causes panic like:

[ 225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 225.934903] #PF: supervisor instruction fetch in kernel mode
[ 225.935639] #PF: error_code(0x0010) - not-present page
[ 225.936361] PGD 0 P4D 0
[ 225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
[ 225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706
[ 225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014
[ 225.939922] RIP: 0010:0x0
[ 225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246
[ 225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39
[ 225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800
[ 225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01
[ 225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58
[ 225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040
[ 225.946709] FS: 00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000
[ 225.947814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0
[ 225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 225.951414] Call Trace:
[ 225.951787] <TASK>
[ 225.952120] mempool_alloc+0xe5/0x250
[ 225.952625] ? mempool_resize+0x370/0x370
[ 225.953187] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 225.953862] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 225.954464] ? sched_clock_cpu+0x15/0x120
[ 225.955019] ? find_held_lock+0xac/0xd0
[ 225.955564] bio_alloc_bioset+0x1ed/0x2a0
[ 225.956080] ? lock_downgrade+0x3a0/0x3a0
[ 225.956644] ? bvec_alloc+0xc0/0xc0
[ 225.957135] bio_clone_fast+0x19/0x80
[ 225.957651] raid5_make_request+0x1370/0x1b70
[ 225.958286] ? sched_clock_cpu+0x15/0x120
[ 225.958797] ? __lock_acquire+0x8b2/0x3510
[ 225.959339] ? raid5_get_active_stripe+0xce0/0xce0
[ 225.959986] ? lock_is_held_type+0xd8/0x130
[ 225.960528] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 225.961135] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 225.961703] ? sched_clock_cpu+0x15/0x120
[ 225.962232] ? lock_release+0x27a/0x6c0
[ 225.962746] ? do_wait_intr_irq+0x130/0x130
[ 225.963302] ? lock_downgrade+0x3a0/0x3a0
[ 225.963815] ? lock_release+0x6c0/0x6c0
[ 225.964348] md_handle_request+0x342/0x530
[ 225.964888] ? set_in_sync+0x170/0x170
[ 225.965397] ? blk_queue_split+0x133/0x150
[ 225.965988] ? __blk_queue_split+0x8b0/0x8b0
[ 225.966524] ? submit_bio_checks+0x3b2/0x9d0
[ 225.967069] md_submit_bio+0x127/0x1c0
[...]

Fix this by moving alloc/free of acct bioset to pers->run and pers->free.

While we are on this, properly handle md_integrity_register() error in
raid0_run().

Fixes: daee2024715d (md: check level before create and exit io_acct_set)
Cc: stable@vger.kernel.org
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>


# 54679486 04-Oct-2021 Guoqing Jiang <guoqing.jiang@linux.dev>

md: remove unused argument from md_new_event

Actually, mddev is not used by md_new_event.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 50b4aecf 09-Aug-2021 Christoph Hellwig <hch@lst.de>

block: remove GENHD_FL_UP

Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c32dc040 28-May-2021 Rikard Falkeborn <rikard.falkeborn@gmail.com>

md: Constify attribute_group structs

The attribute_group structs are never modified, they're only passed to
sysfs_create_group() and sysfs_remove_group(). Make them const to allow
the compiler to put them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>


# 10764815 25-May-2021 Guoqing Jiang <jgq516@gmail.com>

md: add io accounting for raid0 and raid5

We introduce a new bioset (io_acct_set) for raid0 and raid5 since they
don't own clone infrastructure to accounting io. And the bioset is added
to mddev instead of to raid0 and raid5 layer, because with this way, we
can put common functions to md.h and reuse them in raid0 and raid5.

Also struct md_io_acct is added accordingly which includes io start_time,
the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct
to get related io status.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>


# ad3fc798 25-May-2021 Guoqing Jiang <jgq516@gmail.com>

md: revert io stats accounting

The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause
double fault problem per the report [1], and also it is not correct to
change ->bi_end_io if md don't own it, so let's revert it.

And io stats accounting will be replemented in later commits.

[1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t

Fixes: 41d2d848e5c0 ("md: improve io stats accounting")
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>


# a8698707 25-May-2021 Christoph Hellwig <hch@lst.de>

block: move bd_mutex to struct gendisk

Replace the per-block device bd_mutex with a per-gendisk open_mutex,
thus simplifying locking wherever we deal with partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cf78408f 04-Feb-2021 Xiao Ni <xni@redhat.com>

md: add md_submit_discard_bio() for submitting discard bio

Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# a78f18da 26-Jan-2021 Christoph Hellwig <hch@lst.de>

md: remove bio_alloc_mddev

bio_alloc_mddev is never called with a NULL mddev, and ->bio_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 309dca30 24-Jan-2021 Christoph Hellwig <hch@lst.de>

block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 77a68698 13-Dec-2020 Mike Snitzer <snitzer@redhat.com>

Revert "md: change mddev 'chunk_sectors' from int to unsigned"

This reverts commit 6ffeb1c3f8226244c08105bcdbeecc04bad6b89a.

This change caused unexpected v5.10 raid6 mount failures, see:
https://lkml.org/lkml/2020/12/14/7

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6ffeb1c3 12-Dec-2020 Mike Snitzer <snitzer@redhat.com>

md: change mddev 'chunk_sectors' from int to unsigned

Commit e2782f560c29 ("Revert "dm raid: remove unnecessary discard
limits for raid10"") exposed compiler warnings introduced by commit
e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10"):

In file included from ./include/linux/kernel.h:14,
from ./include/asm-generic/bug.h:20,
from ./arch/x86/include/asm/bug.h:93,
from ./include/linux/bug.h:5,
from ./include/linux/mmdebug.h:5,
from ./include/linux/gfp.h:5,
from ./include/linux/slab.h:15,
from drivers/md/dm-raid.c:8:
drivers/md/dm-raid.c: In function ‘raid_io_hints’:
./include/linux/minmax.h:18:28: warning: comparison of distinct pointer types lacks a cast
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
^~
./include/linux/minmax.h:32:4: note: in expansion of macro ‘__typecheck’
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~
./include/linux/minmax.h:42:24: note: in expansion of macro ‘__safe_cmp’
__builtin_choose_expr(__safe_cmp(x, y), \
^~~~~~~~~~
./include/linux/minmax.h:51:19: note: in expansion of macro ‘__careful_cmp’
#define min(x, y) __careful_cmp(x, y, <)
^~~~~~~~~~~~~
./include/linux/minmax.h:84:39: note: in expansion of macro ‘min’
__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
^~~
drivers/md/dm-raid.c:3739:33: note: in expansion of macro ‘min_not_zero’
limits->max_discard_sectors = min_not_zero(rs->md.chunk_sectors,
^~~~~~~~~~~~

Fix this by changing the chunk_sectors member of 'struct mddev' from
int to 'unsigned int' to match the type used for the 'chunk_sectors'
member of 'struct queue_limits'. Various MD code still uses 'int' but
none of it appears to ever make use of signed int; and storing
positive signed int in unsigned is perfectly safe.

Reported-by: Song Liu <songliubraving@fb.com>
Fixes: e2782f560c29 ("Revert "dm raid: remove unnecessary discard limits for raid10"")
Fixes: e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10")
Cc: stable@vger,kernel.org # e0910c8e4f87 was marked for stable@
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 57a0f3a8 09-Dec-2020 Song Liu <songliubraving@fb.com>

Revert "md: add md_submit_discard_bio() for submitting discard bio"

This reverts commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0.

Matthew Ruffell reported data corruption in raid10 due to the changes
in discard handling [1]. Revert these changes before we find a proper fix.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 81ba3c24 10-Nov-2020 Pankaj Gupta <pankaj.gupta@cloud.ionos.com>

md: improve variable names in md_flush_request()

This patch improves readability by using better variable names
in flush request coalescing logic.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 4245e52d 02-Sep-2020 Christoph Hellwig <hch@lst.de>

md: don't detour through bd_contains for the gendisk

bd_disk is set on all block devices, including those for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2628089b 24-Aug-2020 Xiao Ni <xni@redhat.com>

md: add md_submit_discard_bio() for submitting discard bio

Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# de09077c 01-Sep-2020 Christoph Hellwig <hch@lst.de>

block: remove revalidate_disk()

Remove the now unused helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7e0adbfc 07-Jun-2020 Christoph Hellwig <hch@lst.de>

md: rewrite md_setup_drive to avoid ioctls

md_setup_drive knows it works with md devices, so it is rather pointless
to open a file descriptor and issue ioctls. Just call directly into the
relevant low-level md routines after getting a handle to the device using
blkdev_get_by_dev instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a6a0506 07-Jun-2020 Christoph Hellwig <hch@lst.de>

md: remove the kernel version of md_u.h

mdp_major can just move to drivers/md/md.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>


# d82fa81c 06-Jun-2020 Christoph Hellwig <hch@lst.de>

md: replace the RAID_AUTORUN ioctl with a direct function call

Instead of using a spcial RAID_AUTORUN ioctl that only exists for
non-modular builds and is only called from the early init code, just
call the actual function directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>


# e1a86dbb 14-Jul-2020 Junxiao Bi <junxiao.bi@oracle.com>

md: fix deadlock causing by sysfs_notify

The following deadlock was captured. The first process is holding 'kernfs_mutex'
and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device,
this pending bio list would be flushed by second process 'md127_raid1', but
it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace
sysfs_notify() can fix it. There were other sysfs_notify() invoked from io
path, removed all of them.

PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file"
#0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
#1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
#2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
#3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
#4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
#5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
#6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
#7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
#8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
#9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
#10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
#11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
#12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
#13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
#14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
#15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
#16 [ffffb87c4df37958] new_slab at ffffffff9a251430
#17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
#18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
#19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
#20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
#21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
#22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
#23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
#24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
#25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
#26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
#27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
#28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
#29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
#30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
#31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
#32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
#33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905
RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890
RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c
R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0
R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1"
#0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
#1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
#2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
#3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
#4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
#5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
#6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
#7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
#8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
#9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
#10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
#11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
#12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
#13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
#14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 41d2d848 03-Jul-2020 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: improve io stats accounting

Use generic io accounting functions to manage io stats. There was an
attempt to do this earlier in commit 18c0b223cf99 ("md: use generic io
stats accounting functions to simplify io stat accounting"), but it did
not include a call to generic_end_io_acct() and caused issues with
tracking in-flight IOs, so it was later removed in commit 74672d069b29
("md: fix md io stats accounting broken").

This patch attempts to fix this by using both disk_start_io_acct() and
disk_end_io_acct(). To make it possible, a struct md_io is allocated for
every new md bio, which includes the io start_time. A new mempool is
introduced for this purpose. We override bio->bi_end_io with our own
callback and call disk_start_io_acct() before passing the bio to
md_handle_request(). When it completes, we call disk_end_io_acct() and
the original bi_end_io callback.

This adds correct statistics about in-flight IOs and IO processing time,
interpreted e.g. in iostat as await, svctm, aqu-sz and %util.

It also fixes a situation where too many IOs where reported if a bio was
re-submitted to the mddev, because io accounting is now performed only
on newly arriving bios.

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 21cf8661 01-Jul-2020 Christoph Hellwig <hch@lst.de>

writeback: remove bdi->congested_fn

Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 78f57ef9 09-Apr-2020 Coly Li <colyli@suse.de>

md: use memalloc scope APIs in mddev_suspend()/mddev_resume()

In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO
flag, then it is sent into kvmalloc_array() inside scribble_alloc().

The problem is kvmalloc_array() eventually calls kvmalloc_node() which
does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then
kmalloc_node() is called indeed to allocate physically continuous
pages. When system memory is under heavy pressure, and the requesting
size is large, there is high probability that allocating continueous
pages will fail.

But simply using GFP_KERNEL flag to call kvmalloc_array() is also
progblematic. In the code path where scribble_alloc() is called, the
raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspend raid array, deadlock will happen.

What is desired here is to allocate non-physically (a.k.a virtually)
continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests
to use the mmealloc sceope APIs to restrict memory reclaim I/O in
allocating context, specifically to call memalloc_noio_save() when
suspend the raid array and to call memalloc_noio_restore() when
resume the raid array.

This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os during the raid array
is suspended. The benifit of adding the memalloc scope API in the
unified entry point mddev_suspend()/mddev_resume() is, no matter which
md raid array type (personality), we are sure the deadlock by recursive
memory reclaim I/O won't happen on the suspending context.

Please notice that the memalloc scope APIs only take effect on the raid
array suspending context, if the memory allocation is from another new
created kthread after raid array suspended, the recursive memory reclaim
I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are
used for the critical section where the raid metadata is modifying,
creating a kthread to allocate memory inside the critical section is
queer and very probably being buggy.

Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 69b00b5b 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: introduce a new struct for IO serialization

Obviously, IO serialization could cause the degradation of
performance a lot. In order to reduce the degradation, so a
rb interval tree is added in raid1 to speed up the check of
collision.

So, a rb root is needed in md_rdev, then abstract all the
serialize related members to a new struct (serial_in_rdev),
embed it into md_rdev.

Of course, we need to free the struct if it is not needed
anymore, so rdev/rdevs_uninit_serial are added accordingly.
And they should be called when destroty memory pool or can't
alloc memory.

And we need to consider to call mddev_destroy_serial_pool
in case serialize_policy/write-behind is disabled, bitmap
is destroyed or in __md_stop_writes.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3938f5fb 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: add serialize_policy sysfs node for raid1

With the new sysfs node, we can use it to control if raid1 array
wants io serialization or not. So mddev_create_serial_pool and
mddev_destroy_serial_pool are called in serialize_policy_store
to enable or disable the serialization.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 11d3a9f6 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: prepare for enable raid1 io serialization

1. The related resources (spin_lock, list and waitqueue) are needed for
address raid1 reorder overlap issue too, in this case, rdev is set to
NULL for mddev_create/destroy_serial_pool which implies all rdevs need
to handle these resources.

And also add "is_suspend" to mddev_destroy_serial_pool since it will
be called under suspended situation, which also makes both create and
destroy pool have same arguments.

2. Introduce rdevs_init_serial which is called if raid1 io serialization
is enabled since all rdevs need to init related stuffs.

3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
be called between suspend and resume.

No need to export mddev_create_serial_pool since it is only called in
md-mod module.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 404659cf 23-Dec-2019 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: rename wb stuffs

Previously, wb_info_pool and wb_list stuffs are introduced
to address potential data inconsistence issue for write
behind device.

Now rename them to serial related name, since the same
mechanism will be used to address reorder overlap write
issue for raid1.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 775d7831 16-Sep-2019 David Jeffery <djeffery@redhat.com>

md: improve handling of bio with REQ_PREFLUSH in md_flush_request()

If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate if the original make_request call
should continue to handle the I/O and instead of assuming the flush logic
will push it to completion.

Convert md_flush_request to return a bool and no longer calls the raid
driver's make_request function. If the return is true, then the md flush
logic has or will complete the bio and the md make_request call is done.
If false, then the md make_request function needs to keep processing like
it is a normal bio. Let the original call to md_handle_request handle any
need to retry sending the bio to the raid driver's make_request function
should it be needed.

Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.

Fixes: 2bc13b83e629 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 62f7b198 03-Sep-2019 Guilherme G. Piccoli <gpiccoli@canonical.com>

md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone

Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.

In other words, no -EIO is returned and writes (except direct ones) appear
normal. Meaning the user might think the wrote data is correctly stored in
the array, but instead garbage was written given that raid0 does stripping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.

With this patch, the filesystem reacts much faster to the event of missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.

Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 9d4b45d6 19-Aug-2019 NeilBrown <neilb@suse.com>

md: don't report active array_state until after revalidate_disk() completes.

Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready. As soon as
it appear to be zero, fsck can be run. If it find the size to be
zero, it will fail.

So add a new flag to provide an interlock between do_md_run() and
array_state_show(). This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared. This ensure the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY cleared twice. This is deliberate to cover
both success and error paths with minimal noise.

Fixes: b7b17c9b67e5 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 9a567843 24-Jul-2019 Guoqing Jiang <jgq516@gmail.com>

md: allow last device to be forcibly removed from RAID1/RAID10.

When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed. This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action. This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.

Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 963c555e 14-Jun-2019 Guoqing Jiang <gqjiang@suse.com>

md: introduce mddev_create/destroy_wb_pool for the change of member device

Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.

Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.

And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# 3e148a32 19-Jun-2019 Guoqing Jiang <gqjiang@suse.com>

md/raid1: fix potential data inconsistency issue with write behind device

For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.

But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.

To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.

And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.

But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.

1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# af1a8899 20-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2bc13b83 29-Mar-2019 NeilBrown <neilb@suse.com>

md: batch flush requests.

Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.

Test results from Xiao:

Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 768 10.509 78.305
Flush 2078376 0.013 10.094
Close 21787697 0.019 18.821
LockX 96580 0.007 3.184
Mkdir 384 0.008 0.062
Rename 1255883 0.191 23.534
ReadX 46495589 0.020 14.230
WriteX 14790591 7.123 60.706
Unlink 5989118 0.440 54.551
UnlockX 96580 0.005 2.736
FIND_FIRST 10393845 0.042 12.079
SET_FILE_INFORMATION 2415558 0.129 10.088
QUERY_FILE_INFORMATION 4711725 0.005 8.462
QUERY_PATH_INFORMATION 26883327 0.032 21.715
QUERY_FS_INFORMATION 4929409 0.010 8.238
NTCreateX 29660080 0.100 53.268

Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 256 8.326 36.761
Flush 693291 3.974 180.269
Close 7266404 0.009 36.929
LockX 32160 0.006 0.840
Mkdir 128 0.008 0.021
Rename 418755 0.063 29.945
ReadX 15498708 0.007 7.216
WriteX 4932310 22.482 267.928
Unlink 1997557 0.109 47.553
UnlockX 32160 0.004 1.110
FIND_FIRST 3465791 0.036 7.320
SET_FILE_INFORMATION 805825 0.015 1.561
QUERY_FILE_INFORMATION 1570950 0.005 2.403
QUERY_PATH_INFORMATION 8965483 0.013 14.277
QUERY_FS_INFORMATION 1643626 0.009 3.314
NTCreateX 9892174 0.061 41.278

Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
max_latency=267.939 m

3. With patch1 and patch2
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 768 9.570 54.588
Flush 2061354 0.666 15.102
Close 21604811 0.012 25.697
LockX 95770 0.007 1.424
Mkdir 384 0.008 0.053
Rename 1245411 0.096 12.263
ReadX 46103198 0.011 12.116
WriteX 14667988 7.375 60.069
Unlink 5938936 0.173 30.905
UnlockX 95770 0.005 4.147
FIND_FIRST 10306407 0.041 11.715
SET_FILE_INFORMATION 2395987 0.048 7.640
QUERY_FILE_INFORMATION 4672371 0.005 9.291
QUERY_PATH_INFORMATION 26656735 0.018 19.719
QUERY_FS_INFORMATION 4887940 0.010 7.654
NTCreateX 29410811 0.059 28.551

Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4bc034d3 29-Mar-2019 NeilBrown <neilb@suse.com>

Revert "MD: fix lock contention for flush bios"

This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85.

This patch has two problems.

1/ it make multiple calls to submit_bio() from inside a make_request_fn.
The bios thus submitted will be queued on current->bio_list and not
submitted immediately. As the bios are allocated from a mempool,
this can theoretically result in a deadlock - all the pool of requests
could be in various ->bio_list queues and a subsequent mempool_alloc
could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
It handles this by submitting many requests in parallel - all of which
are identical and so most of which do nothing useful.
It would be more efficient to just send one lower-level request, but
allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7564beda 18-Oct-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster/raid10: support add disk under grow mode

For clustered raid10 scenario, we need to let all the nodes
know about that a new disk is added to the array, and the
reshape caused by add new member just need to be happened in
one node, but other nodes should know about the change.

Since reshape means read data from somewhere (which is already
used by array) and write data to unused region. Obviously, it
is awful if one node is reading data from address while another
node is writing to the same address. Considering we have
implemented suspend writes in the resyncing area, so we can
just broadcast the reading address to other nodes to avoid the
trouble.

For master node, it would call reshape_request then update sb
during the reshape period. To avoid above trouble, we call
resync_info_update to send RESYNC message in reshape_request.

Then from slave node's view, it receives two type messages:
1. RESYNCING message
Slave node add the address (where master node reading data from)
to suspend list.

2. METADATA_UPDATED message
Once slave nodes know the reshaping is started in master node,
it is time to update reshape position and call start_reshape to
follow master node's step. After reshape is done, only reshape
position is need to be updated, so the majority task of reshaping
is happened on the master node.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 0357ba27 02-Jul-2018 Guoqing Jiang <gqjiang@suse.com>

md-cluster: show array's status more accurate

When resync or recovery is happening in one node,
other nodes don't show the appropriate info now.

For example, when create an array in master node
without "--assume-clean", then assemble the array
in slave nodes, you can see "resync=PENDING" when
read /proc/mdstat in slave nodes. However, the info
is confusing since "PENDING" status is introduced
for start array in read-only mode.

We introduce RESYNCING_REMOTE flag to indicate that
resync thread is running in remote node. The flags
is set when node receive RESYNCING msg. And we clear
the REMOTE flag in following cases:

1. resync or recover is finished in master node,
which means slaves receive msg with both lo
and hi are set to 0.
2. node continues resync/recovery in recover_bitmaps.
3. when resync_finish is called.

Then we show accurate information in status_resync
by check REMOTE flags and with other conditions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# afeee514 20-May-2018 Kent Overstreet <kent.overstreet@gmail.com>

md: convert to bioset_init()/mempool_init()

Convert md to embedded bio sets.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5a409b4f 20-May-2018 Xiao Ni <xni@redhat.com>

MD: fix lock contention for flush bios

There is a lock contention when there are many processes which send flush bios
to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.

Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
to be NULL, otherwise get mddev->lock.

This patch remove mddev->flush_bio and handle flush bio asynchronously.
I did a test with command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 11165 167.595 5879.560
Close 107469 1.391 2231.094
LockX 384 0.003 0.019
Rename 5944 2.141 1856.001
ReadX 208121 0.003 0.074
WriteX 98259 1925.402 15204.895
Unlink 25198 13.264 3457.268
UnlockX 384 0.001 0.009
FIND_FIRST 47111 0.012 0.076
SET_FILE_INFORMATION 12966 0.007 0.065
QUERY_FILE_INFORMATION 27921 0.004 0.085
QUERY_PATH_INFORMATION 124650 0.005 5.766
QUERY_FS_INFORMATION 22519 0.003 0.053
NTCreateX 141086 4.291 2502.812

Throughput 3.7181 MB/sec (sync open) 128 clients 128 procs max_latency=15204.905 ms

=================With the patch============================
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 4500 174.134 406.398
Close 48195 0.060 467.062
LockX 256 0.003 0.029
Rename 2324 0.026 0.360
ReadX 78846 0.004 0.504
WriteX 66832 562.775 1467.037
Unlink 5516 3.665 1141.740
UnlockX 256 0.002 0.019
FIND_FIRST 16428 0.015 0.313
SET_FILE_INFORMATION 6400 0.009 0.520
QUERY_FILE_INFORMATION 17865 0.003 0.089
QUERY_PATH_INFORMATION 47060 0.078 416.299
QUERY_FS_INFORMATION 7024 0.004 0.032
NTCreateX 55921 0.854 1141.452

Throughput 11.744 MB/sec (sync open) 128 clients 128 procs max_latency=1467.041 ms

The test is done on raid1 disk with two rotational disks

V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
version

V4: use address of fbio to do hash to choose free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
and uses a simple bitmap to record which resource is free.

Fix a bug from v2. It should set flush_pending to 1 at first.

V2:
Neil pointed out two problems. One is counting error problem and another is return value
when allocat memory fails.
1. counting error problem
This isn't safe. It is only safe to call rdev_dec_pending() on rdevs
that you previously called
atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.

Now it doesn't use bio_chain. It uses specified call back function for each
flush bio.
2. Returned on IO error when kmalloc fails is wrong.
I use mempool suggested by Neil in V2
3. Fixed some places pointed by Guoqing

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4b6c1060 02-Feb-2018 Heinz Mauelshagen <heinzm@redhat.com>

md: fix md_write_start() deadlock w/o metadata devices

If no metadata devices are configured on raid1/4/5/6/10
(e.g. via dm-raid), md_write_start() unconditionally waits
for superblocks to be written thus deadlocking.

Fix introduces mddev->has_superblocks bool, defines it in md_run()
and checks for it in md_write_start() to conditionally avoid waiting.

Once on it, check for non-existing superblocks in md_super_write().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647
Fixes: cc27b0c78c796 ("md: fix deadlock between mddev_suspend() and md_write_start()")

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# 1532d9e8 27-Dec-2017 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

raid5-ppl: PPL support for disks with write-back cache enabled

In order to provide data consistency with PPL for disks with write-back
cache enabled all data has to be flushed to disks before next PPL
entry. The disks to be flushed are marked in the bitmap. It's modified
under a mutex and it's only read after PPL io unit is submitted.

A limitation of 64 disks in the array has been introduced to keep data
structures and implementation simple. RAID5 arrays with so many disks are
not likely due to high risk of multiple disks failure. Such restriction
should not be a real life limitation.

With write-back cache disabled next PPL entry is submitted when data write
for current one completes. Data flush defers next log submission so trigger
it when there are no stripes for handling found.

As PPL assures all data is flushed to disk at request completion, just
acknowledge flush request when PPL is enabled.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>


# d5d885fd 19-Nov-2017 Song Liu <songliubraving@fb.com>

md: introduce new personality funciton start()

In do_md_run(), md threads should not wake up until the array is fully
initialized in md_run(). However, in raid5_run(), raid5-cache may wake
up mddev->thread to flush stripes that need to be written back. This
design doesn't break badly right now. But it could lead to bad bug in
the future.

This patch tries to resolve this problem by splitting start up work
into two personality functions, run() and start(). Tasks that do not
require the md threads should go into run(), while task that require
the md threads go into start().

r5l_load_log() is moved to raid5_start(), so it is not called until
the md threads are started in do_md_run().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# efa4b77b 18-Oct-2017 Shaohua Li <shli@fb.com>

md: use lockdep_assert_held

lockdep_assert_held is a better way to assert lock held, and it works
for UP.

Signed-off-by: Shaohua Li <shli@fb.com>


# b03e0ccb 18-Oct-2017 NeilBrown <neilb@suse.com>

md: remove special meaning of ->quiesce(.., 2)

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 35bfc521 16-Oct-2017 NeilBrown <neilb@suse.com>

md: allow metadata update while suspending.

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write request block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held. So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says the unlocked updates is
allowed, and one which ways it is happening. Then allow it
while the quiesce completes, and then wait for it to finish.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 393debc2 21-Sep-2017 Shaohua Li <shli@fb.com>

md: separate request handling

With commit cc27b0c78c79, pers->make_request could bail out without handling
the bio. If that happens, we should retry. The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ddc08823 16-Aug-2017 Pawel Baldysiak <pawel.baldysiak@intel.com>

md: Runtime support for multiple ppls

Increase PPL area to 1MB and use it as circular buffer to store PPL. The
entry with highest generation number is the latest one. If PPL to be
written is larger then space left in a buffer, rewind the buffer to the
start (don't wrap it).

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 74d46992 23-Aug-2017 Christoph Hellwig <hch@lst.de>

block: replace bi_bdev with a gendisk pointer and partitions index

This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# be453e77 14-Jul-2017 Ming Lei <ming.lei@redhat.com>

md: raid1-10: move raid1/raid10 common code into raid1-10.c

No function change, just move 'struct resync_pages' and related
helpers into raid1-10.c

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 022e510f 14-Jul-2017 Ming Lei <ming.lei@redhat.com>

md: remove 'idx' from 'struct resync_pages'

bio_add_page() won't fail for resync bio, and the page index for each
bio is same, so remove it.

More importantly the 'idx' of 'struct resync_pages' is initialized in
mempool allocator function, the current way is wrong since mempool is
only responsible for allocation, we can't use that for initialization.

Suggested-by: NeilBrown <neilb@suse.com>
Reported-by: NeilBrown <neilb@suse.com>
Reported-and-tested-by: Patrick <dto@gmx.net>
Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages)
Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages)
Cc: stable@vger.kernel.org (4.12+)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4aaf7694 03-Jul-2017 Guoqing Jiang <gqjiang@suse.com>

md/bitmap: don't read page from device with Bitmap_sync

The device owns Bitmap_sync flag needs recovery
to become in sync, and read page from this type
device could get stale status.

Also add comments for Bitmap_sync bit per the
suggestion from Shaohua and Neil.

Previous disscussion can be found here:
https://marc.info/?t=149760428900004&r=1&w=2

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 5a85071c 20-Jun-2017 NeilBrown <neilb@suse.com>

md: use a separate bio_set for synchronous IO.

md devices allocate a bio_set and use it for two
distinct purposes.
mddev->bio_set is used to clone bios as part of sending
upper level requests down to lower level devices,
and it is also use for synchronous IO such as superblock
and bitmap updates, and for correcting read errors.

This multiple usage can lead to deadlocks. It is likely
that cloned bios might be queued for write and to be
waiting for a metadata update before the write can be permitted.
If the cloning exhausted mddev->bio_set, the metadata update
may not be able to proceed.

This scenario has been seen during heavy testing, with lots of IO and
lots of memory pressure.

Address this by adding a new bio_set specifically for synchronous IO.
All synchronous IO goes directly to the underlying device and is not
queued at the md level, so request using entries from the new
mddev->sync_set will complete in a timely fashion.
Requests that use mddev->bio_set will sometimes need to wait
for synchronous IO, but will no longer risk deadlocking that iO.

Also: small simplification in mddev_put(): there is no need to
wait until the spinlock is released before calling bioset_free().

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# cc27b0c7 05-Jun-2017 NeilBrown <neilb@suse.com>

md: fix deadlock between mddev_suspend() and md_write_start()

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().

Reported-by: Nix <nix@esperi.org.uk>
Fix: 68866e425be2(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# a415c0f1 05-Jun-2017 NeilBrown <neilb@suse.com>

md: initialise ->writes_pending in personality modules.

The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it. This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 2214c260 08-May-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: don't return -EAGAIN in md_allow_write for external metadata arrays

This essentially reverts commit b5470dc5fc18 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 3deff1a7 05-Apr-2017 Christoph Hellwig <hch@lst.de>

md: support REQ_OP_WRITE_ZEROES

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 513e2faa 16-Mar-2017 Ming Lei <tom.leiming@gmail.com>

md: prepare for managing resync I/O pages in clean way

Now resync I/O use bio's bec table to manage pages,
this way is very hacky, and may not work any more
once multipage bvec is introduced.

So introduce helpers and new data structure for
managing resync I/O pages more cleanly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# d8e29fbc 16-Mar-2017 Ming Lei <tom.leiming@gmail.com>

md: move two macros into md.h

Both raid1 and raid10 share common resync
block size and page count, so move them into md.h.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 4ad23a97 14-Mar-2017 NeilBrown <neilb@suse.com>

MD: use per-cpu counter for writes_pending

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.

So we replace the atomic counter a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed. As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero. md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode. If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.

md_wakeup_thread() current always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid
calls to wake_up().

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 49728050 14-Mar-2017 NeilBrown <neilb@suse.com>

md/raid5: use md_write_start to count stripes, not bios

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ba903a3e 09-Mar-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

raid5-ppl: runtime PPL enabling or disabling

Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.

When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.

Enabling or disabling PPL is performed under a suspended array. The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ea0213e0 09-Mar-2017 Artur Paszkiewicz <artur.paszkiewicz@intel.com>

md: superblock changes for PPL

Include information about PPL location and size into mdp_superblock_1
and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
to mddev->flags to indicate that PPL is enabled on an array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 0ba95977 01-Mar-2017 Guoqing Jiang <gqjiang@suse.com>

md-cluster: use sync way to handle METADATA_UPDATED msg

Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
-> md_update_sb-metadata_update_start
(want EX on token however token is
got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove RELOAD_SB related codes since there are not valid anymore.
2. mddev is added into md_cluster_info then we can get mddev inside
lock_token.
3. add new parameter for lock_token to distinguish reconfig_mutex
is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
1. set it before unregister thread, otherwise a deadlock could
appear if stop a resyncing array.
This is because md_unregister_thread(&cinfo->recv_thread) is
blocked by recv_daemon -> process_recvd_msg
-> process_metadata_update.
To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
also need to be set before unregister thread.
2. set it in metadata_update_start to fix another deadlock.
a. Node A sends METADATA_UPDATED msg (held Token lock).
b. Node B wants to do resync, and is blocked since it can't
get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
not set since the callchain
(md_do_sync -> sync_request
-> resync_info_update
-> sendmsg
-> lock_comm -> lock_token)
doesn't hold reconfig_mutex.
c. Node B trys to update sb (held reconfig_mutex), but stopped
at wait_event() in metadata_update_start since we have set
MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
d. Then Node B receives METADATA_UPDATED msg from A, of course
recv_daemon is blocked forever.
Since metadata_update_start always calls lock_token with reconfig_mutex,
we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
lock_token don't need to set it twice unless lock_token is invoked from
lock_comm.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 99b3d74e 23-Feb-2017 Shaohua Li <shli@fb.com>

md: delete dead code

Nobody is using mddev_check_plugged(), so delete the dead code

Signed-off-by: Shaohua Li <shli@fb.com>


# d7a10308 14-Feb-2017 Ming Lei <tom.leiming@gmail.com>

md: fast clone bio in bio_clone_mddev()

Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.

Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 26483819 13-Feb-2017 Shaohua Li <shli@fb.com>

md: disable WRITE SAME if it fails in underlayer disks

This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.

We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581

Suggested-by: NeilBrown <neilb@suse.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 394ed8e4 04-Jan-2017 Shaohua Li <shli@fb.com>

md: cleanup mddev flag clear for takeover

Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 2953079c 08-Dec-2016 Shaohua Li <shli@fb.com>

md: separate flags for superblock changes

The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 46533ff7 17-Nov-2016 NeilBrown <neilb@suse.com>

md: Use REQ_FAILFAST_* on metadata writes where appropriate

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state. i.e. if marking a device Faulty would cause some
data to be inaccessible, the device is status is left as
non-Faulty. This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success. We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 688834e6 17-Nov-2016 NeilBrown <neilb@suse.com>

md/failfast: add failfast flag for md to be used by some personalities.

This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error. The underlying driver will be directed not to retry requests
that result in failures. There is a proviso that the device must not
be marked faulty if that would cause the array as a whole to fail, it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device. No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# be306c29 08-Nov-2016 NeilBrown <neilb@suse.com>

md: define mddev flags, recovery flags and r1bio state bits using enums

This is less error prone than using individual #defines.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 35b785f7 21-Oct-2016 Tomasz Majchrzak <tomasz.majchrzak@intel.com>

md: add bad block support for external metadata

Add new rdev flag which external metadata handler can use to switch
on/off bad block support. If new bad block is encountered, notify it via
rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
cleared, notify update to rdev 'bad_blocks' sysfs file.

When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.

Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# af8d8e6f 11-Aug-2016 Guoqing Jiang <gqjiang@suse.com>

md: changes for MD_STILL_CLOSED flag

When stop clustered raid while it is pending on resync,
MD_STILL_CLOSED flag could be cleared since udev rule
is triggered to open the mddev. So obviously array can't
be stopped soon and returns EBUSY.

mdadm -Ss md-raid-arrays.rules
set MD_STILL_CLOSED md_open()
... ... ... clear MD_STILL_CLOSED
do_md_stop

We make below changes to resolve this issue:

1. rename MD_STILL_CLOSED to MD_CLOSING since it is set
when stop array and it means we are stopping array.
2. let md_open returns early if CLOSING is set, so no
other threads will open array if one thread is trying
to close it.
3. no need to clear CLOSING bit in md_open because 1 has
ensure the bit is cleared, then we also don't need to
test CLOSING bit in do_md_stop.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 0e3ef49e 17-Jun-2016 Arnd Bergmann <arnd@arndb.de>

md: use seconds granularity for error logging

The md code stores the exact time of the last error in the
last_read_error variable using a timespec structure. It only
ever uses the seconds portion of that though, so we can
use a scalar for it.

There won't be an overflow in 2038 here, because it already
used monotonic time and 32-bit is enough for that, but I've
decided to use time64_t for consistency in the conversion.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>


# d787be40 02-Jun-2016 NeilBrown <neilb@suse.com>

md: reduce the number of synchronize_rcu() calls when multiple devices fail.

Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.

As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices. Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.

fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 28a8f0d3 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 796a5cf0 05-Jun-2016 Mike Christie <mchristi@redhat.com>

md: use bio op accessors

Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bb8bf15b 02-Jun-2016 Guoqing Jiang <gqjiang@suse.com>

md-cluster: fix deadlock issue when add disk to an recoverying array

Add a disk to an array which is performing recovery
is a little complicated, we need to do both reap the
sync thread and perform add disk for the case, then
it caused deadlock as follows.

linux44:~ # ps aux|grep md|grep D
root 1822 0.0 0.0 0 0 ? D 16:50 0:00 [md127_resync]
root 1848 0.0 0.0 19860 952 pts/0 D+ 16:50 0:00 mdadm --manage /dev/md127 --re-add /dev/vdb
linux44:~ # cat /proc/1848/stack
[<ffffffff8107afde>] kthread_stop+0x6e/0x120
[<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod]
[<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod]
[<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod]
[<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod]
[<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140
[<ffffffff811a6b98>] vfs_write+0xb8/0x1e0
[<ffffffff811a75b8>] SyS_write+0x48/0xa0
[<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
[<00007f068ea1ed30>] 0x7f068ea1ed30
linux44:~ # cat /proc/1822/stack
[<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod]
[<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod]
[<ffffffff8107ad94>] kthread+0xb4/0xc0
[<ffffffff8152a518>] ret_from_fork+0x58/0x90

Task1848 Task1822
md_attr_store (held reconfig_mutex by call mddev_lock())
action_store
md_reap_sync_thread
md_unregister_thread
kthread_stop md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING))

md_check_recovery is triggered by wakeup mddev->thread,
but it can't clear MD_CHANGE_PENDING flag since it can't
get lock which was held by md_attr_store already.

To solve the deadlock problem, we move "->resync_finish()"
from md_do_sync to md_reap_sync_thread (after md_update_sb),
also MD_HELD_RESYNC_LOCK is introduced since it is possible
that node can't get resync lock in md_do_sync.

Then we do not need to wait for MD_CHANGE_PENDING is cleared
or not since metadata should be updated after md_update_sb,
so just call resync_finish if MD_HELD_RESYNC_LOCK is set.

We also unified the code after skip label, since set PENDING
for non-clustered case should be harmless.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 1501efad 13-Jan-2016 Dan Williams <dan.j.williams@intel.com>

md/raid: only permit hot-add of compatible integrity profiles

It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue. Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.

The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a671 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.

This policy of disallowing changes the integrity profile once one has
been established is shared with DM.

Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277]

[ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[ 59.078302] md: data integrity enabled on md0
[..]
[ 90.489209] md0: incompatible integrity profile for pmem1m
[..]
[ 205.671277] md: super_written gets error=-5
[ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[ 205.677386] md/raid1:md0: Operation continuing on 1 devices.
[ 205.683037] RAID1 conf printout:
[ 205.684699] --- wd:1 rd:2
[ 205.685972] disk 0, wo:0, o:1, dev:pmem0s
[ 205.687562] disk 1, wo:1, o:1, dev:pmem1s
[ 205.691717] md: recovery of RAID array md0

Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: <stable@vger.kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# fc974ee2 24-Dec-2015 Vishal Verma <vishal.l.verma@intel.com>

md: convert to use the generic badblocks code

Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 274d8cbd 03-Jan-2016 NeilBrown <neilb@suse.com>

md: Remove 'ready' field from mddev.

This field is always set in tandem with ->pers, and when it is tested
->pers is also tested. So ->ready is not needed.

It was needed once, but code rearrangement and locking changes have
removed that needed.

Signed-off-by: NeilBrown <neilb@suse.com>


# 9ebc6ef1 20-Dec-2015 Deepa Dinamani <deepa.kernel@gmail.com>

drivers: md: use ktime_get_real_seconds()

get_seconds() API is not y2038 safe on 32 bit systems and the API
is deprecated. Replace it with calls to ktime_get_real_seconds()
API instead. Change mddev structure types to time64_t accordingly.

32 bit signed timestamps will overflow in the year 2038.

Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.

Clamp the tim64_t timestamps to positive values with a max of U32_MAX
when returning from GET_ARRAY_INFO ioctl to accommodate above changes
in the data type of timestamps to unsigned int.

v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.

Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types. Remove the truncation of the bits while
writing to or reading from these from the disk.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>


# 15858fa5 20-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: Defer MD reloading to mddev->thread

Reloading of superblock must be performed under reconfig_mutex. However,
this cannot be done with md_reload_sb because it would deadlock with
the message DLM lock. So, we defer it in md_check_recovery() which is
executed by mddev->thread.

This introduces a new flag, MD_RELOAD_SB, which if set, will reload the
superblock. And good_device_nr is also added to 'struct mddev' which is
used to get the num of the good device within cluster raid.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 659b254f 20-Dec-2015 Guoqing Jiang <gqjiang@suse.com>

md-cluster: remove a disk asynchronously from cluster environment

For cluster raid, if one disk couldn't be reach in one node, then
other nodes would receive the REMOVE message for the disk.

In receiving node, we can't call md_kick_rdev_from_array to remove
the disk from array synchronously since the disk might still be busy
in this node. So let's set a ClusterRemove flag on the disk, then
let the thread to do the removal job eventually.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 9b15603d 17-Dec-2015 Shaohua Li <shli@fb.com>

MD: change journal disk role to disk 0

Neil pointed out setting journal disk role to raid_disks will confuse
reshape if we support reshape eventually. Switching the role to 0 (we
should be fine as long as the value >=0) and skip sysfs file creation to
avoid error.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# a97b7896 08-Oct-2015 Song Liu <songliubraving@fb.com>

MD: add new bit to indicate raid array with journal

If a raid array has journal feature bit set, add a new bit to indicate
this. If the array is started without journal disk existing, we know
there is something wrong.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# bd18f646 02-Sep-2015 Shaohua Li <shli@fb.com>

md: skip resync for raid array with journal

If a raid array has journal, the journal can guarantee the consistency,
we can skip resync after a unclean shutdown. The exception is raid
creation or user initiated resync, which we still do a raid resync.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 3069aa8d 13-Aug-2015 Shaohua Li <shli@fb.com>

md: override md superblock recovery_offset for journal device

Journal device stores data in a log structure. We need record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# bac624f3 13-Aug-2015 Song Liu <songliubraving@fb.com>

MD: add a new disk role to present write journal device

Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compability.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>


# 70bcecdb 21-Aug-2015 Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Improve md_reload_sb to be less error prone

md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.

Instead, read the superblock of one of the "good" rdevs and update
the necessary information:

- read the superblock into a newly allocated page, by temporarily
swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes
1. iterates over list of active devices and checks the matching
dev_roles[] value.
If that is 'faulty', the device must be marked as faulty
- call md_error to mark the device as faulty. Make sure
not to set CHANGE_DEVS and wakeup mddev->thread or else
it would initiate a resync process, which is the responsibility
of the "primary" node.
- clear the Blocked bit
- Call remove_and_add_spares() to hot remove the device.
If the device is 'spare':
- call remove_and_add_spares() to get the number of spares
added in this operation.
- Reduce mddev->degraded to mark the array as not degraded.
2. reset recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
is equal to MaxSector, call spare_active() to set it In_sync

This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 8ae12666 28-Apr-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: kill merge_bvec_fn() completely

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 66114cad 22-May-2015 Tejun Heo <tj@kernel.org>

writeback: separate out include/linux/backing-dev-defs.h

With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 09314799 18-Feb-2015 NeilBrown <neilb@suse.de>

md: remove 'go_faster' option from ->sync_request()

This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.

Signed-off-by: NeilBrown <neilb@suse.de>


# 57d051dc 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md: Export and rename find_rdev_nr_rcu

This is required by the clustering module (patches to follow) to
find the device to remove or re-add.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# fb56dfef 14-Apr-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

md: Export and rename kick_rdev_from_array

This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 1aee41f6 29-Oct-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Add new disk to clustered array

Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
as SpareLocal
9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# 1d7e3e96 07-Jun-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Reload superblock if METADATA_UPDATED is received

Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using
events recorded in the devices. If it is old as compared to the mddev
mark it is faulty.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# cf921cc1 29-Mar-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Add node recovery callbacks

DLM offers callbacks when a node fails and the lock remastery
is performed:

1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot

recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# c4ce867f 29-Mar-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce md_cluster_info

md_cluster_info stores the cluster information in the MD device.

The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)

The leave() clears up the lockspace and all the locks held.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# edb39c9d 29-Mar-2014 Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce md_cluster_operations to handle cluster functions

This allows dynamic registering of cluster hooks.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>


# dfe15ac1 26-Jul-2012 Hannes Reinecke <hare@suse.de>

md: wakeup thread upon rdev_dec_pending()

After each call to rdev_dec_pending() we should wakeup the
md thread if the device is found to be faulty.
Otherwise we'll incur heavy delays on failing devices.

Signed-off-by: Neil Brown <nfbrown@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>


# 5c47daf6 14-Dec-2014 NeilBrown <neilb@suse.de>

md: move mddev_lock and related to md.h

The one which is not inline (mddev_unlock) gets EXPORTed.

This makes the locking available to personality modules so that it
doesn't have to be imposed upon them.

Signed-off-by: NeilBrown <neilb@suse.de>


# 23da422b 14-Dec-2014 NeilBrown <neilb@suse.de>

md: use mddev->lock to protect updates to resync_{min,max}.

There are interdependencies between these two sysfs attributes
and whether a resync is currently running.

Rather than depending on reconfig_mutex to ensure no races when
testing these interdependencies are met, use the spinlock.
This will allow the mutex to be remove from protecting this
code in a subsequent patch.

Signed-off-by: NeilBrown <neilb@suse.de>


# 4af1a041 14-Dec-2014 NeilBrown <neilb@suse.de>

md: move GET_BITMAP_FILE ioctl out from mddev_lock.

It makes more sense to report bitmap_info->file, rather than
bitmap->file (the later is only available once the array is
active).

With that change, use mddev->lock to protect bitmap_info being
set to NULL, and we can call get_bitmap_file() without taking
the mutex.

Signed-off-by: NeilBrown <neilb@suse.de>


# 978a7a47 14-Dec-2014 NeilBrown <neilb@suse.de>

md/bitmap: protect clearing of ->bitmap by mddev->lock

This makes it safe to inspect the struct while holding only
the spinlock.

Signed-off-by: NeilBrown <neilb@suse.de>


# 36d091f4 14-Dec-2014 NeilBrown <neilb@suse.de>

md: protect ->pers changes with mddev->lock

->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.

However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight: level_show and md_seq_show().

So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.

[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>


# afa0f557 14-Dec-2014 NeilBrown <neilb@suse.de>

md: rename ->stop to ->free

Now that the ->stop function only frees the private data,
rename is accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().

Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.

Signed-off-by: NeilBrown <neilb@suse.de>


# 64590f45 14-Dec-2014 NeilBrown <neilb@suse.de>

md: make merge_bvec_fn more robust in face of personality changes.

There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.

Signed-off-by: NeilBrown <neilb@suse.de>


# 5c675f83 14-Dec-2014 NeilBrown <neilb@suse.de>

md: make ->congested robust against personality changes.

There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>


# 85572d7c 14-Dec-2014 NeilBrown <neilb@suse.de>

md: rename mddev->write_lock to mddev->lock

This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further. So the
name is inappropriate.

Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taking from interrupt or bh context.

So:
-rename write_lock to lock
-document what it protects
-remove _irq ... except in md_flush_request() as there
is no wait_event_lock() (with no _irq). This can be
cleaned up after appropriate changes to wait.h.

Signed-off-by: NeilBrown <neilb@suse.de>


# f72ffdd6 29-Sep-2014 NeilBrown <neilb@suse.de>

md: remove unwanted white space from md.c

My editor shows much of this is RED.

Signed-off-by: NeilBrown <neilb@suse.de>


# 035328c2 08-Apr-2014 NeilBrown <neilb@suse.de>

md/bitmap: don't abuse i_writecount for bitmap files.

md bitmap code currently tries to use i_writecount to stop any other
process from writing to out bitmap file. But that is really an abuse
and has bit-rotted so locking is all wrong.

So discard that - root should be allowed to shoot self in foot.

Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different array, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write. Support for ->bmap is already checked
elsewhere).

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>


# 8313b8e5 11-Dec-2013 NeilBrown <neilb@suse.de>

md: fix problem when adding device to read-only array with bitmap.

If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.

If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.

However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update. We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.

This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag. If that is set,
then the device will not be added to a read-only array.

Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fafdbef0800e73942482bb075c21578b
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>


# 324a56e1 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above

Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.

This patch is strictly rename only and doesn't introduce any
functional difference.

- mic / gpio renames were missing. Spotted by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6678d83f 07-Aug-2013 Kent Overstreet <kmo@daterainc.com>

block: Consolidate duplicated bio_trim() implementations

Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 388975cc 11-Sep-2013 Tejun Heo <tj@kernel.org>

sysfs: clean up sysfs_get_dirent()

The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention. For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.

This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns(). This makes confusions a lot less
likely.

There are other interfaces which take @ns before @name. They'll be
updated by following patches.

This patch doesn't introduce any functional changes.

v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
error on module builds. Reported by build test robot. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 260fa034 27-Aug-2013 NeilBrown <neilb@suse.de>

md: avoid deadlock when dirty buffers during md_stop.

When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.

However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.

So do_md_stop() calls sync_blockdev(). However at this point it is
holding ->reconfig_mutex. So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex. So we deadlock.

We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.

So introduce new flag MD_STILL_CLOSED. Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets set before we lock against further
opens.

It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl. Just don't do that.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7a0a5355 27-Aug-2013 NeilBrown <neilb@suse.de>

md: Don't test all of mddev->flags at once.

mddev->flags is mostly used to record if an update of the
metadata is needed. Sometimes the whole field is tested
instead of just the important bits. This makes it difficult
to introduce more state bits.

So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.

Signed-off-by: NeilBrown <neilb@suse.de>


# c4a39551 25-Jun-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Remember the last sync operation that was performed

MD: Remember the last sync operation that was performed

This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired. etc.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a91d5ac0 23-Apr-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Export 'md_reap_sync_thread' function

MD: Export 'md_reap_sync_thread' function

Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
- rename reap_sync_thread to md_reap_sync_thread
- move the fn after md_check_recovery to match md.h declaration placement
- export md_reap_sync_thread

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 90584fc9 07-Mar-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Prevent sysfs operations on uninitialized kobjects

MD: Prevent sysfs operations on uninitialized kobjects

Device-mapper does not use sysfs; but when device-mapper is leveraging
MD's RAID personalities, MD sometimes attempts to update sysfs. This
patch adds checks for 'mddev-kobj.sd' in sysfs_[un]link_rdev to ensure
it is about to operate on something valid. This patch also checks for
'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
Although 'sysfs_notify' already makes this check, doing so in
'remove_and_add_spares' prevents an additional mutex operation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0a19caab 19-Nov-2012 majianpeng <majianpeng@gmail.com>

md: Use ->curr_resync as last completed request when cleanly aborting resync.

If a resync is aborted cleanly, ->curr_resync is a reliable
record of where we got up to.
If there was an error it is less reliable but we always know that
->curr_resync_completed is safe.

So add a flag MD_RECOVERY_ERROR to differentiate between these cases
and set recovery_cp accordingly.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# eed8c02e 30-Nov-2012 Lukas Czerner <lczerner@redhat.com>

wait: add wait_event_lock_irq() interface

New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.

The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.

All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.

Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7f7583d4 10-Oct-2012 Jianpeng Ma <majianpeng@gmail.com>

Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races

Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 4ed8731d 10-Oct-2012 Shaohua Li <shli@kernel.org>

MD: change the parameter of md thread

Change the thread parameter, so the thread can carry extra info. Next patch
will use it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 74018dc3 31-Jul-2012 NeilBrown <neilb@suse.de>

blk: pass from_schedule to non-request unplug functions.

This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9cbb1750 31-Jul-2012 NeilBrown <neilb@suse.de>

blk: centralize non-request unplug handling.

Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0021b7bc 31-Jul-2012 NeilBrown <neilb@suse.de>

md: remove plug_cnt feature of plugging.

This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.

So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.

This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.

Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6409bb05 21-May-2012 NeilBrown <neilb@suse.de>

md/bitmap: add new 'space' attribute for bitmaps.

If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.

So create an attribute to store this information and set appropriate
defaults.

It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.

Signed-off-by: NeilBrown <neilb@suse.de>


# 545c8795 21-May-2012 NeilBrown <neilb@suse.de>

md: dm-raid should call helper function to clear rdev.

dm-raid currently open-codes the freeing of some members of
and rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.

So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c

Signed-off-by: NeilBrown <neilb@suse.de>


# c6563a8c 20-May-2012 NeilBrown <neilb@suse.de>

md: add possibility to change data-offset for devices.

When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previous check that all 'pad' fields were zero,
we need a new FEATURE flag for this.
A (belatedly) check that all remaining 'pad' fields are
zero to avoid a repeat of this)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be. At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors. So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2c810cdd 20-May-2012 NeilBrown <neilb@suse.de>

md: allow a reshape operation to be reversed.

Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.

To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.

However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.

So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.

This will become more important when we allow the data_offset to
change in a reshape. Then the explicit statement of what direction is
being used will be more useful.

This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.

Signed-off-by: NeilBrown <neilb@suse.de>


# 050b6615 18-Mar-2012 NeilBrown <neilb@suse.de>

md/raid10: handle merge_bvec_fn in member devices.

Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem. There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>


# dafb20fa 18-Mar-2012 NeilBrown <neilb@suse.de>

md: tidy up rdev_for_each usage.

md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.

So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2d78f8c4 22-Dec-2011 NeilBrown <neilb@suse.de>

md: create externally visible flags for supporting hot-replace.

hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.

With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition. It can also be use when the original
device is still reliable but it not wanted for some reason.

This will eventually be supported in RAID4/5/6 and RAID10.

This patch adds a super-block flag to distinguish the replacement
device. If an old kernel sees this flag it will reject the device.

It also adds two per-device flags which are viewable and settable via
sysfs.
"want_replacement" can be set to request that a device be replaced.
"replacement" is set to show that this device is replacing another
device.

The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement. We currently don't make links for the
replacement - there doesn't seem to be a need.

Signed-off-by: NeilBrown <neilb@suse.de>


# b8321b68 22-Dec-2011 NeilBrown <neilb@suse.de>

md: change hot_remove_disk to take an rdev rather than a number.

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 84fc4b56 10-Oct-2011 NeilBrown <neilb@suse.de>

md: rename "mdk_personality" to "md_personality"

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b8bf345 10-Oct-2011 NeilBrown <neilb@suse.de>

md: remove typedefs: mdk_thread_t -> struct md_thread

Signed-off-by: NeilBrown <neilb@suse.de>


# fd01b88c 10-Oct-2011 NeilBrown <neilb@suse.de>

md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>


# 3cb03002 10-Oct-2011 NeilBrown <neilb@suse.de>

md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7e841526 20-Sep-2011 Wang Sheng-Hui <shhuiw@gmail.com>

trival: md_k.h should be md.h in the beginning comment of file md.h

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 01f96c0a 20-Sep-2011 NeilBrown <neilb@suse.de>

md: Avoid waking up a thread after it has been freed.

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
disappeared either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org


# 5a7bbad2 11-Sep-2011 Christoph Hellwig <hch@infradead.org>

block: remove support for bio remapping from ->make_request

There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# de393cde 27-Jul-2011 NeilBrown <neilb@suse.de>

md: make it easier to wait for bad blocks to be acknowledged.

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlock'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested it.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>


# d7a9d443 27-Jul-2011 NeilBrown <neilb@suse.de>

md: add 'write_error' flag to component devices.

If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>


# d2eb35ac 27-Jul-2011 NeilBrown <neilb@suse.de>

md/raid1: avoid reading from known bad blocks.

Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2699b672 27-Jul-2011 NeilBrown <neilb@suse.de>

md: load/store badblock list from v1.x metadata

Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.

We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.

If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2230dfe4 27-Jul-2011 NeilBrown <neilb@suse.de>

md: beginnings of bad block management.

This the first step in allowing md to track bad-blocks per-device so
that we can fail individual blocks rather than the whole device.

This patch just adds a data structure for recording bad blocks, with
routines to add, remove, search the list.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>


# 3520fa4d 26-Jul-2011 Jonathan Brassow <jbrassow@redhat.com>

MD bitmap: Revert DM dirty log hooks


Revert most of commit e384e58549a2e9a83071ad80280c1a9053cfd84c
md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.

MD should not need to use DM's dirty log - we decided to use md's
bitmaps instead.

Keeping the DIV_ROUND_UP clean-ups that were part of commit
e384e58549a2e9a83071ad80280c1a9053cfd84c, however.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 5389042f 26-Jul-2011 NeilBrown <neilb@suse.de>

md: change managed of recovery_disabled.

If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk this
a read error is better than not having an array at all.

Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1. For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.

So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality want to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.

This will allow RAID10 to get the control that it needs.

Signed-off-by: NeilBrown <neilb@suse.de>


# 36fad858 26-Jul-2011 Namhyung Kim <namhyung@gmail.com>

md: introduce link/unlink_rdev() helpers

There are places where sysfs links to rdev are handled
in a same way. Add the helper functions to consolidate
them.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 9c81075f 08-Jun-2011 Jonathan Brassow <jbrassow@redhat.com>

MD: support initial bitmap creation in-kernel

Add bitmap support to the device-mapper specific metadata area.

This patch allows the creation of the bitmap metadata area upon
initial array creation via device-mapper.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 076f968b 07-Jun-2011 Jonathan Brassow <jbrassow@redhat.com>

MD: add sync_super to mddev_t struct

Add the 'sync_super' function pointer to MD array structure (struct mddev_s)

If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method with be looked up via 'super_types' as usual.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 97658cdd 18-Apr-2011 NeilBrown <neilb@suse.de>

md: provide generic support for handling unplug callbacks.

When an md device adds a request to a queue, it can call
mddev_check_plugged.
If this succeeds then we know that the md thread will be woken up
shortly, and ->plug_cnt will be non-zero until then, so some
processing can be delayed.

If it fails, then no unplug callback is expected and the make_request
function needs to do whatever is required to make the request happen.

Signed-off-by: NeilBrown <neilb@suse.de>


# 482c0834 18-Apr-2011 NeilBrown <neilb@suse.de>

md - remove old plugging code.

md has some plugging infrastructure for RAID5 to use because the
normal plugging infrastructure required a 'request_queue', and when
called from dm, RAID5 doesn't have one of those available.

This relied on the ->unplug_fn callback which doesn't exist any more.

So remove all of that code, both in md and raid5. Subsequent patches
with restore the plugging functionality.

Signed-off-by: NeilBrown <neilb@suse.de>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# f0b4f7e2 23-Feb-2011 NeilBrown <neilb@suse.de>

md: Fix - again - partition detection when array becomes active

Revert
b821eaa572fd737faaf6928ba046e571526c36c6
and
f3b99be19ded511a1bf05a148276239d9f13eefa

When I wrote the first of these I had a wrong idea about the
lifetime of 'struct block_device'. It can disappear at any time that
the block device is not open if it falls out of the inode cache.

So relying on the 'size' recorded with it to detect when the
device size has changed and so we need to revalidate, is wrong.

Rather, we really do need the 'changed' attribute stored directly in
the mddev and set/tested as appropriate.

Without this patch, a sequence of:
mknod / open / close / unlink

(which can cause a block_device to be created and then destroyed)
will result in a rescan of the partition table and consequence removal
and addition of partitions.
Several of these in a row can get udev racing to create and unlink and
other code can get confused.

With the patch, the rescan is only performed when needed and so there
are no races.

This is suitable for any stable kernel from 2.6.35.

Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# f21e9ff7 30-Jan-2011 NeilBrown <neilb@suse.de>

md: Remove the AllReserved flag for component devices.

This flag is not needed and is used badly.

Devices that are included in a native-metadata array are reserved
exclusively for that array - and currently have AllReserved set.
They all are bd_claimed for the rdev and so cannot be shared.

Devices that are included in external-metadata arrays can be shared
among multiple arrays - providing there is no overlap.
These are bd_claimed for md in general - not for a particular rdev.

When changing the amount of a device that is used in an array we need
to check for overlap. This currently includes a check on AllReserved
So even without overlap, sharing with an AllReserved device is not
allowed.
However the bd_claim usage already precludes sharing with these
devices, so the test on AllReserved is not needed. And in fact it is
wrong.

As this is the only use of AllReserved, simply remove all usage and
definition of AllReserved.

Signed-off-by: NeilBrown <neilb@suse.de>


# a6ff7e08 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com>

md: separate meta and data devs

Allow the metadata to be on a separate device from the
data.

This doesn't mean the data and metadata will by on separate
physical devices - it simply gives device-mapper and userspace
tools more flexibility.

Signed-off-by: NeilBrown <neilb@suse.de>


# ccebd4c4 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com>

md-new-param-to_sync_page_io

Add new parameter to 'sync_page_io'.

The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>


# 0ca69886 13-Jan-2011 NeilBrown <neilb@suse.de>

md: Ensure no IO request to get md device before it is properly initialised.

When an md device is in the process of coming on line it is possible
for an IO request (typically a partition table probe) to get through
before the array is fully initialised, which can cause unexpected
behaviour (e.g. a crash).

So explicitly record when the array is ready for IO and don't allow IO
through until then.

There is no possibility for a similar problem when the array is going
off-line as there must only be one 'open' at that time, and it is busy
off-lining the array and so cannot send IO requests. So no memory
barrier is needed in md_stop()

This has been a bug since commit 409c57f3801 in 2.6.30 which
introduced md_make_request. Before then, each personality would
register its own make_request_fn when it was ready.
This is suitable for any stable kernel from 2.6.30.y onwards.

Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>


# a167f663 26-Oct-2010 NeilBrown <neilb@suse.de>

md: use separate bio pool for each md device.

bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.

So allocate a local bio pool for each md array and use that rather
than the common pool.

This pool is used both for regular IO and metadata updates.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2b193363 26-Oct-2010 NeilBrown <neilb@suse.de>

md: change type of first arg to sync_page_io.

Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.

Signed-off-by: NeilBrown <neilb@suse.de>


# e9c7469b 03-Sep-2010 Tejun Heo <tj@kernel.org>

md: implment REQ_FLUSH/FUA support

This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.

* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe resconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 070dc6dd 30-Aug-2010 NeilBrown <neilb@suse.de>

md: resolve confusion of MD_CHANGE_CLEAN

MD_CHANGE_CLEAN is used for two different purposes and this leads to
confusion.
One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
not used for anything else, so have MD_CHANGE_PENDING take over that
purpose fully.

The two purposes are:
1/ tell md_update_sb that an update is needed and that it is just a
clean/dirty transition.
2/ tell user-space that an transition from clean to dirty is pending
(something wants to write), and tell te kernel (by clearin the
flag) that the transition is OK.

The first purpose remains wit MD_CHANGE_CLEAN, the second is moved
fully to MD_CHANGE_PENDING.

This means that various places which conditionally set or cleared
MD_CHANGE_CLEAN no longer need to be conditional.

Signed-off-by: NeilBrown <neilb@suse.de>


# bb4f1e9d 08-Aug-2010 NeilBrown <neilb@suse.de>

md: fix another deadlock with removing sysfs attributes.

Move the deletion of sysfs attributes from reconfig_mutex to
open_mutex didn't really help as a process can try to take
open_mutex while holding reconfig_mutex, so the same deadlock can
happen, just requiring one more process to be involved in the chain.

I looks like I cannot easily use locking to wait for the sysfs
deletion to complete, so don't.

The only things that we cannot do while the deletions are still
pending is other things which can change the sysfs namespace: run,
takeover, stop. Each of these can fail with -EBUSY.
So set a flag while doing a sysfs deletion, and fail run, takeover,
stop if that flag is set.

This is suitable for 2.6.35.x

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 7b6d91da 07-Aug-2010 Christoph Hellwig <hch@lst.de>

block: unify flags for struct bio and struct request

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# e384e585 01-Jun-2010 NeilBrown <neilb@suse.de>

md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.

This allows md/raid5 to fully work as a dm target.

Normally md uses a 'filemap' which contains a list of pages of bits
each of which may be written separately.
dm-log uses and all-or-nothing approach to writing the log, so
when using a dm-log, ->filemap is NULL and the flags normally stored
in filemap_attr are stored in ->logattrs instead.



Signed-off-by: NeilBrown <neilb@suse.de>


# b63d7c2e 01-Jun-2010 NeilBrown <neilb@suse.de>

md/bitmap: clean up plugging calls.

1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
arrays with no queue attached.

2/ Don't bother plugging the queue when we set a bit in the bitmap.
The reason for this was to encourage as many bits as possible to
get set before we unplug and write stuff out.
However every personality already plugs the queue after
bitmap_startwrite either directly (raid1/raid10) or be setting
STRIPE_BIT_DELAY which causes the queue to be plugged later
(raid5).

Signed-off-by: NeilBrown <neilb@suse.de>


# ac2f40be 01-Jun-2010 NeilBrown <neilb@suse.de>

md/bitmap: white space clean up and similar.

Fixes some whitespace problems
Fixed some checkpatch.pl complaints.
Replaced kmalloc ... memset(0), with kzalloc
Fixed an unlikely memory leak on an error path.
Reformatted a number of 'if/else' sets, sometimes
replacing goto with an else clause.
Removed some old comments and commented-out code.

Signed-off-by: NeilBrown <neilb@suse.de>


# 252ac522 01-Jun-2010 NeilBrown <neilb@suse.de>

md/plug: optionally use plugger to unplug an array during resync/recovery.

If an array doesn't have a 'queue' then md_do_sync cannot
unplug it.
In that case it will have a 'plugger', so make that available
to the mddev, and use it to unplug the array if needed.

Signed-off-by: NeilBrown <neilb@suse.de>


# 2ac87401 01-Jun-2010 NeilBrown <neilb@suse.de>

md/raid5: add simple plugging infrastructure.

md/raid5 uses the plugging infrastructure provided by the block layer
and 'struct request_queue'. However when we plug raid5 under dm there
is no request queue so we cannot use that.

So create a similar infrastructure that is much lighter weight and use
it for raid5.

Signed-off-by: NeilBrown <neilb@suse.de>


# 768a418d 25-Jul-2010 NeilBrown <neilb@suse.de>

md: add support for raising dm events.

dm uses scheduled work to raise events to user-space.
So allow md device to have work_structs and schedule them on an error.

Signed-off-by: NeilBrown <neilb@suse.de>


# 390ee602 01-Jun-2010 NeilBrown <neilb@suse.de>

md: export various start/stop interfaces

export entry points for starting and stopping md arrays.
This will be used by a module to make md/raid5 work under
dm.
Also stop calling md_stop_writes from md_stop, as that won't
work well with dm - it will want to call the two separately.

Signed-off-by: NeilBrown <neilb@suse.de>


# e8bb9a83 01-Jun-2010 NeilBrown <neilb@suse.de>

md: split out md_rdev_init

This functionality will be needed separately in a subsequent patch, so
split it into it's own exported function.

Signed-off-by: NeilBrown <neilb@suse.de>


# 00bcb4ac 01-Jun-2010 NeilBrown <neilb@suse.de>

md: reduce dependence on sysfs.

We will want md devices to live as dm targets where sysfs is not
visible. So allow md to not connect to sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>


# e93f68a1 15-Jun-2010 NeilBrown <neilb@suse.de>

md: fix handling of array level takeover that re-arranges devices.

Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over. However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared to high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspend.
It can also update some symlinks in sysfs which previously were not be
updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# a8707c08 17-May-2010 NeilBrown <neilb@suse.de>

md: simplify updating of event count to sometimes avoid updating spares.

When updating the event count for a simple clean <-> dirty transition,
we try to avoid updating the spares so they can safely spin-down.
As the event_counts across an array must be +/- 1, this means
decrementing the event_count on a dirty->clean transition.
This is not always safe and we have to avoid the unsafe time.
We current do this with a misguided idea about it being safe or
not depending on whether the event_count is odd or even. This
approach only works reliably in a few common instances, but easily
falls down.

So instead, simply keep internal state concerning whether it is safe
or not, and always assume it is not safe when an array is first
assembled.

Signed-off-by: NeilBrown <neilb@suse.de>


# 21a52c6d 31-Mar-2010 NeilBrown <neilb@suse.de>

md: pass mddev to make_request functions rather than request_queue

We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>


# b821eaa5 28-Mar-2010 NeilBrown <neilb@suse.de>

md: remove ->changed and related code.

We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.

Now that we call revalidate_disk directly, which sets bd_invalidates,
that indirection is no longer needed and can be removed.

Signed-off-by: NeilBrown <neilb@suse.de>


# c0cc75f8 21-Mar-2010 NeilBrown <neilb@suse.de>

md: discard StateChanged device flag.

This was needed when sysfs files could only be 'notified'
from process context. Now that we have sys_notify_direct,
we can call it directly from an interrupt.

Signed-off-by: NeilBrown <neilb@suse.de>


# ee8b81b0 07-Mar-2010 NeilBrown <neilb@suse.de>

md: remove some dead fields from mddev_s

These fields have never been used.
commit 4b6d287f627b5fb6a49f78f9e81649ff98c62bb7
added them, but also added identical files to bitmap_super_s,
and only used the latter.

So remove these unused fields.

Signed-off-by: NeilBrown <neilb@suse.de>


# a64c876f 14-Apr-2010 NeilBrown <neilb@suse.de>

md: manage redundancy group in sysfs when changing level.

Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.

This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.

When changing levels, we may need to add or remove attributes. When
changing RAID5 -> RAID6, we both add and remove the same thing. It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


# 1e50915f 13-Dec-2009 Robert Becker <Rob.Becker@riverbed.com>

raid: improve MD/raid10 handling of correctable read errors.

We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts. Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# ece5cff0 13-Dec-2009 NeilBrown <neilb@suse.de>

md: Support write-intent bitmaps with externally managed metadata.

In this case, the metadata needs to not be in the same
sector as the bitmap.
md will not read/write any bitmap metadata. Config must be
done via sysfs and when a recovery makes the array non-degraded
again, writing 'true' to 'bitmap/can_clear' will allow bits in
the bitmap to be cleared again.

Signed-off-by: NeilBrown <neilb@suse.de>


# 43a70507 13-Dec-2009 NeilBrown <neilb@suse.de>

md: support updating bitmap parameters via sysfs.

A new attribute directory 'bitmap' in 'md' is created which
contains files for configuring the bitmap.
'location' identifies where the bitmap is, either 'none',
or 'file' or 'sector offset from metadata'.
Writing 'location' can create or remove a bitmap.
Adding a 'file' bitmap this way is not yet supported.
'chunksize' and 'time_base' must be set before 'location'
can be set.

'chunksize' can be set before creating a bitmap, but is
currently always over-ridden by the bitmap superblock.

'time_base' and 'backlog' can be updated at any time.


Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>


# 72e02075 13-Dec-2009 NeilBrown <neilb@suse.de>

md: factor out parsing of fixed-point numbers

safe_delay_store can parse fixed point numbers (for fractions
of a second). We will want to do that for another sysfs
file soon, so factor out the code.

Signed-off-by: NeilBrown <neilb@suse.de>


# f6af949c 13-Dec-2009 NeilBrown <neilb@suse.de>

md: support bitmap offset appropriate for external-metadata arrays.

For md arrays were metadata is managed externally, the kernel does not
know about a superblock so the superblock offset is 0.
If we want to have a write-intent-bitmap near the end of the
devices of such an array, we should support sector_t sized offset.
We need offset be possibly negative for when the bitmap is before
the metadata, so use loff_t instead.

Also add sanity check that bitmap does not overlap with data.

Signed-off-by: NeilBrown <neilb@suse.de>


# 42a04b50 13-Dec-2009 NeilBrown <neilb@suse.de>

md: move offset, daemon_sleep and chunksize out of bitmap structure

... and into bitmap_info. These are all configuration parameters
that need to be set before the bitmap is created.

Signed-off-by: NeilBrown <neilb@suse.de>


# c3d9714e 13-Dec-2009 NeilBrown <neilb@suse.de>

md: collect bitmap-specific fields into one structure.

In preparation for making bitmap fields configurable via sysfs,
start tidying up by making a single structure to contain the
configuration fields.

Signed-off-by: NeilBrown <neilb@suse.de>


# a2826aa9 13-Dec-2009 NeilBrown <neilb@suse.de>

md: support barrier requests on all personalities.

Previously barriers were only supported on RAID1. This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>


# aa5cbd10 13-Dec-2009 NeilBrown <neilb@suse.de>

md/bitmap: protect against bitmap removal while being updated.

A write intent bitmap can be removed from an array while the
array is active.
When this happens, all IO is suspended and flushed before the
bitmap is removed.
However it is possible that bitmap_daemon_work is still running to
clear old bits from the bitmap. If it is, it can dereference the
bitmap after it has been freed.

So introduce a new mutex to protect bitmap_daemon_work and get it
before destroying a bitmap.

This is suitable for any current -stable kernel.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org


# 3fa841d7 23-Sep-2009 NeilBrown <neilb@suse.de>

md: report device as congested when suspended

This should writeback from coming when the device is temporarily
suspended.

Signed-off-by: NeilBrown <neilb@suse.de>


# 411c9403 07-Jul-2009 Anand Gadiyar <gadiyar@ti.com>

trivial: fix typo "for for" in multiple files

trivial: fix typo "for for" in multiple files

Signed-off-by: Anand Gadiyar <gadiyar@ti.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# c8c00a69 09-Aug-2009 NeilBrown <neilb@suse.de>

Remove deadlock potential in md_open

A recent commit:
commit 449aad3e25358812c43afc60918c5ad3819488e7

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
takes bd_mutex in the call the revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
commit d63a5a74dee87883fda6b7d170244acaac5b05e8
do avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL. So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# ac5e7113 02-Aug-2009 Andre Noll <maan@systemlinux.org>

md: Push down data integrity code to personalities.

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0894cc30 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Move check for bitmap presence to personality code.

If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.

Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.

This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.

A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 8190e754 17-Jun-2009 NeilBrown <neilb@suse.de>

md: remove chunksize rounding from common code.

It is easiest to round sizes to multiples of chunk size in
the personality code for those personalities which care.
Those personalities now do the rounding, so we can
remove that function from common code.

Also remove the upper bound on the size of a chunk, and the lower
bound on the size of a device (1 chunk), neither of which really buy
us anything.

Signed-off-by: NeilBrown <neilb@suse.de>


# 50ac168a 17-Jun-2009 NeilBrown <neilb@suse.de>

md: merge reconfig and check_reshape methods.

The difference between these two methods is artificial.
Both check that a pending reshape is valid, and perform any
aspect of it that can be done immediately.
'reconfig' handles chunk size and layout.
'check_reshape' handles raid_disks.

So make them just one method.

Signed-off-by: NeilBrown <neilb@suse.de>


# 597a711b 17-Jun-2009 NeilBrown <neilb@suse.de>

md: remove unnecessary arguments from ->reconfig method.

Passing the new layout and chunksize as args is not necessary as
the mddev has fields for new_check and new_layout.

This is preparation for combining the check_reshape and reconfig
methods

Signed-off-by: NeilBrown <neilb@suse.de>


# 664e7c41 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Convert mddev->new_chunk to sectors.

A straight-forward conversion which gets rid of some
multiplications/divisions/shifts. The patch also introduces a couple
of new ones, most of which are due to conf->chunk_size still being
represented in bytes. This will be cleaned up in subsequent patches.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 9d8f0363 17-Jun-2009 Andre Noll <maan@systemlinux.org>

md: Make mddev->chunk_size sector-based.

This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics. Since

is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
= is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 63fe0817 13-Apr-2009 Christoph Hellwig <hch@lst.de>

md: tiny md.h cleanups

- update inclusion guard and make sure it covers the whole file
- remove superflous #ifdef CONFIG_BLOCK
- make sure all required headers are included so that new users aren't
required to include others before

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>


# cea9c228 30-Mar-2009 NeilBrown <neilb@suse.de>

md: add explicit method to signal the end of a reshape.

Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.

This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.

The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.

The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.

This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.

Signed-off-by: NeilBrown <neilb@suse.de>


# b522adcd 30-Mar-2009 Dan Williams <dan.j.williams@intel.com>

md: 'array_size' sysfs attribute

Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 1f403624 30-Mar-2009 Dan Williams <dan.j.williams@intel.com>

md: centralize ->array_sectors modifications

Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 80c3a6ce 17-Mar-2009 Dan Williams <dan.j.williams@intel.com>

md: add 'size' as a personality method

In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested. For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 245f46c2 30-Mar-2009 NeilBrown <neilb@suse.de>

md: add ->takeover method to support changing the personality managing an array

Implement this for RAID6 to be able to 'takeover' a RAID5 array. The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.

Signed-off-by: NeilBrown <neilb@suse.de>


# 409c57f3 30-Mar-2009 NeilBrown <neilb@suse.de>

md: enable suspend/resume of md devices.

To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>


# dd8ac336 30-Mar-2009 Andre Noll <maan@systemlinux.org>

md: Represent raid device size in sectors.

This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.

All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 58c0fed4 30-Mar-2009 Andre Noll <maan@systemlinux.org>

md: Make mddev->size sector-based.

This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>


# 97e4f42d 30-Mar-2009 NeilBrown <neilb@suse.de>

md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash

Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much to the 'sync' effort after an
unclean shutdown.

One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed. And we need to know what has completed to record
how much is recovered. So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.

When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date. So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number. But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.


Signed-off-by: NeilBrown <neilb@suse.de>


# 43b2e5d8 30-Mar-2009 NeilBrown <neilb@suse.de>

md: move md_k.h from include/linux/raid/ to drivers/md/

It really is nicer to keep related code together..

Signed-off-by: NeilBrown <neilb@suse.de>