History log of /linux-master/drivers/md/dm-raid.c
Revision Date Author Comments
# b25b8f4b 11-Mar-2024 Ming Lei <ming.lei@redhat.com>

dm raid: fix false positive for requeue needed during reshape

An empty flush doesn't have a payload, so it should never be looked at
when considering to possibly requeue a bio for the case when a reshape
is in progress.

Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Reported-by: Patrick Plenefisch <simonpatp@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# fa34e589 07-Feb-2024 Mike Snitzer <snitzer@kernel.org>

dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list

Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 95009ae9 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid: fix lockdep waring in "pers->hot_add_disk"

The lockdep assert is added by commit a448af25becf ("md/raid10: remove
rcu protection to access rdev from conf") in print_conf(). And I didn't
notice that dm-raid is calling "pers->hot_add_disk" without holding
'reconfig_mutex'.

"pers->hot_add_disk" read and write many fields that is protected by
'reconfig_mutex', and raid_resume() already grab the lock in other
contex. Hence fix this problem by protecting "pers->host_add_disk"
with the lock.

Fixes: 9092c02d9435 ("DM RAID: Add ability to restore transiently failed devices on resume")
Fixes: a448af25becf ("md/raid10: remove rcu protection to access rdev from conf")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-10-yukuai1@huaweicloud.com


# 41425f96 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape

For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence it can be cleared and reshape can continue by
sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com


# 5625ff8b 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid: add a new helper prepare_suspend() in md_personality

There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com


# cd32b27a 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

md/dm-raid: don't call md_reap_sync_thread() directly

Currently md_reap_sync_thread() is called from raid_message() directly
without holding 'reconfig_mutex', this is definitely unsafe because
md_reap_sync_thread() can change many fields that is protected by
'reconfig_mutex'.

However, hold 'reconfig_mutex' here is still problematic because this
will cause deadlock, for example, commit 130443d60b1b ("md: refactor
idle/frozen_sync_thread() to fix deadlock").

Fix this problem by using stop_sync_thread() to unregister sync_thread,
like md/raid did.

Fixes: be83651f0050 ("DM RAID: Add message/status support for changing sync action")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-7-yukuai1@huaweicloud.com


# 16c4770c 05-Mar-2024 Yu Kuai <yukuai3@huawei.com>

dm-raid: really frozen sync_thread during suspend

1) commit f52f5c71f3d4 ("md: fix stopping sync thread") remove
MD_RECOVERY_FROZEN from __md_stop_writes() and doesn't realize that
dm-raid relies on __md_stop_writes() to frozen sync_thread
indirectly. Fix this problem by adding MD_RECOVERY_FROZEN in
md_stop_writes(), and since stop_sync_thread() is only used for
dm-raid in this case, also move stop_sync_thread() to
md_stop_writes().
2) The flag MD_RECOVERY_FROZEN doesn't mean that sync thread is frozen,
it only prevent new sync_thread to start, and it can't stop the
running sync thread; In order to frozen sync_thread, after seting the
flag, stop_sync_thread() should be used.
3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, use
it as condition for md_stop_writes() in raid_postsuspend() doesn't
look correct. Consider that reentrant stop_sync_thread() do nothing,
always call md_stop_writes() in raid_postsuspend().
4) raid_message can set/clear the flag MD_RECOVERY_FROZEN at anytime,
and if MD_RECOVERY_FROZEN is cleared while the array is suspended,
new sync_thread can start unexpected. Fix this by disallow
raid_message() to change sync_thread status during suspend.

Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the
test shell/lvconvert-raid-reshape.sh start to hang in stop_sync_thread(),
and with previous fixes, the test won't hang there anymore, however, the
test will still fail and complain that ext4 is corrupted. And with this
patch, the test won't hang due to stop_sync_thread() or fail due to ext4
is corrupted anymore. However, there is still a deadlock related to
dm-raid456 that will be fixed in following patches.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/
Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com


# db29d79b 24-Nov-2023 Yu Kuai <yukuai3@huawei.com>

dm-raid: delay flushing event_work() after reconfig_mutex is released

After commit db5e653d7c9f ("md: delay choosing sync action to
md_start_sync()"), md_start_sync() will hold 'reconfig_mutex', however,
in order to make sure event_work is done, __md_stop() will flush
workqueue with reconfig_mutex grabbed, hence if sync_work is still
pending, deadlock will be triggered.

Fortunately, former pacthes to fix stopping sync_thread already make sure
all sync_work is done already, hence such deadlock is not possible
anymore. However, in order not to cause confusions for people by this
implicit dependency, delay flushing event_work to dm-raid where
'reconfig_mutex' is not held, and add some comments to emphasize that
the workqueue can't be flushed with 'reconfig_mutex'.

Fixes: db5e653d7c9f ("md: delay choosing sync action to md_start_sync()")
Depends-on: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 2b16a525 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md: rename __mddev_suspend/resume() back to mddev_suspend/resume()

Now that the old apis are removed, __mddev_suspend/resume() can be
renamed to their original names.

This is done by:

sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch]
sed -i "s/__mddev_resume/mddev_resume/g" *.[ch]

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com


# 4eb3327a 10-Oct-2023 Yu Kuai <yukuai3@huawei.com>

md/dm-raid: use new apis to suspend array

Convert to use new apis, the old apis will be removed eventually.

These are not hot path, so performance is not concerned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-7-yukuai1@huaweicloud.com


# d58eff83 24-Aug-2023 Yu Kuai <yukuai3@huawei.com>

md: initialize 'active_io' while allocating mddev

'active_io' is used for mddev_suspend() and it's initialized in
md_run(), this restrict that 'reconfig_mutex' must be held and
"mddev->pers" must be set before calling mddev_suspend().

Initialize 'active_io' early so that mddev_suspend() is safe to call
once mddev is allocated, this will be helpful to refactor
mddev_suspend() in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com


# e3260d90 15-Sep-2023 Kees Cook <keescook@chromium.org>

dm raid: Annotate struct raid_set with __counted_by

Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct raid_set.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@redhat.com
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200335.never.098-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# a865b96c 29-May-2023 Yu Kuai <yukuai3@huawei.com>

Revert "md: unlock mddev before reap sync_thread in action_store"

This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.

Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18

This is because work is requeued while it's still inside workqueue:

t1: t2:
action_store
mddev_lock
if (mddev->sync_thread)
mddev_unlock
md_unregister_thread
// first sync_thread is done
md_check_recovery
mddev_try_lock
/*
* once MD_RECOVERY_DONE is set, new sync_thread
* can start.
*/
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
INIT_WORK(&mddev->del_work, md_start_sync)
queue_work(md_misc_wq, &mddev->del_work)
test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
// set pending bit
insert_work
list_add_tail
mddev_unlock
mddev_lock_nointr
md_reap_sync_thread
// MD_RECOVERY_RUNNING is cleared
mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
// MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
INIT_WORK(&mddev->del_work, md_start_sync)
work->data = 0
// work pending bit is cleared
queue_work(md_misc_wq, &mddev->del_work)
insert_work
list_add_tail
// list is corrupted

The above commit is reverted to fix the problem, the deadlock this
commit tries to fix will be fixed in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com


# 7d5fff89 08-Jul-2023 Yu Kuai <yukuai3@huawei.com>

dm raid: protect md_stop() with 'reconfig_mutex'

__md_stop_writes() and __md_stop() will modify many fields that are
protected by 'reconfig_mutex', and all the callers will grab
'reconfig_mutex' except for md_stop().

Also, update md_stop() to make certain 'reconfig_mutex' is held using
lockdep_assert_held().

Fixes: 9d09e663d550 ("dm: raid456 basic support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# e74c874e 08-Jul-2023 Yu Kuai <yukuai3@huawei.com>

dm raid: clean up four equivalent goto tags in raid_ctr()

There are four equivalent goto tags in raid_ctr(), clean them up to
use just one.

There is no functional change and this is preparation to fix
raid_ctr()'s unprotected md_stop().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# bae30287 08-Jul-2023 Yu Kuai <yukuai3@huawei.com>

dm raid: fix missing reconfig_mutex unlock in raid_ctr() error paths

In the error paths 'bad_stripe_cache' and 'bad_check_reshape',
'reconfig_mutex' is still held after raid_ctr() returns.

Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 955a257d 22-May-2023 Yu Kuai <yukuai3@huawei.com>

dm-raid: remove useless checking in raid_message()

md_wakeup_thread() handle the case that pass in md_thread is NULL, there
is no need to check this.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-3-yukuai1@huaweicloud.com


# 3664ff82 09-Apr-2023 Yangtao Li <frank.li@vivo.com>

dm: add helper macro for simple DM target module init and exit

Eliminate duplicate boilerplate code for simple modules that contain
a single DM target driver without any additional setup code.

Add a new module_dm() macro, which replaces the module_init() and
module_exit() with template functions that call dm_register_target()
and dm_unregister_target() respectively.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 306fbc2e 30-Mar-2023 Tom Rix <trix@redhat.com>

dm raid: remove unused d variable

clang with W=1 reports
drivers/md/dm-raid.c:2212:15: error: variable
'd' set but not used [-Werror,-Wunused-but-set-variable]
unsigned int d;
^
This variable is not used so remove it.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 23fda2ef 07-Feb-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: fix suspect indent whitespace

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# b30f1607 07-Feb-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: add missing blank line after declarations/fix those

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# a4a82ce3 26-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: correct block comments format.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 255e2646 25-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: address indent/space issues

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 2f06cd12 30-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: avoid initializing static variables

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 86a3238c 25-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: change "unsigned" to "unsigned int"

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 3bd94003 25-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: add missing SPDX-License-Indentifiers

'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.

Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# efdd3c33 05-Feb-2023 Yu Zhe <yuzhe@nfschina.com>

dm raid: fix some spelling mistakes in comments

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 96fccdce 04-Sep-2022 Jiangshan Yi <yijiangshan@kylinos.cn>

dm raid: fix typo in analyse_superblocks code comment

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# cea44663 30-Aug-2022 Jilin Yuan <yuanjilin@cdjrlc.com>

dm raid: delete the redundant word 'that' in comment

Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 9dfbdafd 20-Jun-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

md: unlock mddev before reap sync_thread in action_store

Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related with action_store path, other
callers which reap sync_thread didn't need to be changed.

Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous
bug with belows.

1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure position
not changed accidentally by others.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9dd1cd32 20-Jul-2022 Mike Snitzer <snitzer@kernel.org>

dm: fix dm-raid crash if md_handle_request() splits bio

Commit ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone")
introduced the optimization to _not_ perform bio_associate_blkg()'s
relatively costly work when DM core clones its bio. But in doing so it
exposed the possibility for DM's cloned bio to alter DM target
behavior (e.g. crash) if a target were to issue IO without first
calling bio_set_dev().

The DM raid target can trigger an MD crash due to its need to split
the DM bio that is passed to md_handle_request(). The split will
recurse to submit_bio_noacct() using a bio with an uninitialized
->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to
dereference a NULL blkg_to_tg(bio->bi_blkg).

Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that
will make alloc_tio() call bio_set_dev() on behalf of the target.
dm-raid is the only target that requires this flag. bio_set_dev()
initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg,
before passing the bio to md_handle_request().

Long-term fix would be to audit and refactor MD code to rely on DM to
split its bio, using dm_accept_partial_bio(), but there are MD raid
personalities (e.g. raid1 and raid10) whose implementation are tightly
coupled to handling the bio splitting inline.

Fixes: ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 7dad24db 24-Jul-2022 Mikulas Patocka <mpatocka@redhat.com>

dm raid: fix address sanitizer warning in raid_resume

There is a KASAN warning in raid_resume when running the lvm test
lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks
is greater than rs->raid_disks, so the loop touches one entry beyond
the allocated length.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 1fbeea21 24-Jul-2022 Mikulas Patocka <mpatocka@redhat.com>

dm raid: fix address sanitizer warning in raid_status

There is this warning when using a kernel with the address sanitizer
and running this testsuite:
https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid

==================================================================
BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x9c
print_address_description.constprop.0+0x1f/0x1e0
print_report.cold+0x55/0x244
kasan_report+0xc9/0x100
raid_status+0x1747/0x2820 [dm_raid]
dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
table_load+0x35c/0x630 [dm_mod]
ctl_ioctl+0x411/0x630 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
__x64_sys_ioctl+0x12a/0x1a0
do_syscall_64+0x5b/0x80

The warning is caused by reading conf->max_nr_stripes in raid_status. The
code in raid_status reads mddev->private, casts it to struct r5conf and
reads the entry max_nr_stripes.

However, if we have different raid type than 4/5/6, mddev->private
doesn't point to struct r5conf; it may point to struct r0conf, struct
r1conf, struct r10conf or struct mpconf. If we cast a pointer to one
of these structs to struct r5conf, we will be reading invalid memory
and KASAN warns about it.

Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 4ce4c73f 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

md/core: Combine two sync_page_io() arguments

Improve uniformity in the kernel of handling of request operation and
flags by passing these as a single argument.

Cc: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ce92fc4b 21-Jun-2022 Jiang Jian <jiangjian@cdjrlc.com>

dm raid: remove redundant "the" in parse_raid_params() comment

Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 332bd077 27-Jun-2022 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix accesses beyond end of raid member array

On dm-raid table load (using raid_ctr), dm-raid allocates an array
rs->devs[rs->raid_disks] for the raid device members. rs->raid_disks
is defined by the number of raid metadata and image tupples passed
into the target's constructor.

In the case of RAID layout changes being requested, that number can be
different from the current number of members for existing raid sets as
defined in their superblocks. Example RAID layout changes include:
- raid1 legs being added/removed
- raid4/5/6/10 number of stripes changed (stripe reshaping)
- takeover to higher raid level (e.g. raid5 -> raid6)

When accessing array members, rs->raid_disks must be used in control
loops instead of the potentially larger value in rs->md.raid_disks.
Otherwise it will cause memory access beyond the end of the rs->devs
array.

Fix this by changing code that is prone to out-of-bounds access.
Also fix validate_raid_redundancy() to validate all devices that are
added. Also, use braces to help clean up raid_iterate_devices().

The out-of-bounds memory accesses was discovered using KASAN.

This commit was verified to pass all LVM2 RAID tests (with KASAN
enabled).

Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# d0a18034 06-Jun-2022 Guoqing Jiang <guoqing.jiang@linux.dev>

Revert "md: don't unregister sync_thread with reconfig_mutex held"

The 07reshape5intr test is broke because of below path.

md_reap_sync_thread
-> mddev_unlock
-> md_unregister_thread(&mddev->sync_thread)

And md_check_recovery is triggered by,

mddev_unlock -> md_wakeup_thread(mddev->thread)

then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
reshape_position can't be updated accordingly.

Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>


# 8b48ec23 12-Feb-2021 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

md: don't unregister sync_thread with reconfig_mutex held

Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.

And it could cause deadlock problem for raid5 as follows:

1. process A tried to reap sync thread with reconfig_mutex held after echo
idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <song@kernel.org>


# 70200574 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: remove QUEUE_FLAG_DISCARD

Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6dcbb52c 17-Oct-2021 Christoph Hellwig <hch@lst.de>

dm: use bdev_nr_sectors and bdev_nr_bytes instead of open coding them

Use the proper helpers to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ec45662 12-Jul-2021 Tushar Sugandhi <tusharsu@linux.microsoft.com>

dm: update target status functions to support IMA measurement

For device mapper targets to take advantage of IMA's measurement
capabilities, the status functions for the individual targets need to be
updated to handle the status_type_t case for value STATUSTYPE_IMA.

Update status functions for the following target types, to log their
respective attributes to be measured using IMA.
01. cache
02. crypt
03. integrity
04. linear
05. mirror
06. multipath
07. raid
08. snapshot
09. striped
10. verity

For rest of the targets, handle the STATUSTYPE_IMA case by setting the
measurement buffer to NULL.

For IMA to measure the data on a given system, the IMA policy on the
system needs to be updated to have the following line, and the system
needs to be restarted for the measurements to take effect.

/etc/ima/ima-policy
measure func=CRITICAL_DATA label=device-mapper template=ima-buf

The measurements will be reflected in the IMA logs, which are located at:

/sys/kernel/security/integrity/ima/ascii_runtime_measurements
/sys/kernel/security/integrity/ima/binary_runtime_measurements

These IMA logs can later be consumed by various attestation clients
running on the system, and send them to external services for attesting
the system.

The DM target data measured by IMA subsystem can alternatively
be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
DM_TABLE_STATUS_CMD.

Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ca4a4e9a 30-Apr-2021 Mike Snitzer <snitzer@redhat.com>

dm raid: remove unnecessary discard limits for raid0 and raid10

Commit 29efc390b946 ("md/md0: optimize raid0 discard handling") and
commit d30588b2731f ("md/raid10: improve raid10 discard request")
remove MD raid0's and raid10's inability to properly handle large
discards. So eliminate associated constraints from dm-raid's support.

Depends-on: 29efc390b946 ("md/md0: optimize raid0 discard handling")
Depends-on: d30588b2731f ("md/raid10: improve raid10 discard request")
Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f99a8e43 21-Apr-2021 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences

If fast table reloads occur during an ongoing reshape of raid4/5/6
devices the target may race reading a superblock vs the the MD resync
thread; causing an inconclusive reshape state to be read in its
constructor.

lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause
BUG_ON() to trigger in md_run(), e.g.:
"kernel BUG at drivers/md/raid5.c:7567!".

Scenario triggering the bug:

1. the MD sync thread calls end_reshape() from raid5_sync_request()
when done reshaping. However end_reshape() _only_ updates the
reshape position to MaxSector keeping the changed layout
configuration though (i.e. any delta disks, chunk sector or RAID
algorithm changes). That inconclusive configuration is stored in
the superblock.

2. dm-raid constructs a mapping, loading named inconsistent superblock
as of step 1 before step 3 is able to finish resetting the reshape
state completely, and calls md_run() which leads to mentioned bug
in raid5.c.

3. the MD RAID personality's finish_reshape() is called; which resets
the reshape information on chunk sectors, delta disks, etc. This
explains why the bug is rarely seen on multi-core machines, as MD's
finish_reshape() superblock update races with the dm-raid
constructor's superblock load in step 2.

Fix identifies inconclusive superblock content in the dm-raid
constructor and resets it before calling md_run(), factoring out
identifying checks into rs_is_layout_change() to share in existing
rs_reshape_requested() and new rs_reset_inclonclusive_reshape(). Also
enhance a comment and remove an empty line.

Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# be962b2f 20-Apr-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

dm raid: fix fall-through warning in rs_check_takeover() for Clang

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# cc07d72b 24-Sep-2020 Mike Snitzer <snitzer@redhat.com>

dm raid: fix discard limits for raid1

Block core warned that discard_granularity was 0 for dm-raid with
personality of raid1. Reason is that raid_io_hints() was incorrectly
special-casing raid1 rather than raid0.

Fix raid_io_hints() by removing discard limits settings for
raid1. Check for raid0 instead.

Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Stephan Bärwolf <stephan@matrixstorm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0941e3b0 13-Dec-2020 Mike Snitzer <snitzer@redhat.com>

Revert "dm raid: fix discard limits for raid1 and raid10"

This reverts commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512.

Reverting 6ffeb1c3f822 ("md: change mddev 'chunk_sectors' from int to
unsigned") exposes dm-raid.c compiler warnings detailed that commit's
header. Clearly this more conservative fix, of simply reverting
e0910c8e4f8, would've been more prudent given how late we were in the
v5.10 release. Lessons have been learned.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# e2782f56 09-Dec-2020 Song Liu <songliubraving@fb.com>

Revert "dm raid: remove unnecessary discard limits for raid10"

This reverts commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28.

Matthew Ruffell reported data corruption in raid10 due to the changes
in discard handling [1]. Revert these changes before we find a proper fix.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>


# dc2985a8 16-Nov-2020 Christoph Hellwig <hch@lst.de>

dm-raid: use set_capacity_and_notify

Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f0e90b6c 24-Sep-2020 Mike Snitzer <snitzer@redhat.com>

dm raid: remove unnecessary discard limits for raid10

Commit bcc90d280465e ("md/raid10: improve raid10 discard request")
removes raid10's inability to properly handle large discards. So
eliminate associated constraint from dm-raid's raid10 support.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# e0910c8e 24-Sep-2020 Mike Snitzer <snitzer@redhat.com>

dm raid: fix discard limits for raid1 and raid10

Block core warned that discard_granularity was 0 for dm-raid with
personality of raid1. Reason is that raid_io_hints() was incorrectly
special-casing raid1 rather than raid0.

But since commit 29efc390b9462 ("md/md0: optimize raid0 discard
handling") even raid0 properly handles large discards.

Fix raid_io_hints() by removing discard limits settings for raid1.
Also, fix limits for raid10 by properly stacking underlying limits as
done in blk_stack_limits().

Depends-on: 29efc390b9462 ("md/md0: optimize raid0 discard handling")
Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 659e56ba 01-Sep-2020 Christoph Hellwig <hch@lst.de>

block: add a new revalidate_disk_size helper

revalidate_disk is a relative awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 04dc5330 15-Jul-2020 Damien Le Moal <damien.lemoal@wdc.com>

dm raid: Remove empty if statement

In super_init_validation(), remove a body-less if statement testing only
variables to avoid a compilation warning when compiling with W=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 21cf8661 01-Jul-2020 Christoph Hellwig <hch@lst.de>

writeback: remove bdi->congested_fn

Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b18ae8dd 07-May-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

dm: replace zero-length array with flexible-array

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 43f3952a 19-Dec-2019 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: table line rebuild status fixes

raid_status() wasn't emitting rebuild flags on the table line properly
because the rdev number was not yet set properly; index raid component
devices array directly to solve.

Also fix wrong argument count on emitted table line caused by 1 too
many rebuild/write_mostly argument and consider any journal_(dev|mode)
pairs.

Link: https://bugzilla.redhat.com/1782045
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 35ad035b 24-Oct-2019 Nathan Chancellor <nathan@kernel.org>

dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layout

When building with Clang + -Wtautological-constant-compare:

drivers/md/dm-raid.c:619:8: warning: converting the result of '<<' to a
boolean always evaluates to true [-Wtautological-constant-compare]
r = !RAID10_OFFSET;
^
drivers/md/dm-raid.c:517:28: note: expanded from macro 'RAID10_OFFSET'
#define RAID10_OFFSET (1 << 16) /* stripes with data
copies area adjacent on devices */
^
1 warning generated.

Negating a non-zero number will always make it zero, which is the
default value of r in this function so this statement is unnecessary;
remove it so that clang no longer warns.

Link: https://github.com/ClangBuiltLinux/linux/issues/753
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 53be73a5 01-Oct-2019 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: streamline rs_get_progress() and its raid_status() caller side

Pass already deciphered state into rs_get_progress, simplify recovery offset
definition and combine two st_resync, st_reshape conditionals into one as is
already the case with st_check and st_repair.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f9f3ee91 01-Oct-2019 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: simplify rs_setup_recovery call chain

rs_setup_recovery() sets the starting recovery offset.

Drop superfluous rs_setup_recovery() and replace with __rs_setup_recovery().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 99273d9e 01-Oct-2019 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: to ensure resynchronization, perform raid set grow in preresume

This fixes a flaw causing raid set extensions not to be synchronized
in case the MD bitmap resize required additional pages to be allocated.

Also share resize code in the raid constructor between
new size changes and those occuring during recovery.

Bump the target version to define the change and document
it in Documentation/admin-guide/device-mapper/dm-raid.rst.

Reported-by: Steve D <steved424@gmail.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 22c992e1 01-Oct-2019 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: change rs_set_dev_and_array_sectors API and callers

Add a size argument to rs_set_dev_and_array_sectors as prerequisite
to fixing grown device resynchronization not occuring when new MD
bitmap pages have to be allocated as a result of the extension in
a follwup patch.

Also avoid code duplication by using rs_set_rdev_sectors
in the aforementioned function.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c8156fc7 11-Sep-2019 Ming Lei <ming.lei@redhat.com>

dm raid: fix updating of max_discard_sectors limit

Unit of 'chunk_size' is byte, instead of sector, so fix it by setting
the queue_limits' max_discard_sectors to rs->md.chunk_sectors. Also,
rename chunk_size to chunk_size_bytes.

Without this fix, too big max_discard_sectors is applied on the request
queue of dm-raid, finally raid code has to split the bio again.

This re-split done by raid causes the following nested clone_endio:

1) one big bio 'A' is submitted to dm queue, and served as the original
bio

2) one new bio 'B' is cloned from the original bio 'A', and .map()
is run on this bio of 'B', and B's original bio points to 'A'

3) raid code sees that 'B' is too big, and split 'B' and re-submit
the remainded part of 'B' to dm-raid queue via generic_make_request().

4) now dm will handle 'B' as new original bio, then allocate a new
clone bio of 'C' and run .map() on 'C'. Meantime C's original bio
points to 'B'.

5) suppose now 'C' is completed by raid directly, then the following
clone_endio() is called recursively:

clone_endio(C)
->clone_endio(B) #B is original bio of 'C'
->bio_endio(A)

'A' can be big enough to make hundreds of nested clone_endio(), then
stack can be corrupted easily.

Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# dc1a3e8e 18-Aug-2019 Wenwen Wang <wenwen@cs.uga.edu>

dm raid: add missing cleanup in raid_ctr()

If rs_prepare_reshape() fails, no cleanup is executed, leading to
leak of the raid_set structure allocated at the beginning of
raid_ctr(). To fix this issue, go to the label 'bad' if the error
occurs.

Fixes: 11e4723206683 ("dm raid: stop keeping raid set frozen altogether")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6cf2a73c 17-Jun-2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

docs: device-mapper: move it to the admin-guide

The DM support describes lots of aspects related to mapped
disk partitions from the userspace PoV.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>


# f0ba4377 12-Jun-2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

docs: convert docs to ReST and rename to *.rst

The conversion is actually:
- add blank lines and indentation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# 61697a6a 18-Jan-2019 Mike Snitzer <snitzer@redhat.com>

dm: eliminate 'split_discard_bios' flag from DM target interface

There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits. A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 74694bcb 18-Dec-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix false -EBUSY when handling check/repair message

Sending a check/repair message infrequently leads to -EBUSY instead of
properly identifying an active resync. This occurs because
raid_message() is testing recovery bits in a racy way.

Fix by calling decipher_sync_action() from raid_message() to properly
identify the idle state of the RAID device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d857ad75 12-Oct-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: avoid bitmap with raid4/5/6 journal device

With raid4/5/6, journal device and write intent bitmap are mutually exclusive.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0328ba90 17-Sep-2018 Geert Uytterhoeven <geert@linux-m68k.org>

dm raid: remove bogus const from decipher_sync_action() return type

With gcc-4.1.2:

drivers/md/dm-raid.c:3357: warning: type qualifiers ignored on function return type

Remove the "const" keyword to fix this.

Fixes: 36a240a706d43383 ("dm raid: fix RAID leg rebuild errors")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5380c05b 06-Sep-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: bump target version, update comments and documentation

Bump target version to reflect the documented fixes are available.
Also fix some code comments (typos and clarity).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 36a240a7 06-Sep-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix RAID leg rebuild errors

On fast devices such as NVMe, a flaw in rs_get_progress() results in
false target status output when userspace lvm2 requests leg rebuilds
(symptom of the failure is device health chars 'aaaaaaaa' instead of
expected 'aAaAAAAA' causing lvm2 to fail).

The correct sync action state definitions already exist in
decipher_sync_action() so fix rs_get_progress() to use it.

Change decipher_sync_action() to return an enum rather than a string for
the sync states and call it from rs_get_progress(). Introduce
sync_str() to translate from enum to the string that is needed by
raid_status().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c44a5ee8 06-Sep-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix rebuild of specific devices by updating superblock

Update superblock when particular devices are requested via rebuild
(e.g. lvconvert --replace ...) to avoid spurious failure with the "New
device injected into existing raid set without 'delta_disks' or
'rebuild' parameter specified" error message.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 644e2537 06-Sep-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix stripe adding reshape deadlock

When initiating a stripe adding reshape, a deadlock between
md_stop_writes() waiting for the sync thread to stop and the running
sync thread waiting for inactive stripes occurs (this frequently happens
on single-core but rarely on multi-core systems).

Fix this deadlock by setting MD_RECOVERY_WAIT to have the main MD
resynchronization thread worker (md_do_sync()) bail out when initiating
the reshape via constructor arguments.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 38b0bd0c 06-Sep-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix reshape race on small devices

Loading a new mapping table, the dm-raid target's constructor
retrieves the volatile reshaping state from the raid superblocks.

When the new table is activated in a following resume, the actual
reshape position is retrieved. The reshape driven by the previous
mapping can already have finished on small and/or fast devices thus
updating raid superblocks about the new raid layout.

This causes the actual array state (e.g. stripe size reshape finished)
to be inconsistent with the one in the new mapping, causing hangs with
left behind devices.

This race does not occur with usual raid device sizes but with small
ones (e.g. those created by the lvm2 test suite).

Fix by no longer transferring stale/inconsistent raid_set state during
preresume.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# e64e4018 01-Aug-2018 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

md: Avoid namespace collision with bitmap API

bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand MD bitmap API is special case.
Adding 'md' prefix to it to avoid name space collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>


# f2ccaa59 22-Jun-2018 Arnd Bergmann <arnd@arndb.de>

dm raid: don't use 'const' in function return

A newly introduced function has 'const int' as the return type,
but as "make W=1" reports, that has no meaning:

drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]

This changes the return type to plain 'int'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33e53f06850f ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 552aa679f2657431 ("dm raid: use rs_is_raid*()")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# acafe7e3 08-May-2018 Kees Cook <keescook@chromium.org>

treewide: Use struct_size() for kmalloc()-family

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
int stuff;
void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
// sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>


# 13bc62d4 28-Mar-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix parse_raid_params() variable range issue

parse_raid_params() compares variable "int value" with INT_MAX.

E.g. related Coverity report excerpt:
CID 1364818 (#2 of 3): Operands don't affect result (CONSTANT_EXPRESSION_RESULT) [select issue]
1433 if (value > INT_MAX) {

Fix by changing checks to avoid INT_MAX.

Whilst on it, avoid unnecessary checks against constants
and add check for sane recovery speed min/max.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 880bcce0 16-Mar-2018 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix nosync status

Fix a race for "nosync" activations providing "aa.." device health
characters and "0/N" sync ratio rather than "AA..." and "N/N". Occurs
when status for the raid set is retrieved during resume before the MD
sync thread starts and clears the MD_RECOVERY_NEEDED flag.

Cc: stable@vger.kernel.org # 4.16+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 1eb5fa84 28-Feb-2018 Mike Snitzer <snitzer@redhat.com>

dm: allow targets to return output from messages they are sent

Could be useful for a target to return stats or other information.
If a target does DMEMIT() anything to @result from its .message method
then it must return 1 to the caller.

Signed-off-By: Mike Snitzer <snitzer@redhat.com>


# da1e1488 27-Feb-2018 Jonathan Brassow <jbrassow@redhat.com>

dm raid: fix incorrect sync_ratio when degraded

Upstream commit 4102d9de6d375 ("dm raid: fix rs_get_progress()
synchronization state/ratio") in combination with commit 7c29744ecce
("dm raid: simplify rs_get_progress()") introduced a regression by
incorrectly reporting a sync_ratio of 0 for degraded raid sets. This
caused lvm2 to fail to repair raid legs automatically.

Fix by identifying the degraded state by checking the MD_RECOVERY_INTR
flag and returning mddev->recovery_cp in case it is set.

MD sets recovery = [ MD_RECOVERY_RECOVER MD_RECOVERY_INTR
MD_RECOVERY_NEEDED ] when a RAID member fails. It then shuts down any
sync thread that is running and leaves us with all MD_RECOVERY_* flags
cleared. The bug occurs if a status is requested in the short time it
takes to shut down any sync thread and clear the flags, because we were
keying in on the MD_RECOVERY_NEEDED - understanding it to be the initial
phase of a “recover” sync thread. However, this is an incorrect
interpretation if MD_RECOVERY_INTR is also set.

This also explains why the bug only happened when automatic repair was
enabled and not a normal ‘manual’ method. It is impossible to react
quick enough to hit the problematic window without it being automated.

Fix passes automatic repair tests.

Fixes: 7c29744ecce ("dm raid: simplify rs_get_progress()")
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 67ac901c 01-Jan-2018 Wei Yongjun <weiyongjun1@huawei.com>

dm raid: make raid_sets symbol static

Fixes the following sparse warning:

drivers/md/dm-raid.c:33:1: warning:
symbol 'raid_sets' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 552aa679 13-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use rs_is_raid*()

Cleanup, no functional change.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7c29744e 13-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: simplify rs_get_progress()

No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# dc15b943 13-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: ensure 'a' chars during reshape

During reshape, 'A' chars were reported in status rather than 'a'.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 11e47232 13-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: stop keeping raid set frozen altogether

In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared. The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.

Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether. This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.

Fixes: d39f0010e ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 53bf5384 13-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: validate current raid sets redundancy

Verifying the current raid sets redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).

Using the requested raid level of raid10 lead to a "divide error"
on raid0 which defines data copies divided by to be zero.

Also check for bogus data copies.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d5d885fd 19-Nov-2017 Song Liu <songliubraving@fb.com>

md: introduce new personality funciton start()

In do_md_run(), md threads should not wake up until the array is fully
initialized in md_run(). However, in raid5_run(), raid5-cache may wake
up mddev->thread to flush stripes that need to be written back. This
design doesn't break badly right now. But it could lead to bad bug in
the future.

This patch tries to resolve this problem by splitting start up work
into two personality functions, run() and start(). Tasks that do not
require the md threads should go into run(), while task that require
the md threads go into start().

r5l_load_log() is moved to raid5_start(), so it is not called until
the md threads are started in do_md_run().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# b84cf269 04-Dec-2017 Mike Snitzer <snitzer@redhat.com>

dm raid: bump target version to reflect numerous fixes

Also update Documentation accordingly.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 78a75d10 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: small cleanup and remove unsed "struct raid_set" member

Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().

Also, remove unused 'bitmap_loaded' member from "struct raid_set".

No functional changes.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4102d9de 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix rs_get_progress() synchronization state/ratio

Fix various sync state issues causing racy/bogus sync ratio,
sync_action ad health chars in dm_status() info output.

Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
"0 2097152 raid raid1 2 Aa 2097162/2097152 recover 0 0 -"
instead of:
"0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -"

Sync action could be non-idle, when the MD thread was done with io.

Health chars could be 'A' when they should be 'a' for a short time
before a resynchonization started.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 242ea5ad 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: avoid passing array_in_sync variable to raid_status() callees

The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array. Replace it with a
runtime flag. This will avoid a pattern of having to pass discrete
variables to various functions.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 67143510 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: display a consistent copy of the MD status via raid_status()

The MD sync thread updates recovery flags providing state of any
running, idle, frozen, recovering, reshaping, ... activity it performs
and updates respective flags asynchronously versus dm processing
raid_status(). To close that race window, take a single copy of the
flags and pass it into its callees.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d39f0010 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix raid_resume() to keep raid set frozen as needed

During a reshape request: if userspace reloads a "raid" table multiple
times, resulting in multiple superblock reads, the raid set needs to
stay frozen until all config changes (chunk size, layout data_offset,
delta_disks) have been stored in the superblocks and respective flags
cleared.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 188a212d 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add component device size checks to avoid runtime failure

Check all component data device sizes versus calculated size.
Reject if device(s) are too small. Otherwise, MD will fail the
operation by accessing beyond the end of the data device.

An example use-case is that growing bitmap won't fit any more and the MD
runtime will report an error when DM raid should catch this earlier.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 61e06e2c 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix raid set size revalidation

The raid set size is being revalidated unconditionally before a
reshaping conversion is started. MD requires the size to only be
reduced in case of a stripe removing (i.e. shrinking) reshape but not
when growing because the raid array has to stay small until after the
growing reshape finishes.

Fix by avoiding the size revalidation in preresume unless a shrinking
reshape is requested.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7501537e 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: correct resizing state relative to reshape space in ctr

Pay attention to existing reshape space to define if a raid set needs
resizing. Otherwise we can hit "Can't resize a reshaping raid set"
when a reshape is being requested.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 052b2b1e 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: consume sizes after md_finish_reshape() completes changing them

The md raid personalities call md_finish_reshape() at the end of a
reshape conversion which adjusts rdev->sectors.

Correct/check rdev->sectors before initiating a reshape and raise the
recovery pointer accordingly.

Otherwise, the DM raid coordinated reshape will fail.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 1af2048a 01-Dec-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix deadlock caused by premature md_stop_writes()

md_stop_writes() is called in raid_presuspend() causing deadlocks on
bios submitted afterwards -- which happens on loaded raid sets with
conversion requests.

Fix by moving md_stop_writes() to raid_postsuspend(). NOTE: when the
recovery's frozen (MD_RECOVERY_FROZEN), writes haven't been started (or
are already stopped) so don't stop them again.

Also remove superfluous readonly setting.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7dea378b 16-Nov-2017 Mike Snitzer <snitzer@redhat.com>

dm: do not set 'discards_supported' in targets that do not need it

The DM target's 'discards_supported' flag is intended to act as an
override. Meaning, even if the underlying storage doesn't support
discards the DM target will.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 23397844 02-Nov-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix panic when attempting to force a raid to sync

Requesting a sync on an active raid device via a table reload
(see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
skips the super_load() call that defines the superblock size
(rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
is called.

Fix by moving the initialization of the superblock start and size
out of super_load() to the caller (analyse_superblocks).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4d5324f7 18-Oct-2017 NeilBrown <neilb@suse.com>

md: always hold reconfig_mutex when calling mddev_suspend()

Most often mddev_suspend() is called with
reconfig_mutex held. Make this a requirement in
preparation a subsequent patch. Also require
reconfig_mutex to be held for mddev_resume(),
partly for symmetry and partly to guarantee
no races with incr/decr of mddev->suspend.

Taking the mutex in r5c_disable_writeback_async() is
a little tricky as this is called from a work queue
via log->disable_writeback_work, and flush_work()
is called on that while holding ->reconfig_mutex.
If the work item hasn't run before flush_work()
is called, the work function will not be able to
get the mutex.

So we use mddev_trylock() inside the wait_event() call, and have that
abort when conf->log is set to NULL, which happens before
flush_work() is called.
We wait in mddev->sb_wait and ensure this is woken
when any of the conditions change. This requires
waking mddev->sb_wait in mddev_unlock(). This is only
like to trigger extra wake_ups of threads that needn't
be woken when metadata is being written, and that
doesn't happen often enough that the cost would be
noticeable.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 935fe098 10-Oct-2017 Mike Snitzer <snitzer@redhat.com>

md: rename some drivers/md/ files to have an "md-" prefix

Motivated by the desire to illiminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list. Which is born out of the fact that DM files also
reside in drivers/md/

Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.

Shaohua: don't change module name

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 41dcf197 02-Oct-2017 Jonathan Brassow <jbrassow@redhat.com>

dm raid: fix incorrect status output at the end of a "recover" process

There are three important fields that indicate the overall health and
status of an array: dev_health, sync_ratio, and sync_action. They tell
us the condition of the devices in the array, and the degree to which
the array is synchronized.

This commit fixes a condition that is reported incorrectly. When a member
of the array is being rebuilt or a new device is added, the "recover"
process is used to synchronize it with the rest of the array. When the
process is complete, but the sync thread hasn't yet been reaped, it is
possible for the state of MD to be:
mddev->recovery = [ MD_RECOVERY_RUNNING MD_RECOVERY_RECOVER MD_RECOVERY_DONE ]
curr_resync_completed = <max dev size> (but not MaxSector)
and all rdevs to be In_sync.
This causes the 'array_in_sync' output parameter that is passed to
rs_get_progress() to be computed incorrectly and reported as 'false' --
or not in-sync. This in turn causes the dev_health status characters to
be reported as all 'a', rather than the proper 'A'.

This can cause erroneous output for several seconds at a time when tools
will want to be checking the condition due to events that are raised at
the end of a sync process. Fix this by properly calculating the
'array_in_sync' return parameter in rs_get_progress().

Also, remove an unnecessary intermediate 'recovery_cp' variable in
rs_get_progress().

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c4d6a1b8 21-Sep-2017 Shaohua Li <shli@fb.com>

dm-raid: fix a race condition in request handling

raid_map calls pers->make_request, which missed the suspend check. Fix it with
the new md_handle_request API.

Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# ac6a3188 13-Jul-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: bump target version

Bumo dm-raid target version to 1.12.1 to reflect that commit cc27b0c78c
("md: fix deadlock between mddev_suspend() and md_write_start()") is
available.

This version change allows userspace to detect that MD fix is available.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0cf352e5 13-Jul-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: avoid mddev->suspended access

Use runtime flag to ensure that an mddev gets suspended/resumed just once.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f4af3f82 13-Jul-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix activation check in validate_raid_redundancy()

During growing reshapes (i.e. stripes being added to a raid set), the
new stripe images are not in-sync and not part of the raid set until
the reshape is started.

LVM2 has to request multiple table reloads involving superblock updates
in order to reflect proper size of SubLVs in the cluster. Before a stripe
adding reshape starts, validate_raid_redundancy() fails as a result of that
because it checks the total number of devices against the number of rebuild
ones rather than the actual ones in the raid set (as retrieved from the
superblock) thus resulting in failed raid4/5/6/10 redundancy checks.

E.g. convert 3 stripes -> 7 stripes raid5 (which only allows for maximum
1 device to fail) requesting +4 delta disks causing 4 devices to rebuild
during reshaping thus failing activation.

To fix this, move validate_raid_redundancy() to get access to the
current raid_set members.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# bbac1e06 13-Jul-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: remove WARN_ON() in raid10_md_layout_to_format()

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4d49f1b4 30-Jun-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: stop using BUG() in __rdev_sectors()

Return 0 rather than BUG() if __rdev_sectors() fails and catch invalid
rdev size in the constructor.

Reported-by: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c4d097d1 23-Jun-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix oops on upgrading to extended superblock format

When a RAID set was created on dm-raid version < 1.9.0 (old RAID
superblock format), all of the new 1.9.0 members of the superblock are
uninitialized (zero) -- including the device sectors member needed to
support shrinking.

All the other accesses to superblock fields new in 1.9.0 were reviewed
and verified to be properly guarded against invalid use. The 'sectors'
member was the only one used when the superblock version is < 1.9.

Don't access the superblock's >= 1.9.0 'sectors' member unconditionally.
Also add respective comments.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 48920ff2 05-Apr-2017 Christoph Hellwig <hch@lst.de>

block: remove the discard_zeroes_data flag

Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7a0c5c5b 30-Mar-2017 Dmitry Bilunov <kmeaw@yandex-team.ru>

dm raid: fix NULL pointer dereference for raid1 without bitmap

Commit 4257e08 ("dm raid: support to change bitmap region size")
introduced a bitmap resize call during preresume phase. User can create
a DM device with "raid" target configured as raid1 with no metadata
devices to hold superblock/bitmap info. It can be achieved using the
following sequence:

truncate -s 32M /dev/shm/raid-test
LOOP=$(losetup --show -f /dev/shm/raid-test)
dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0"
dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024"
dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1"

This results in the following crash:

[ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB
[ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors
[ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.114816] PGD 0

[ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015
[ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000
[ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246
[ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff
[ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058
[ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4029.115223] FS: 00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000
[ 4029.115245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0
[ 4029.115281] Call Trace:
[ 4029.115291] ? raid_iterate_devices+0x63/0x80 [dm_raid]
[ 4029.115309] ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod]
[ 4029.115329] ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod]
[ 4029.115346] raid_preresume+0x81/0x2e0 [dm_raid]
[ 4029.115361] dm_table_resume_targets+0x47/0xe0 [dm_mod]
[ 4029.115378] dm_resume+0xa8/0xd0 [dm_mod]
[ 4029.115391] dev_suspend+0x123/0x250 [dm_mod]
[ 4029.115405] ? table_load+0x350/0x350 [dm_mod]
[ 4029.115419] ctl_ioctl+0x1c2/0x490 [dm_mod]
[ 4029.115433] dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 4029.115447] do_vfs_ioctl+0x8d/0x5a0
[ 4029.115459] ? ____fput+0x9/0x10
[ 4029.115470] ? task_work_run+0x79/0xa0
[ 4029.115481] SyS_ioctl+0x3c/0x70
[ 4029.115493] entry_SYSCALL_64_fastpath+0x13/0x94

The raid_preresume() function incorrectly assumes that the raid_set has
a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set. But
RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap()
even if there is no bitmap present (and bitmap_load() happily returns 0
even if a bitmap isn't present). So the only way forward in the
near-term is to check if the bitmap is present by seeing if
mddev->bitmap is not NULL after bitmap_load() has been called.

By doing so the above NULL pointer is avoided.

Fixes: 4257e08 ("dm raid: support to change bitmap region size")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6e53636f 22-Mar-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add raid4/5/6 journal write-back support via journal_mode option

Commit 63c32ed4afc ("dm raid: add raid4/5/6 journaling support") added
journal support to close the raid4/5/6 "write hole" -- in terms of
writethrough caching.

Introduce a "journal_mode" feature and use the new
r5c_journal_mode_set() API to add support for switching the journal
device's cache mode between write-through (the current default) and
write-back.

NOTE: If the journal device is not layered on resilent storage and it
fails, write-through mode will cause the "write hole" to reoccur. But
if the journal fails while in write-back mode it will cause data loss
for any dirty cache entries unless resilent storage is used for the
journal.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4464e36e 17-Mar-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix table line argument order in status

Commit 3a1c1ef2f ("dm raid: enhance status interface and fixup
takeover/raid0") added new table line arguments and introduced an
ordering flaw. The sequence of the raid10_copies and raid10_format
raid parameters got reversed which causes lvm2 userspace to fail by
falsely assuming a changed table line.

Sequence those 2 parameters as before so that old lvm2 can function
properly with new kernels by adjusting the table line output as
documented in Documentation/device-mapper/dm-raid.txt.

Also, add missing version 1.10.1 highlight to the documention.

Fixes: 3a1c1ef2f ("dm raid: enhance status interface and fixup takeover/raid0")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2664f3c9 28-Feb-2017 Mike Snitzer <snitzer@redhat.com>

dm raid: bump the target version

This version bump reflects that the reshape corruption fix (commit
92a39f6cc "dm raid: fix data corruption on reshape request") is
present.

Done as a separate fix because the above referenced commit is marked for
stable and target version bumps in a stable@ fix are a recipe for the
fix to never get backported to stable@ kernels (because of target
version number conflicts).

Also, move RESUME_STAY_FROZEN_FLAGS up with the reset the the _FLAGS
definitions now that we don't need to worry about stable@ conflicts as a
result of missing context.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d36a1954 28-Feb-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix data corruption on reshape request

The lvm2 sequence to manage dm-raid constructor flags that trigger a
rebuild or a reshape is defined as:

1) load table with flags (e.g. rebuild/delta_disks/data_offset)
2) clear out the flags in lvm2 metadata
3) store the lvm2 metadata, reload the table to reset the flags
previously established during the initial load (1) -- in order to
prevent repeatedly requesting a rebuild or a reshape on activation

Currently, loading an inactive table with rebuild/reshape flags
specified will cause dm-raid to rebuild/reshape on resume and thus start
updating the raid metadata (about the progress). When the second table
reload, to reset the flags, occurs the constructor accesses the volatile
progress state kept in the raid superblocks. Because the active mapping
is still processing the rebuild/reshape, that position will be stale by
the time the device is resumed.

In the reshape case, this causes data corruption by processing already
reshaped stripes again. In the rebuild case, it does _not_ cause data
corruption but instead involves superfluous rebuilds.

Fix by keeping the raid set frozen during the first resume and then
allow the rebuild/reshape during the second resume.

Fixes: 9dbd1aa3a ("dm raid: add reshaping support to the target")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.8+


# ad470472 28-Feb-2017 Mike Snitzer <snitzer@redhat.com>

dm raid: fix raid "check" regression due to improper cleanup in raid_message()

While cleaning up awkward branching in raid_message() a raid set "check"
regression was introduced because "check" needs both MD_RECOVERY_SYNC
and MD_RECOVERY_REQUESTED flags set.

Fix this regression by explicitly setting both flags for the "check"
case (like is also done for the "repair" case, but redundant set_bit()s
are perfectly fine because it adds clarity to what is needed in response
to both messages -- in addition this isn't fast path code).

Fixes: 105db59912 ("dm raid: cleanup awkward branching in raid_message() option processing")
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 105db599 06-Jan-2017 Mike Snitzer <snitzer@redhat.com>

dm raid: cleanup awkward branching in raid_message() option processing

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 977f1a0a 13-Jan-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use mddev rather than rdev->mddev

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# e2568465 13-Jan-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use read_disk_sb() throughout

For consistency, call read_disk_sb() from
attempt_restore_of_faulty_devices() instead
of calling sync_page_io() directly.

Explicitly set device to faulty on superblock read error.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 63c32ed4 30-Nov-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add raid4/5/6 journaling support

Add md raid4/5/6 journaling support (upstream commit bac624f3f86a started
the implementation) which closes the write hole (i.e. non-atomic updates
to stripes) using a dedicated journal device.

Background:
raid4/5/6 stripes hold N data payloads per stripe plus one parity raid4/5
or two raid6 P/Q syndrome payloads in an in-memory stripe cache.
Parity or P/Q syndromes used to recover any data payloads in case of a disk
failure are calculated from the N data payloads and need to be updated on the
different component devices of the raid device. Those are non-atomic,
persistent updates. Hence a crash can cause failure to update all stripe
payloads persistently and thus cause data loss during stripe recovery.
This problem gets addressed by writing whole stripe cache entries (together with
journal metadata) to a persistent journal entry on a dedicated journal device.
Only if that journal entry is written successfully, the stripe cache entry is
updated on the component devices of the raid device (i.e. writethrough type).
In case of a crash, the entry can be recovered from the journal and be written
again thus ensuring consistent stripe payload suitable to data recovery.

Future dependencies:
once writeback caching being worked on to compensate for the throughput
implictions involved with writethrough overhead is supported with journaling
in upstream, an additional patch based on this one will support it in dm-raid.

Journal resilience related remarks:
because stripes are recovered from the journal in case of a crash, the
journal device better be resilient. Resilience becomes mandatory with
future writeback support, because loosing the working set in the log
means data loss as oposed to writethrough, were the loss of the
journal device 'only' reintroduces the write hole.

Fix comment on data offsets in parse_dev_params() and initialize
new_data_offset as well.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 50c4feb9 13-Jan-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: be prepared to accept arbitrary '- -' tuples

During raid set resize checks and setting up the recovery offset in case a raid
set grows, calculated rd->md.dev_sectors is compared to rs->dev[0].rdev.sectors.

Device 0 may not be defined in case userspace passes in '- -' for it
(lvm2 doesn't do that so far), thus it's device sectors can't be taken
authoritatively in this comparison and another valid device must be used
to retrieve the device size.

Use mddev->dev_sectors in checking for ongoing recovery for the same reason.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c63ede3b 13-Jan-2017 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix transient device failure processing

This fix addresses the following 3 failure scenarios:

1) If a (transiently) inaccessible metadata device is being passed into the
constructor (e.g. a device tuple '254:4 254:5'), it is processed as if
'- -' was given. This erroneously results in a status table line containing
'- -', which mistakenly differs from what has been passed in. As a result,
userspace libdevmapper puts the device tuple seperate from the RAID device
thus not processing the dependencies properly.

2) False health status char 'A' instead of 'D' is emitted on the status
status info line for the meta/data device tuple in this metadata device
failure case.

3) If the metadata device is accessible when passed into the constructor
but the data device (partially) isn't, that leg may be set faulty by the
raid personality on access to the (partially) unavailable leg. Restore
tried in a second raid device resume on such failed leg (status char 'D')
fails after the (partial) leg returned.

Fixes for aforementioned failure scenarios:

- don't release passed in devices in the constructor thus allowing the
status table line to e.g. contain '254:4 254:5' rather than '- -'

- emit device status char 'D' rather than 'A' for the device tuple
with the failed metadata device on the status info line

- when attempting to restore faulty devices in a second resume, allow the
device hot remove function to succeed by setting the device to not in-sync

In case userspace intentionally passes '- -' into the constructor to avoid that
device tuple (e.g. to split off a raid1 leg temporarily for later re-addition),
the status table line will correctly show '- -' and the status info line will
provide a '-' device health character for the non-defined device tuple.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2953079c 08-Dec-2016 Shaohua Li <shli@fb.com>

md: separate flags for superblock changes

The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


# 11e29684 29-Nov-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix discard support regression

Commit ecbfb9f118 ("dm raid: add raid level takeover support") moved the
configure_discard_support() call from raid_ctr() to raid_preresume().

Enabling/disabling discard _must_ happen during table load (through the
.ctr hook). Fix this regression by moving the
configure_discard_support() call back to raid_ctr().

Fixes: ecbfb9f118 ("dm raid: add raid level takeover support")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# affa9d28 24-Nov-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: don't allow "write behind" with raid4/5/6

Remove CTR_FLAG_MAX_WRITE_BEHIND from raid4/5/6's valid ctr flags.

Only the md raid1 personality supports setting a maximum number
of "write behind" write IOs on any legs set to "write mostly".
"write mostly" enhances throughput with slow links/disks.

Technically the "write behind" value is a write intent bitmap
property only being respected by the raid1 personality. It allows a
maximum number of "write behind" writes to any "write mostly" raid1
mirror legs to be delayed and avoids reads from such legs.

No other MD personalities supported via dm-raid make use of "write
behind", thus setting this property is superfluous; it wouldn't cause
harm but it is correct to reject it.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 453c2a89 18-Oct-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: correct error messages on old metadata validation

When target 1.9.1 gets takeover/reshape requests on devices with old superblock
format not supporting such conversions and rejects them in super_init_validation(),
it logs bogus error message (e.g. Reshape when a takeover is requested).

Whilst on it, add messages for disk adding/removing and stripe sectors
reshape requests, use the newer rs_{takeover,reshape}_requested() API,
address a raid10 false positive in checking array positions and
remove rs_set_new() because device members are already set proper.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# b052b07c 17-Oct-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix activation of existing raid4/10 devices

dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the
old superblock format (which does not have takeover/reshaping support
that was added via commit 33e53f06850f).

Fix validation path for old superblocks by reverting to the old raid4
layout and basing checks on mddev->new_{level,layout,...} members in
super_init_validation().

Cc: stable@vger.kernel.org # 4.8
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5c33677c 11-Oct-2016 Andy Whitcroft <apw@canonical.com>

dm raid: fix compat_features validation

In ecbfb9f118bce4 ("dm raid: add raid level takeover support") a new
compatible feature flag was added. Validation for these compat_features
was added but this only passes for new raid mappings with this feature
flag. This causes previously created raid mappings to be failed at
import.

Check compat_features for the only valid combination.

Fixes: ecbfb9f118bce4 ("dm raid: add raid level takeover support")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 9e7d9367 17-Aug-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: support raid0 with missing metadata devices

The raid0 MD personality does not start a raid0 array with any of its
data devices missing.

dm-raid was removing data/metadata device pairs unconditionally if it
failed to read a superblock off the respective metadata device of such
pair, resulting in failure to start arrays with the raid0 personality.

Avoid removing any data/metadata device pairs in case of raid0
(e.g. lvm2 segment type 'raid0_meta') thus allowing MD to start the
array.

Also, avoid region size validation for raid0.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# a3c06a38 09-Aug-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: enhance attempt_restore_of_faulty_devices() to support more devices

attempt_restore_of_faulty_devices() is limited to 64 when it should support
the new maximum of 253 when identifying any failed devices. It clears any
revivable devices via an MD personality hot remove and add cylce to allow
for their recovery.

Address by using existing functions to retrieve and update all failed
devices' bitfield members in the dm raid superblocks on all RAID devices
and check for any devices to clear in it.

Whilst on it, don't call attempt_restore_of_faulty_devices() for any MD
personality not providing disk hot add/remove methods (i.e. raid0 now),
because such personalities don't support reviving of failed disks.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 31e10a41 09-Aug-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix restoring of failed devices regression

'lvchange --refresh RaidLV' causes a mapped device suspend/resume cycle
aiming at device restore and resync after transient device failures. This
failed because flag RT_FLAG_RS_RESUMED was always cleared in the suspend path,
thus the device restore wasn't performed in the resume path.

Solve by removing RT_FLAG_RS_RESUMED from the suspend path and resume
unconditionally. Also, remove superfluous comment from raid_resume().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# a4423287 09-Aug-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix frozen recovery regression

On LVM2 conversions via lvconvert(8), the target keeps mapped devices in
frozen state when requesting RAID devices be resynchronized. This
applies to e.g. adding legs to a raid1 device or taking over from raid0
to raid4 when the rebuild flag's set on the new raid1 legs or the added
dedicated parity stripe.

Also, fix frozen recovery for reshaping as well.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2a034ec1 03-Aug-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix use of wrong status char during resynchronization

During a resynchronization, device status char 'a' is output on the raid
status line for every device of a RAID set. It changes from 'a' to 'A'
(unless device failure) when the resynchronization completes.

Interrupting and restarting a resynchronization, by reloading the DM
table, erroneously lead to status char 'A'.

Fix this by avoiding setting the MD_RECOVERY_REQUESTED flag in
raid_preresume().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# b2a4872a 03-Aug-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: constructor fails on non-zero incompat_features

When lvm2 userspace requests a RaidLV repair, it sets the rebuild
constructor flag on the new replacement DataLVs but does not clear the
respective MetaLVs. Hence the superblock that is loaded from such new
MetaLVs may have a non-zero incompat_features member and the constructor
will fail with false-positive on incompat_features.

Solve by initializing the incompat_features member properly.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f15f64d6 27-Jul-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix processing of max_recovery_rate constructor flag

__CTR_FLAG_MIN_RECOVERY_RATE was used instead of __CTR_FLAG_MAX_RECOVERY_RATE
thus causing max_recovery_rate to be rejected in case min_recovery_rate
was already set.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 89d3d9a1 19-Jul-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix random optimal_io_size for raid0

raid_io_hints() was retrieving the number of data stripes used for the
calculation of io_opt from struct r5conf, which is not defined for raid0
mappings.

Base the calculation on the in-core raid_set structure instead.

Also, adjust to use to_bytes() for the sector -> bytes conversion
throughout.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 094f394d 19-Jul-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: address checkpatch.pl complaints

Use 'unsigned int' where appropriate.
Return negative errors.
Correct an indentation.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d7ccc2e2 06-Jul-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: change logical functions to actually return bool

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 32682409 30-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use rdev_for_each in status

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ffeeac75 30-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use rs->raid_disks to avoid memory leaks on free

Also makes code more consistent throughout.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7a7c330f 30-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: support delta_disks for raid1, fix table output

Add "delta_disks" constructor argument support to raid1 to allow for
consistent userspace disk addition/removal handling.

Fix raid_status() to report all raid disks with status and table output
on disk adding reshapes, not just the ones listed on the mddev; optimize
its rebuild and writemostly output.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 469b304b 29-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: enhance reshape check and factor out reshape setup

Enhance rs_reshape_requested() check function to be more transparent and
fix its raid10 check.

Streamline the constructor by factoring out reshaping preparation into
fucntion rs_prepare_reshape().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2a5556c2 27-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: allow resize during recovery

Resizing a RAID set during recovery can be allowed, because the MD
resynchronization thread will either stop any ongoing recovery in case
of shrinking below the current recovery position or carry on recovery
to the new size if the set is growing.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 345a6cdc 24-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix rs_is_recovering() to allow for lvextend

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 37f10be1 24-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix rebuild and catch bogus sync/resync flags

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# b1956dc4 24-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix ctr memory leaks on error paths

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 65359ee6 24-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix typo in write_mostly flag

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4348309a 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: also reject size change during recovery

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f6895fd5 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix new superblock/bitmap creation on disk addition

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2527b56e 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add comments and fix typos

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# fbe6365b 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix raid10 device size error on out-of-place reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2d92a3c2 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: prohibit 'nosync' on new raid6 and reject resize during reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4dff2f1e 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: clarify and fix recovery

Add function rs_setup_recovery() to allow for defined setup of RAID set
recovery in the constructor.

Will be called with dev_sectors={0, rdev->sectors, MaxSectors} to
recover a new or enforced sync, grown or not to be synhronized RAID set
respectively.

Prevents recovery on raid0, which doesn't support it.

Enforces recovery on raid6 to ensure properly defined Syndromes
mandatory for that MD personality are being created.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0095dbc9 23-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix rs_set_capacity on growing reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 9d9d939c 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: make rs_set_capacity to work on shrinking reshape

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6ee0bae9 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: enhance comments in takeover checks

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ae3c6cff 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: remove bogus comment and fix comment typos

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 75dd3b9e 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: more restricting data_offset value checks

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5fa146b2 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: reject too many write_mostly devices

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0a7b8188 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: the sync_page_io() metadata_op argument is bool

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0d851d14 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: prohibit to pass in both sync and nosync ctr flags

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ff4a88bf 15-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: avoid superfluous memory barriers on static metadata

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 68c1c4d5 16-Jun-2016 Arnd Bergmann <arnd@arndb.de>

dm raid: don't use 'const' in function return

A newly introduced function has 'const int' as the return type,
but as "make W=1" reports, that has no meaning:

drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]

This changes the return type to plain 'int'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33e53f06850f ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6e20902e 14-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix failed takeover/reshapes by keeping raid set frozen

Superblock updates where bogus causing some takovers/reshapes to fail.

Introduce new runtime flag (RT_FLAG_KEEP_RS_FROZEN) to keep a raid set
frozen when a layout change was requested. Userpace will immediately
reload the table w/o the flags requesting such change once they made it
to the superblocks and any change of recovery/reshape offsets has to be
avoided until after read.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4257e085 13-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: support to change bitmap region size

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 9dbd1aa3 13-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add reshaping support to the target

Add bool functions rs_is_recovering and rs_is_reshaping()
to test for ongoing recovery/reshaping respectively in order
to reject respective requests on ongoing ones.

Remove ctr array size check, because ti->len and array
sectors will differ during disk addition/removal reshape.

Use __is_raid10_near() rather than type string compare.

Introduce rs_check_reshape() and rs_start_reshape(),
use the former in the ctr to reject bogus rehsape requests
and the latter in preresume to actually start a reshape.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 40ba37e5 13-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add prerequisite functions and definitions for reshaping

Add rs_is_reshapable(), rs_data_stripes(), rs_reshape_requested(),
rs_set_dev_and_array_sectors() and rs_adjust_data_offsets()

Remove superfluous check for reshape message

Correct runtime bit definitions to be incremental

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# a30cbc0d 09-Jun-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: inverse check for flags from invalid to valid flags

It is more intuitive to manage each raid level's features in terms of
what is supported rather than what isn't supported.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# e6ca5e1a 02-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: various code cleanups

Renamed functions and variables with leading single underscore to have a
double underscore. Renamed some functions to have better names. Folded
functions that were split out without reason.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# bfcee0e3 02-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: rename functions that alloc and free struct raid_set

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4286325b 01-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: remove all the bitops wrappers

Removes obfuscation that is of little value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# bb91a63f 01-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: rename _in_range to __within_range

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ef9b85a6 01-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: add missing "dm-raid0" module alias

Also update module description to "raid0/1/10/4/5/6 target"

Reported by Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 3fa6cf38 02-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: rename _argname_by_flag to dm_raid_arg_name_by_flag

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 9b6e5423 02-Jun-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: bump to v1.9.0 and make the extended SB feature flag reflect it

No idea what Heinz was doing with the versioning but upstream commit
4c9971ca6a ("dm raid: make sure no feature flags are set in metadata")
bumped to 1.8.0 already.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# bd83a4c4 31-May-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: remove ti_error_* wrappers

There ti_error_* wrappers added very little. No other DM target has
ever gone to such lengths to wrap setting ti->error.

Also fixes some NULL derefences via rs->ti->error.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 43157840 30-May-2016 Mike Snitzer <snitzer@redhat.com>

dm raid: tabify appropriate whitespace

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 3a1c1ef2 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: enhance status interface and fixup takeover/raid0

The target's status interface has to provide the new 'data_offset' value
to allow userspace to retrieve the kernels offset to the data on each
raid device of a raid set. This is the base for out-of-place reshaping
required to not write over any data during reshaping (e.g. change
raid6_zr -> raid6_nc):

- add rs_set_cur() to be able to start up existing array in case of no
takeover; use in ctr on takeover check

- enhance raid_status()

- add supporting functions to get resync/reshape progress and raid
device status chars

- fixup rebuild table line output race, which does miss to emit
'rebuild N' on fully synced/rebuild devices, because it is relying on
the transient 'In_sync' raid device flag

- add new status line output for 'data_offset', which'll later be used
for out-of-place reshaping

- fixup takeover not working for all levels

- fixup raid0 message interface oops caused by missing checks
for the md threads, which don't exist in case of raid0

- remove ALL_FREEZE_FLAGS not needed for takeover

- adjust comments

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ecbfb9f1 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add raid level takeover support

Add raid level takeover support allowing arbitrary takeovers between
raid levels supported by md personalities (i.e. raid0, raid1/10 and
raid4/5/6):

- add rs_config_{backup|restore} function to allow for temporary
storing ctr requested layout changes and restore them for takeover
conersion decision after the superblocks got loaded and analyzed

- add members to store layout to 'struct raid_set' (not mandatory
for takeover but needed for reshape in later patch)

- add rebuild_disks bitfield to 'struct raid_set' and set bits in ctr
to use in setting up takeover (base to address a 'rebuild' related
raid_status() table line bug and needed as well for reshape in future
patch)

- add runtime flags and respective manipulation functions to be able to
control e.g. wrting of superlocks to the preresume function on
takeover and (later) reshape

- add functions to detect takeover, check it's valid (mandatory here to
avoid failing on md_run()), setup for it and use in the ctr; those
will be likely moved out once reshaping gets added to simplify the
ctr

- start raid set readonly in ctr and switch to readwrite, optionally
updating superblocks, in preresume in order to allow suspend to
quiesce any active table before (which involves superblock updates);
this ensures the proper sequence of writing the current and any new
takeover(/reshape) metadata

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7b34df74 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: enhance super_sync() to support new superblock members

Add transferring the new takeover/reshape related superblock
members introduced to the super_sync() function:

- add/move supporting functions

- add failed devices bitfield transfer functions to retrieve the
bitfield from superblock format or update it in the superblock

- add code to transfer all new members

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4763e543 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add new reshaping/raid10 format table line options to parameter parser

Support the follwoing arguments in the ctr parameter parser:

- add 'delta_disks', 'data_offset' taking int and sector respectively

- 'raid10_use_near_sets' bool argument to optionally select
near sets with supporting raid10 mappings

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 33e53f06 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: introduce extended superblock and new raid types to support takeover/reshaping

Add new members to the dm-raid superblock and new raid types to support
takeover/reshape.

Add all necessary members needed to support takeover and reshape in one
go -- aiming to limit the amount of changes to the superblock layout.

This is a larger patch due to the new superblock members, their related
flags, validation of both and involved API additions/changes:

- add additional members to keep track of:
- state about forward/backward reshaping
- reshape position
- new level, layout, stripe size and delta disks
- data offset to current and new data for out-of-place reshapes
- failed devices bitfield extensions to keep track of max raid devices

- adjust super_validate() to cope with new superblock members

- adjust super_init_validation() to cope with new superblock members

- add definitions for ctr flags supporting delta disks etc.

- add new raid types (raid6_n_6 etc.)

- add new raid10 supporting function API (_is_raid10_*())

- adjust to changed raid10 supporting function API

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 676fa5ad 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use rt_is_raid*() in all appropriate checks

Make use if raid type rt_is_*() bool functions for simplification and
consistency reasons.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ad51d7f1 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: more use of flag testing wrappers

- add _test_flags() function

- use it to simplify rs_check_for_invalid_flags()

- use _test_flag() throughout

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f090279e 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: check constructor arguments for invalid raid level/argument combinations

Reject invalid flag combinations to avoid potential data corruption or
failing raid set construction:

- add definitions for constructor flag combinations and invalid flags
per level

- add bool test functions for the various raid types
(also will be used by future reshaping enhancements)

- introduce rs_check_for_invalid_flags() and _invalid_flags()
to perform the validity checks

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 702108d1 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: cleanup / provide infrastructure

Provide necessary infrastructure to handle ctr flags and their names
and cleanup setting ti->error:

- comment constructor flags

- introduce constructor flag manipulation

- introduce ti_error_*() functions to simplify
setting the error message (use in other targets?)

- introduce array to hold ctr flag <-> flag name mapping

- introduce argument name by flag functions for that array

- use those functions throughout the ctr call path

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 92c83d79 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: use dm_arg_set API in constructor

- use dm_arg_set API in ctr and its callees parse_raid_params() and dev_parms()

- introduce _in_range() function to check a value is in a [ min, max ] range;
this is to support more callers in parsing parameters etc. in the future

- correct comment on MAX_RAID_DEVICES

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 73c6f239 19-May-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: rename variable 'ret' to 'r' to conform to other dm code

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 796a5cf0 05-Jun-2016 Mike Christie <mchristi@redhat.com>

md: use bio op accessors

Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4c9971ca 29-Apr-2016 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: make sure no feature flags are set in metadata

Given we don't yet support any feature flags in the dm-raid ondisk
metadata (see: 'features' member of 'struct dm_raid_superblock'),
add a check to ensure no flags are actually set, if any features are
set reject the activation of the RAID mapping.

This is to prevent possible data corruption in case of a kernel
downgrade when there'll potentially be feature flags set by a future
dm-raid target.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 042745ee 02-Oct-2015 Mikulas Patocka <mpatocka@redhat.com>

dm raid: fix round up of default region size

Commit 3a0f9aaee028 ("dm raid: round region_size to power of two")
intended to make sure that the default region size is a power of two.
However, the logic in that commit is incorrect and sets the variable
region_size to 0 or 1, depending on whether min_region_size is a power
of two.

Fix this logic, using roundup_pow_of_two(), so that region_size is
properly rounded up to the next power of two.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3a0f9aaee028 ("dm raid: round region_size to power of two")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 8ae12666 28-Apr-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: kill merge_bvec_fn() completely

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0cf45031 29-Apr-2015 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add support for the MD RAID0 personality

Add dm-raid access to the MD RAID0 personality to enable single zone
striping.

The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
to allow the raid0 personality to calculate the MD array
size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
(wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
so that userspace checks for 100% sync will succeed and allow
for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
that resulted from bio payloads that grew too large. This problem
did not occur with the other raid levels because it either did not
apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c76d53f4 29-Apr-2015 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: a few cleanups

- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 0f4106b3 29-Apr-2015 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fixup documentation for discard support

Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.

Also, backfill dm-raid target version 1.6.0 documentation.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 3ca5a21a 29-May-2014 Dan Carpenter <dan.carpenter@oracle.com>

dm raid: fix a couple integer overflows

My static checker complains that if "num_raid_params" is UINT_MAX then
the "if (num_raid_params + 1 > argc) {" check doesn't work as intended.

The other change is that I moved the "if (argc != (num_raid_devs * 2))"
condition forward a few lines so it was before the call to
context_alloc(). If we had an integer overflow inside that function
then it would lead to an immediate crash.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5c675f83 14-Dec-2014 NeilBrown <neilb@suse.de>

md: make ->congested robust against personality changes.

There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>


# d20c4b08 29-Oct-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: fix inaccessible superblocks causing oops in configure_discard_support

Commit 48cf06bc5f ("dm raid: add discard support for RAID levels 4, 5
and 6") did not properly handle missing metadata device(s). A failing
read of the superblock causes the metadata and data devices to be
removed from the dev array in struct raid_set, setting references to
both devices to NULL. configure_discard_support() nonetheless tries to
access the data dev unconditionally causing an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 40d43c4b 17-Oct-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: ensure superblock's size matches device's logical block size

The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.

Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.

Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work. Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.

[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 48cf06bc 24-Sep-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add discard support for RAID levels 4, 5 and 6

In case of RAID levels 4, 5 and 6 we have to verify each RAID members'
ability to zero data on discards to avoid stripe data corruption -- if
discard_zeroes_data is not set for each RAID member discard support must
be disabled. But given the uncertainty of whether or not a RAID member
properly supports zeroing data on discard we require the user to
explicitly allow discard support on RAID levels 4, 5, and 6 by setting
a dm-raid module paramter, e.g.: dm-raid.devices_handle_discard_safely=Y
Otherwise, discards could cause data corruption on RAID4/5/6.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 75b8e04b 24-Sep-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm raid: add discard support for RAID levels 1 and 10

Discard support is not enabled for RAID levels 4, 5, and 6 at this time
due to concerns about unreliable discard_zeroes_data support on some
hardware. Otherwise, discards could cause stripe data corruption
(classic example of bad apples spoiling the bunch).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c4a39551 25-Jun-2013 Jonathan Brassow <jbrassow@redhat.com>

MD: Remember the last sync operation that was performed

MD: Remember the last sync operation that was performed

This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired. etc.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# b29bebd6 01-Jun-2013 Jingoo Han <jg1.han@samsung.com>

md: replace strict_strto*() with kstrto*()

The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 3f6bbd3f 08-May-2013 NeilBrown <neilb@suse.de>

dm-raid: silence compiler warning on rebuilds_per_group.

This doesn't really need to be initialised, but it doesn't hurt,
silences the compiler, and as it is a counter it makes sense for it to
start at zero.

Signed-off-by: NeilBrown <neilb@suse.de>


# a4dc163a 08-May-2013 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Fix raid_resume not reviving failed devices in all cases

DM RAID: Fix raid_resume not reviving failed devices in all cases

When a device fails in a RAID array, it is marked as Faulty. Later,
md_check_recovery is called which (through the call chain) calls
'hot_remove_disk' in order to have the personalities remove the device
from use in the array.

Sometimes, it is possible for the array to be suspended before the
personalities get their chance to perform 'hot_remove_disk'. This is
normally not an issue. If the array is deactivated, then the failed
device will be noticed when the array is reinstantiated. If the
array is resumed and the disk is still missing, md_check_recovery will
be called upon resume and 'hot_remove_disk' will be called at that
time. However, (for dm-raid) if the device has been restored,
a resume on the array would cause it to attempt to revive the device
by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called,
a situation is then created where the device is thought to concurrently
be the replacement and the device to be replaced. Thus, the device
is first sync'ed with the rest of the array (because it is the replacement
device) and then marked Faulty and removed from the array (because
it is also the device being replaced).

The solution is to check and see if the device had properly been removed
before the array was suspended. This is done by seeing whether the
device's 'raid_disk' field is -1 - a condition that implies that
'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
-> hot_remove_disk' has been called. If 'raid_disk' is not -1, then
'hot_remove_disk' must be called to complete the removal of the previously
faulty device before it can be revived via 'hot_add_disk'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# f381e71b 08-May-2013 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Break-up untidy function

DM RAID: Break-up untidy function

Clean-up excessive indentation by moving some code in raid_resume()
into its own function.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 9092c02d 02-May-2013 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Add ability to restore transiently failed devices on resume

DM RAID: Add ability to restore transiently failed devices on resume

This patch adds code to the resume function to check over the devices
in the RAID array. If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array. This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# be83651f 23-Apr-2013 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Add message/status support for changing sync action

DM RAID: Add message/status support for changing sync action

This patch adds a message interface to dm-raid to allow the user to more
finely control the sync actions being performed by the MD driver. This
gives the user the ability to initiate "check" and "repair" (i.e. scrubbing).
Two additional fields have been appended to the status output to provide more
information about the type of sync action occurring and the results of those
actions, specifically: <sync_action> and <mismatch_cnt>. These new fields
will always be populated. This is essentially the device-mapper way of doing
what MD controls through the 'sync_action' sysfs file and shows through the
'mismatch_cnt' sysfs file.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 55a62eef 01-Mar-2013 Alasdair G Kergon <agk@redhat.com>

dm: rename request variables to bios

Use 'bio' in the name of variables and functions that deal with
bios rather than 'request' to avoid confusion with the normal
block layer use of 'request'.

No functional changes.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# fd7c092e 01-Mar-2013 Mikulas Patocka <mpatocka@redhat.com>

dm: fix truncated status strings

Avoid returning a truncated table or status string instead of setting
the DM_BUFFER_FULL_FLAG when the last target of a table fills the
buffer.

When processing a table or status request, the function retrieve_status
calls ti->type->status. If ti->type->status returns non-zero,
retrieve_status assumes that the buffer overflowed and sets
DM_BUFFER_FULL_FLAG.

However, targets don't return non-zero values from their status method
on overflow. Most targets returns always zero.

If a buffer overflow happens in a target that is not the last in the
table, it gets noticed during the next iteration of the loop in
retrieve_status; but if a buffer overflow happens in the last target, it
goes unnoticed and erroneously truncated data is returned.

In the current code, the targets behave in the following way:
* dm-crypt returns -ENOMEM if there is not enough space to store the
key, but it returns 0 on all other overflows.
* dm-thin returns errors from the status method if a disk error happened.
This is incorrect because retrieve_status doesn't check the error
code, it assumes that all non-zero values mean buffer overflow.
* all the other targets always return 0.

This patch changes the ti->type->status function to return void (because
most targets don't use the return code). Overflow is detected in
retrieve_status: if the status method fills up the remaining space
completely, it is assumed that buffer overflow happened.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# fe5d2f4a 20-Feb-2013 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms

DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms

Until now, dm-raid.c only supported the "near" algorthm of MD's RAID10
implementation. This patch adds support for the "far" and "offset"
algorithms, but only with the improved redundancy that is brought with
the introduction of the 'use_far_sets' bit, which shifts copied stripes
according to smaller sets vs the entire array. That is, the 17th bit
of the 'layout' variable that defines the RAID10 implementation will
always be set. (More information on how the 'layout' variable selects
the RAID10 algorithm can be found in the opening comments of
drivers/md/raid10.c.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 55ebbb59 22-Jan-2013 Jonathan Brassow <jbrassow@redhat.com>

DM-RAID: Fix RAID10's check for sufficient redundancy

Before attempting to activate a RAID array, it is checked for sufficient
redundancy. That is, we make sure that there are not too many failed
devices - or devices specified for rebuild - to undermine our ability to
activate the array. The current code performs this check twice - once to
ensure there were not too many devices specified for rebuild by the user
('validate_rebuild_devices') and again after possibly experiencing a failure
to read the superblock ('analyse_superblocks'). Neither of these checks are
sufficient. The first check is done properly but with insufficient
information about the possible failure state of the devices to make a good
determination if the array can be activated. The second check is simply
done wrong in the case of RAID10 because it doesn't account for the
independence of the stripes (i.e. mirror sets). The solution is to use the
properly written check ('validate_rebuild_devices'), but perform the check
after the superblocks have been read and we know which devices have failed.
This gives us one check instead of two and performs it in a location where
it can be done right.

Only RAID10 was affected and it was affected in the following ways:
- the code did not properly catch the condition where a user specified
a device for rebuild that already had a failed device in the same mirror
set. (This condition would, however, be caught at a deeper level in MD.)
- the code triggers a false positive and denies activation when devices in
independent mirror sets have failed - counting the failures as though they
were all in the same set.

The most likely place this error was introduced (or this patch should have
been included) is in commit 4ec1e369 - first introduced in v3.7-rc1.
Consequently this fix should also go in v3.7.y, however there is a
small conflict on the .version in raid_target, so I'll submit a
separate patch to -stable.

Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 7de3ee57 21-Dec-2012 Mikulas Patocka <mpatocka@redhat.com>

dm: remove map_info

This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 3a0f9aae 21-Dec-2012 Jonathan Brassow <jbrassow@redhat.com>

dm raid: round region_size to power of two

If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically. If this is not a power of 2,
the md code will report an error later.

This patch catches the problem early and rounds the region_size to the
next power of two.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 761becff 10-Oct-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Fix for "sync" directive ineffectiveness

There are two table arguments that can be given to a DM RAID target
that control whether the array is forced to (re)synchronize or skip
initialization: "sync" and "nosync". When "sync" is given, we set
mddev->recovery_cp to 0 in order to cause the device to resynchronize.
This is insufficient if there is a bitmap in use, because the array
will simply look at the bitmap and see that there is no recovery
necessary.

The fix is to skip over the loading of the superblocks when "sync" is
given, causing new superblocks to be written that will force the array
to go through initialization (i.e. synchronization).

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 7386199c 10-Oct-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Fix comparison of index and quantity for "rebuild" parameter

DM RAID: Fix comparison of index and quantity for "rebuild" parameter

The "rebuild" parameter takes an index argument that starts counting from
zero. The conditional used to validate the index was using '>' rather than
'>=', leaving the door open for an index value that would be 1 too large.

Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 4ec1e369 10-Oct-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Add rebuild capability for RAID10

DM RAID: Add code to validate replacement slots for RAID10 arrays

RAID10 can handle 'copies - 1' failures for each mirror group. This code
ensures the user has provided a valid array - one whose devices specified for
rebuild do not exceed the amount of redundancy available.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# eb649123 10-Oct-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Move 'rebuild' checking code to its own function

DM RAID: Move chunk of code to it's own function

The code that checks whether device replacements/rebuilds are possible given
a specific RAID type is moved to it's own function. It will further expand
when the code to check RAID10 is added. A separate function makes it easier
to read.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 63f33b8d 31-Jul-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Add support for MD RAID10

Support the MD RAID10 personality through dm-raid.c

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 1f4e0ff0 27-Jul-2012 Alasdair G Kergon <agk@redhat.com>

dm thin: commit before gathering status

Commit outstanding metadata before returning the status for a dm thin
pool so that the numbers reported are as up-to-date as possible.

The commit is not performed if the device is suspended or if
the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target
through a new 'status_flags' parameter in the target's dm_status_fn.

The userspace dmsetup tool will support the --noflush flag with the
'dmsetup status' and 'dmsetup wait' commands from version 1.02.76
onwards.

Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# c039c332 27-Jul-2012 Jonathan E Brassow <jbrassow@redhat.com>

dm raid: move sectors_per_dev calculation

In preparation for RAID10 inclusion in dm-raid, we move the sectors_per_dev
calculation later in the device creation process. This is because we won't
know up-front how many stripes vs how many mirrors there are which will
change the calculation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# f999e8fe 27-Jul-2012 Jonathan E Brassow <jbrassow@redhat.com>

dm raid: restructure parse_raid_params

In preparation for RAID10 addition to dm-raid, we change an 'if' conditional
to a 'switch' conditional to make it easier to see what is being checked for
each RAID type.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 542f9038 27-Jul-2012 Mike Snitzer <snitzer@redhat.com>

dm: support non power of two target max_io_len

Remove the restriction that limits a target's specified maximum incoming
I/O size to be a power of 2.

Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'.
Change it from sector_t to uint32_t, which is plenty big enough, and
introduce a wrapper function dm_set_target_max_io_len() to set it.
Use sector_div() to process it now that it is not necessarily a power of 2.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# c32fb9e7 21-May-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Use md_error() in place of simply setting Faulty bit

When encountering an error while reading the superblock, call md_error.

We are currently setting the 'Faulty' bit on one of the array devices when an
error is encountered while reading the superblock of a dm-raid array. We should
be calling md_error(), as it handles the error more completely.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 81f382f9 21-May-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Record and handle missing devices

Missing dm-raid devices should be recorded in the superblock

When specifying the devices that compose a DM RAID array, it is possible to denote
failed or missing devices with '-'s. When this occurs, we must record this in the
superblock. We do this by checking if the array position's data device is missing
and then forcing MD to record the superblock by setting 'MD_CHANGE_DEVS' in
'raid_resume'. If we do not cause the superblock to be rewritten by the resume
function, it is possible for a stale superblock to be written by an out-going
in-active table (during 'raid_dtr').

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 47525e59 21-May-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Set recovery flags on resume

Properly initialize MD recovery flags when resuming device-mapper devices.

When a device-mapper device is suspended, all I/O must stop. This is done by
calling 'md_stop_writes' and 'mddev_suspend'. These calls in-turn manipulate
the recovery flags - including setting 'MD_RECOVERY_FROZEN'. The DM device
may have been suspended while recovery was not yet complete, so the process
needs to pick-up where it left off. Since 'mddev_resume' does not unset
'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
(e.g. It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
setting 'MD_RECOVERY_FROZEN'. Clearing it in 'mddev_resume' would override the
desired behavior.)

Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.

Also clean up where level_store calls mddev_resume() - it current
duplicates some of the funcitons of that call. - NB

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 545c8795 21-May-2012 NeilBrown <neilb@suse.de>

md: dm-raid should call helper function to clear rdev.

dm-raid currently open-codes the freeing of some members of
and rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.

So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c

Signed-off-by: NeilBrown <neilb@suse.de>


# a9ad8526 23-Apr-2012 Jonathan Brassow <jbrassow@redhat.com>

DM RAID: Use safe version of rdev_for_each

Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe

Commit dafb20fa34320a472deb7442f25a0c086e0feb33 mistakenly replaced a safe
iterator with an unsafe one when making some macro changes.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 0447568f 28-Mar-2012 Jonathan E Brassow <jbrassow@redhat.com>

dm raid: handle failed devices during start up

The dm-raid code currently fails to create a RAID array if any of the
superblocks cannot be read. This was an oversight as there is already
code to handle this case if the values ('- -') were provided for the
failed array position.

With this patch, if a superblock cannot be read, the array position's
fields are initialized as though '- -' was set in the table. That is,
the device is failed and the position should not be used, but if there
is sufficient redundancy, the array should still be activated.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# dafb20fa 18-Mar-2012 NeilBrown <neilb@suse.de>

md: tidy up rdev_for_each usage.

md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.

So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>


# 0ca93de9 07-Mar-2012 Jonathan E Brassow <jbrassow@redhat.com>

dm raid: fix flush support

Fix dm-raid flush support.

Both md and dm have support for flush, but the dm-raid target
forgot to set the flag to indicate that flushes should be
passed on. (Important for data integrity e.g. with writeback cache
enabled.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 3aa3b2b2 07-Mar-2012 Jonathan E Brassow <jbrassow@redhat.com>

dm raid: set MD_CHANGE_DEVS when rebuilding

The 'rebuild' parameter is used to rebuild individual devices in an
array (e.g. resynchronize a RAID1 device or recalculate a parity device
in higher RAID). The MD_CHANGE_DEVS flag must be set when this
parameter is given in order to write out the superblocks and make the
change take immediate effect. The code that handles new devices in
super_load already sets MD_CHANGE_DEVS and 'FirstUse'. (The 'FirstUse'
flag was being set as a special case for rebuilds in
super_init_validation.)

Add a condition for rebuilds in super_load to take care of both flags
without the special case in 'super_init_validation'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 34f8ac6d 27-Jan-2012 Jonathan Brassow <jbrassow@redhat.com>

Prevent DM RAID from loading bitmap twice.

The life cycle of a device-mapper target is:
1) create
2) resume
3) suspend
*) possibly repeat from 2
4) destroy

The dm-raid target is unconditionally calling MD's bitmap_load function upon
every resume. If steps 2 & 3 above are repeated, bitmap_load is called
multiple times. It is only written to be called once; otherwise, it allocates
new memory for the bitmap (without freeing the old) and incrementing the number
of pages it thinks it has without zeroing first. This ultimately leads to
access beyond allocated memory and lost memory.

Simply avoiding the bitmap_load call upon resume is not sufficient. If the
target was suspended while the initial recovery was only partially complete,
it needs to be restarted when the target is resumed. This is why
'md_wakeup_thread' is called before issuing the 'mddev_resume'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


# 056075c7 03-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

md: Add module.h to all files using it implicitly

A pending cleanup will mean that module.h won't be implicitly
everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 2e727c3c 31-Oct-2011 Jonathan E Brassow <jbrassow@redhat.com>

dm: raid fix device status indicator when array initializing

When devices in a RAID array are not in-sync, they are supposed to be
reported as such in the status output as an 'a' character, which means
"alive, but not in-sync". But when the entire array is rebuilt 'A' is
being used, which is incorrect. This patch corrects this to 'a'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# d1688a6d 10-Oct-2011 NeilBrown <neilb@suse.de>

md/raid5: typedef removal: raid5_conf_t -> struct r5conf

Signed-off-by: NeilBrown <neilb@suse.de>


# fd01b88c 10-Oct-2011 NeilBrown <neilb@suse.de>

md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>


# 3cb03002 10-Oct-2011 NeilBrown <neilb@suse.de>

md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>


# 82324809 25-Sep-2011 Jonthan Brassow <jbrassow@redhat.com>

dm: raid fix write_mostly arg validation

Fix off-by-one error in validation of write_mostly.

The user-supplied value given for the 'write_mostly' argument must be an
index starting at 0. The validation of the supplied argument failed to
check for 'N' ('>' vs '>='), which would have caused an access beyond the
end of the array.

Reported-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 32737279 01-Aug-2011 Jonathan Brassow <jbrassow@redhat.com>

dm raid: add md raid1 support

Support the MD RAID1 personality through dm-raid.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# b12d437b 01-Aug-2011 Jonathan Brassow <jbrassow@redhat.com>

dm raid: support metadata devices

Add the ability to parse and use metadata devices to dm-raid. Although
not strictly required, without the metadata devices, many features of
RAID are unavailable. They are used to store a superblock and bitmap.

The role, or position in the array, of each device must be recorded in
its superblock. This is to help with fault handling, array reshaping,
and sanity checks. RAID 4/5/6 devices must be loaded in a specific order:
in this way, the 'array_position' field helps validate the correctness
of the mapping when it is loaded. It can be used during reshaping to
identify which devices are added/removed. Fault handling is impossible
without this field. For example, when a device fails it is recorded in
the superblock. If this is a RAID1 device and the offending device is
removed from the array, there must be a way during subsequent array
assembly to determine that the failed device was the one removed. This
is done by correlating the 'array_position' field and the bit-field
variable 'failed_devices'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 46bed2b5 01-Aug-2011 Jonathan Brassow <jbrassow@redhat.com>

dm raid: add write_mostly parameter

Add the write_mostly parameter to RAID1 dm-raid tables.

This allows the user to set the WriteMostly flag on a RAID1 device that
should normally be avoided for read I/O.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# c1084561 01-Aug-2011 Jonathan Brassow <jbrassow@redhat.com>

dm raid: add region_size parameter

Allow the user to specify the region_size.

Ensures that the supplied value meets md's constraints, viz. the number of
regions does not exceed 2^21.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 3e8dbb7f 01-Aug-2011 Alasdair G Kergon <agk@redhat.com>

dm raid: tidy includes

A dm target only needs to use include/linux dm headers.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 13c87583 01-Aug-2011 Jonathan Brassow <jbrassow@redhat.com>

dm raid: cleanup parameter handling

Re-order the parameters so they are handled consistently in the same order
where defined, parsed and output.

Only include rebuild parameters in the STATUSTYPE_TABLE output if they were
supplied in the original table line.

Correct the parameter count when outputting rebuild: there are two words,
not one.

Use case-independent checks for keywords (as in other device-mapper targets).

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# af1db72d 18-Apr-2011 NeilBrown <neilb@suse.de>

md/dm - remove remains of plug_fn callback.

Now that unplugging is done differently, the unplug_fn callback is
never called, so it can be completely discarded.

Signed-off-by: NeilBrown <neilb@suse.de>


# 7eaceacc 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 9d09e663 13-Jan-2011 NeilBrown <neilb@suse.de>

dm: raid456 basic support

This patch is the skeleton for the DM target that will be
the bridge from DM to MD (initially RAID456 and later RAID1). It
provides a way to use device-mapper interfaces to the MD RAID456
drivers.

As with all device-mapper targets, the nominal public interfaces are the
constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
and STATUSTYPE_TABLE). The CTR table looks like the following:

1: <s> <l> raid \
2: <raid_type> <#raid_params> <raid_params> \
3: <#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN>

Line 1 contains the standard first three arguments to any device-mapper
target - the start, length, and target type fields. The target type in
this case is "raid".

Line 2 contains the arguments that define the particular raid
type/personality/level, the required arguments for that raid type, and
any optional arguments. Possible raid types include: raid4, raid5_la,
raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is
planned for the future.) The list of required and optional parameters
is the same for all the current raid types. The required parameters are
positional, while the optional parameters are given as key/value pairs.
The possible parameters are as follows:
<chunk_size> Chunk size in sectors.
[[no]sync] Force/Prevent RAID initialization
[rebuild <idx>] Rebuild the drive indicated by the index
[daemon_sleep <ms>] Time between bitmap daemon work to clear bits
[min_recovery_rate <kB/sec/disk>] Throttle RAID initialization
[max_recovery_rate <kB/sec/disk>] Throttle RAID initialization
[max_write_behind <value>] See '-write-behind=' (man mdadm)
[stripe_cache <sectors>] Stripe cache size for higher RAIDs

Line 3 contains the list of devices that compose the array in
metadata/data device pairs. If the metadata is stored separately, a '-'
is given for the metadata device position. If a drive has failed or is
missing at creation time, a '-' can be given for both the metadata and
data drives for a given position.

Examples:
# RAID4 - 4 data drives, 1 parity
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \
raid4 1 2048 \
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (no metadata devices)
# Chunk size of 1MiB, force RAID initialization,
# min recovery rate at 20 kiB/sec/disk
0 1960893648 raid \
raid4 4 2048 min_recovery_rate 20 sync\
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

Performing a 'dmsetup table' should display the CTR table used to
construct the mapping (with possible reordering of optional
parameters).

Performing a 'dmsetup status' will yield information on the state and
health of the array. The output is as follows:
1: <s> <l> raid \
2: <raid_type> <#devices> <1 health char for each dev> <resync_ratio>

Line 1 is standard DM output. Line 2 is best shown by example:
0 1960893648 raid raid4 5 AAAAA 2/490221568
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with recovery.

Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>