#
1bbe254e |
|
25-Sep-2023 |
Denis Plotnikov <den-plotnikov@yandex-team.ru> |
md-cluster: check for timeout while a new disk adding A new disk adding may end up with timeout and a new disk won't be added. Add returning the error in that case. Found by Linux Verification Center (linuxtesting.org) with SVACE Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team.ru
|
#
7eb8ff02 |
|
03-Aug-2023 |
Li Lingfeng <lilingfeng3@huawei.com> |
md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
|
#
44693154 |
|
22-May-2023 |
Yu Kuai <yukuai3@huawei.com> |
md: protect md_thread with rcu Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
|
#
12cda13c |
|
15-Aug-2022 |
Alexander Aring <aahringo@redhat.com> |
fs: dlm: remove DLM_LSFL_FS from uapi The DLM_LSFL_FS flag is set in lockspaces created directly for a kernel user, as opposed to those lockspaces created for user space applications. The user space libdlm allowed this flag to be set for lockspaces created from user space, but then used by a kernel user. No kernel user has ever used this method, so remove the ability to do it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
|
#
9e26728b |
|
01-Jul-2022 |
Zhang Jiaming <jiaming@nfschina.com> |
md: Fix spelling mistake in comments There are 2 spelling mistakes in comments. Fix it. Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
92d9aac9 |
|
31-Mar-2022 |
Heming Zhao <heming.zhao@suse.com> |
md: replace deprecated strlcpy & remove duplicated line This commit includes two topics: 1> replace deprecated strlcpy change strlcpy to strscpy for strlcpy is marked as deprecated in Documentation/process/deprecated.rst 2> remove duplicated strlcpy line in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the history: - commit cf921cc19cf7 ("Add node recovery callbacks") introduced the first usage of strlcpy(). - commit b97e92574c0b ("Use separate bitmaps for each nodes in the cluster") introduced the second strlcpy(). this time, the two strlcpy() are same, we can remove anyone safely. - commit d3b178adb3a3 ("md: Skip cluster setup for dm-raid") added dm-raid special handling. And the "nodes" value is the key of this patch. but from this patch, strlcpy() which was introduced by b97e92574c0bf become necessary. - commit 3c462c880b52 ("md: Increment version for clustered bitmaps") used clustered major version to only handle in clustered env. this patch could look a polishment for clustered code logic. So cf921cc19cf7 became useless after d3b178adb3a3a, we could remove it safely. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
|
#
dd3dc5f4 |
|
25-Dec-2021 |
Randy Dunlap <rdunlap@infradead.org> |
md: fix spelling of "its" Use the possessive "its" instead of the contraction "it's" in printed messages. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>
|
#
bca5b065 |
|
19-Nov-2020 |
Zhao Heming <heming.zhao@suse.com> |
md/cluster: fix deadlock when node is doing resync job md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg. During sending msg, node can concurrently receive msg from another node. When node does resync job, grab token_lockres:EX may trigger a deadlock: ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever ``` Explaination: a. A send METADATA_UPDATED This will block another node to send msg b. B does sync jobs, which will send RESYNCING at intervals. This will be block for holding token_lockres:EX lock. c. B do "mdadm --remove", which will send REMOVE. This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. d. B recv METADATA_UPDATED msg, which send from A in step <a>. This will be blocked by step <c>: holding mddev lock, it makes wait_event can't hold mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg") In that commit, step c is "update sb". This patch step c is "mdadm --remove". For fixing this issue, we can refer the solution of function: metadata_update_start. Which does the same grab lock_token action. lock_comm can use the same steps to avoid deadlock. By moving MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm. It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock. Repro steps (I only triggered 3 times with hundreds tests): two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \ --bitmap-chunk=1M ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mkfs.xfs /dev/md0 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` test script will hung when executing "mdadm --remove". ``` # dump stacks by "echo t > /proc/sysrq-trigger" md0_cluster_rec D 0 5329 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? _cond_resched+0x2d/0x40 ? schedule+0x4a/0xb0 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster] ? wait_woken+0x80/0x80 ? process_recvd_msg+0x113/0x1d0 [md_cluster] ? recv_daemon+0x9e/0x120 [md_cluster] ? md_thread+0x94/0x160 [md_mod] ? wait_woken+0x80/0x80 ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 mdadm D 0 5423 1 0x00004004 Call Trace: __schedule+0x1f6/0x560 ? __schedule+0x1fe/0x560 ? schedule+0x4a/0xb0 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster] ? wait_woken+0x80/0x80 ? remove_disk+0x4f/0x90 [md_cluster] ? hot_remove_disk+0xb1/0x1b0 [md_mod] ? md_ioctl+0x50c/0xba0 [md_mod] ? wait_woken+0x80/0x80 ? blkdev_ioctl+0xa2/0x2a0 ? block_ioctl+0x39/0x40 ? ksys_ioctl+0x82/0xc0 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x5f/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 md0_resync D 0 5425 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? schedule+0x4a/0xb0 ? dlm_lock_sync+0xa1/0xd0 [md_cluster] ? wait_woken+0x80/0x80 ? lock_token+0x2d/0x90 [md_cluster] ? resync_info_update+0x95/0x100 [md_cluster] ? raid1_sync_request+0x7d3/0xa40 [raid1] ? md_do_sync.cold+0x737/0xc8f [md_mod] ? md_thread+0x94/0x160 [md_mod] ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 ``` At last, thanks for Xiao's solution. Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Suggested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
94d91e7f |
|
16-Nov-2020 |
Christoph Hellwig <hch@lst.de> |
md: remove a spurious call to revalidate_disk_size in update_size None of the ->resize methods updates the disk size, so calling revalidate_disk_size here won't do anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
2c247c51 |
|
16-Nov-2020 |
Christoph Hellwig <hch@lst.de> |
md: use set_capacity_and_notify Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
1383b347 |
|
26-Sep-2020 |
Zhao Heming <heming.zhao@suse.com> |
md/bitmap: fix memory leak of temporary bitmap Callers of get_bitmap_from_slot() are responsible to free the bitmap. Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
659e56ba |
|
01-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
block: add a new revalidate_disk_size helper revalidate_disk is a relative awkward helper for driver use, as it first calls an optional driver method and then updates the block device size, while most callers either don't need the method call at all, or want to keep state between the caller and the called method. Add a revalidate_disk_size helper that just performs the update of the block device size from the gendisk one, and switch all drivers that do not implement ->revalidate_disk to use the new helper instead of revalidate_disk() Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
e8abe1de |
|
04-Aug-2020 |
Dan Carpenter <dan.carpenter@oracle.com> |
md-cluster: Fix potential error pointer dereference in resize_bitmaps() The error handling calls md_bitmap_free(bitmap) which checks for NULL but will Oops if we pass an error pointer. Let's set "bitmap" to NULL on this error path. Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
60f80d6f |
|
08-Jul-2020 |
Zhao Heming <heming.zhao@suse.com> |
md-cluster: fix wild pointer of unlock_all_bitmaps() reproduction steps: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb node1 # mdadm -G /dev/md0 -b none mdadm: failed to remove clustered bitmap. node1 # mdadm -S --scan ^C <==== mdadm hung & kernel crash ``` kernel stack: ``` [ 335.230657] general protection fault: 0000 [#1] SMP NOPTI [...] [ 335.230848] Call Trace: [ 335.230873] ? unlock_all_bitmaps+0x5/0x70 [md_cluster] [ 335.230886] unlock_all_bitmaps+0x3d/0x70 [md_cluster] [ 335.230899] leave+0x10f/0x190 [md_cluster] [ 335.230932] ? md_super_wait+0x93/0xa0 [md_mod] [ 335.230947] ? leave+0x5/0x190 [md_cluster] [ 335.230973] md_cluster_stop+0x1a/0x30 [md_mod] [ 335.230999] md_bitmap_free+0x142/0x150 [md_mod] [ 335.231013] ? _cond_resched+0x15/0x40 [ 335.231025] ? mutex_lock+0xe/0x30 [ 335.231056] __md_stop+0x1c/0xa0 [md_mod] [ 335.231083] do_md_stop+0x160/0x580 [md_mod] [ 335.231119] ? 0xffffffffc05fb078 [ 335.231148] md_ioctl+0xa04/0x1930 [md_mod] [ 335.231165] ? filename_lookup+0xf2/0x190 [ 335.231179] blkdev_ioctl+0x93c/0xa10 [ 335.231205] ? _cond_resched+0x15/0x40 [ 335.231214] ? __check_object_size+0xd4/0x1a0 [ 335.231224] block_ioctl+0x39/0x40 [ 335.231243] do_vfs_ioctl+0xa0/0x680 [ 335.231253] ksys_ioctl+0x70/0x80 [ 335.231261] __x64_sys_ioctl+0x16/0x20 [ 335.231271] do_syscall_64+0x65/0x1f0 [ 335.231278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
8d7c56d0 |
|
20-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 45 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 11 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520170858.370933192@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
ea89238c |
|
18-Oct-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: remove suspend_info Previously, we allow multiple nodes can resync device, but we had changed it to only support one node can do resync at one time, but suspend_info is still used. Now, let's remove the structure and use suspend_lo/hi to record the range. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
cb9ee154 |
|
18-Oct-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted We need to continue the reshaping if it was interrupted in original node. So original node should call resync_bitmap in case reshaping is aborted. Then BITMAP_NEEDS_SYNC message is broadcasted to other nodes, node which continues the reshaping should restart reshape from mddev->reshape_position instead of from the first beginning. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
cbce6863 |
|
18-Oct-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage When reshape is happening in one node, other nodes could receive lots of RESYNCING messages, so md_bitmap_sync_with_cluster is called. Since the resyncing window is typically small in these RESYNCING messages, so WARN is always triggered, so we should not call the func when reshape is happening. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
5ebaf80b |
|
18-Oct-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: introduce resync_info_get interface for sanity check Since the resync region from suspend_info means one node is reshaping this area, so the position of reshape_progress should be included in the area. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
afd75628 |
|
18-Oct-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster/raid10: resize all the bitmaps before start reshape To support add disk under grow mode, we need to resize all the bitmaps of each node before reshape, so that we can ensure all nodes have the same view of the bitmap of the clustered raid. So after the master node resized the bitmap, it broadcast a message to other slave nodes, and it checks the size of each bitmap are same or not by compare pages. We can only continue the reshaping after all nodes update the bitmap to the same size (by checking the pages), otherwise revert bitmap size to previous value. The resize_bitmaps interface and BITMAP_RESIZE message are introduced in md-cluster.c for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
41a95041 |
|
30-Aug-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: release RESYNC lock after the last resync message All the RESYNC messages are sent with resync lock held, the only exception is resync_finish which releases resync_lockres before send the last resync message, this should be changed as well. Otherwise, we can see deadlock issue as follows: clustermd2-gqjiang2:~ # cat /proc/mdstat Personalities : [raid10] [raid1] md0 : active raid1 sdg[0] sdf[1] 134144 blocks super 1.2 [2/2] [UU] [===================>.] resync = 99.6% (134144/134144) finish=0.0min speed=26K/sec bitmap: 1/1 pages [4KB], 65536KB chunk unused devices: <none> clustermd2-gqjiang2:~ # ps aux|grep md|grep D root 20497 0.0 0.0 0 0 ? D 16:00 0:00 [md0_raid1] clustermd2-gqjiang2:~ # cat /proc/20497/stack [<ffffffffc05ff51e>] dlm_lock_sync+0x8e/0xc0 [md_cluster] [<ffffffffc05ff7e8>] __sendmsg+0x98/0x130 [md_cluster] [<ffffffffc05ff900>] sendmsg+0x20/0x30 [md_cluster] [<ffffffffc05ffc35>] resync_info_update+0xb5/0xc0 [md_cluster] [<ffffffffc0593e84>] md_reap_sync_thread+0x134/0x170 [md_mod] [<ffffffffc059514c>] md_check_recovery+0x28c/0x510 [md_mod] [<ffffffffc060c882>] raid1d+0x42/0x800 [raid1] [<ffffffffc058ab61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9a0a5b3f>] kthread+0xff/0x140 [<ffffffff9a800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff clustermd-gqjiang1:~ # ps aux|grep md|grep D root 20531 0.0 0.0 0 0 ? D 16:00 0:00 [md0_raid1] root 20537 0.0 0.0 0 0 ? D 16:00 0:00 [md0_cluster_rec] root 20676 0.0 0.0 0 0 ? D 16:01 0:00 [md0_resync] clustermd-gqjiang1:~ # cat /proc/mdstat Personalities : [raid10] [raid1] md0 : active raid1 sdf[1] sdg[0] 134144 blocks super 1.2 [2/2] [UU] [===================>.] resync = 97.3% (131072/134144) finish=8076.8min speed=0K/sec bitmap: 1/1 pages [4KB], 65536KB chunk unused devices: <none> clustermd-gqjiang1:~ # cat /proc/20531/stack [<ffffffffc080974d>] metadata_update_start+0xcd/0xd0 [md_cluster] [<ffffffffc079c897>] md_update_sb.part.61+0x97/0x820 [md_mod] [<ffffffffc079f15b>] md_check_recovery+0x29b/0x510 [md_mod] [<ffffffffc0816882>] raid1d+0x42/0x800 [raid1] [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9e0a5b3f>] kthread+0xff/0x140 [<ffffffff9e800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff clustermd-gqjiang1:~ # cat /proc/20537/stack [<ffffffffc0813222>] freeze_array+0xf2/0x140 [raid1] [<ffffffffc080a56e>] recv_daemon+0x41e/0x580 [md_cluster] [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9e0a5b3f>] kthread+0xff/0x140 [<ffffffff9e800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff clustermd-gqjiang1:~ # cat /proc/20676/stack [<ffffffffc080951e>] dlm_lock_sync+0x8e/0xc0 [md_cluster] [<ffffffffc080957f>] lock_token+0x2f/0xa0 [md_cluster] [<ffffffffc0809622>] lock_comm+0x32/0x90 [md_cluster] [<ffffffffc08098f5>] sendmsg+0x15/0x30 [md_cluster] [<ffffffffc0809c0a>] resync_info_update+0x8a/0xc0 [md_cluster] [<ffffffffc08130ba>] raid1_sync_request+0xa9a/0xb10 [raid1] [<ffffffffc079b8ea>] md_do_sync+0xbaa/0xf90 [md_mod] [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9e0a5b3f>] kthread+0xff/0x140 [<ffffffff9e800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
e64e4018 |
|
01-Aug-2018 |
Andy Shevchenko <andriy.shevchenko@linux.intel.com> |
md: Avoid namespace collision with bitmap API bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
#
df8c6764 |
|
02-Jul-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: don't send msg if array is closing If we close an array which resync thread is running, then we don't need the node to send msg since another node would launch the resync thread to continue the rest works. Also send a message is time consuming, we should avoid it. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
0357ba27 |
|
02-Jul-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: show array's status more accurate When resync or recovery is happening in one node, other nodes don't show the appropriate info now. For example, when create an array in master node without "--assume-clean", then assemble the array in slave nodes, you can see "resync=PENDING" when read /proc/mdstat in slave nodes. However, the info is confusing since "PENDING" status is introduced for start array in read-only mode. We introduce RESYNCING_REMOTE flag to indicate that resync thread is running in remote node. The flags is set when node receive RESYNCING msg. And we clear the REMOTE flag in following cases: 1. resync or recover is finished in master node, which means slaves receive msg with both lo and hi are set to 0. 2. node continues resync/recovery in recover_bitmaps. 3. when resync_finish is called. Then we show accurate information in status_resync by check REMOTE flags and with other conditions. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
010228e4 |
|
02-Jul-2018 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: clear another node's suspend_area after the copy is finished When one node leaves cluster or stops the resyncing (resync or recovery) array, then other nodes need to call recover_bitmaps to continue the unfinished task. But we need to clear suspend_area later after other nodes copy the resync information to their bitmap (by call bitmap_copy_from_slot). Otherwise, all nodes could write to the suspend_area even the suspend_area is not handled by any node, because area_resyncing returns 0 at the beginning of raid1_write_request. Which means one node could write suspend_area while another node is resyncing the same area, then data could be inconsistent. So let's clear suspend_area later to avoid above issue with the protection of bm lock. Also it is straightforward to clear suspend_area after nodes have copied the resync info to bitmap. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
6396bb22 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
f0e230ad |
|
24-Oct-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: update document for raid10 Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
b03e0ccb |
|
18-Oct-2017 |
NeilBrown <neilb@suse.com> |
md: remove special meaning of ->quiesce(.., 2) The '2' argument means "wake up anything that is waiting". This is an inelegant part of the design and was added to help support management of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in mddev_suspend/resume, that need is gone. These is still a couple of places where we call 'quiesce' with an argument of '2', but they can safely be changed to call ->quiesce(.., 1); ->quiesce(.., 0) which achieve the same result at the small cost of pausing IO briefly. This removes a small "optimization" from suspend_{hi,lo}_store, but it isn't clear that optimization served a useful purpose. The code now is a lot clearer. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
935fe098 |
|
10-Oct-2017 |
Mike Snitzer <snitzer@redhat.com> |
md: rename some drivers/md/ files to have an "md-" prefix Motivated by the desire to illiminate the imprecise nature of DM-specific patches being unnecessarily sent to both the MD maintainer and mailing-list. Which is born out of the fact that DM files also reside in drivers/md/ Now all MD-specific files in drivers/md/ start with either "raid" or "md-" and the MAINTAINERS file has been updated accordingly. Shaohua: don't change module name Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
7a57157a |
|
03-Oct-2017 |
Colin Ian King <colin.king@canonical.com> |
md-cluster: make function cluster_check_sync_size static The function cluster_check_sync_size is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: symbol 'cluster_check_sync_size' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
2dffdc07 |
|
16-May-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: fix potential lock issue in add_new_disk The add_new_disk returns with communication locked if __sendmsg returns failure, fix it with call unlock_comm before return. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
835d89e9 |
|
14-Apr-2017 |
Christophe JAILLET <christophe.jaillet@wanadoo.fr> |
md-cluster: Fix a memleak in an error handling path We know that 'bm_lockres' is NULL here, so 'lockres_free(bm_lockres)' is a no-op. According to resource handling in case of error a few lines below, it is likely that 'bitmap_free(bitmap)' was expected instead. Fixes: b98938d16a10 ("md-cluster: introduce cluster_check_sync_size") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
818da59f |
|
01-Mar-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: add the support for resize To update size for cluster raid, we need to make sure all nodes can perform the change successfully. However, it is possible that some of them can't do it due to failure (bitmap_resize could fail). So we need to consider the issue before we set the capacity unconditionally, and we use below steps to perform sanity check. 1. A change the size, then broadcast METADATA_UPDATED msg. 2. B and C receive METADATA_UPDATED change the size excepts call set_capacity, sync_size is not update if the change failed. Also call bitmap_update_sb to sync sb to disk. 3. A checks other node's sync_size, if sync_size has been updated in all nodes, then send CHANGE_CAPACITY msg otherwise send msg to revert previous change. 4. B and C call set_capacity if receive CHANGE_CAPACITY msg, otherwise pers->resize will be called to restore the old value. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
b98938d1 |
|
01-Mar-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: introduce cluster_check_sync_size Support resize is a little complex for clustered raid, since we need to ensure all the nodes share the same knowledge about the size of raid. We achieve the goal by check the sync_size which is in each node's bitmap, we can only change the capacity after cluster_check_sync_size returns 0. Also, get_bitmap_from_slot is added to get a slot's bitmap. And we exported some funcs since they are used in cluster_check_sync_size(). We can also reuse get_bitmap_from_slot to remove redundant code existed in bitmap_copy_from_slot. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
7da3d203 |
|
01-Mar-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: add CHANGE_CAPACITY message type The msg type CHANGE_CAPACITY is introduced to support resize clustered raid in later patch, and it is sent after all the nodes have the same sync_size, receiver node just need to set new capacity once received this msg. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
0ba95977 |
|
01-Mar-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: use sync way to handle METADATA_UPDATED msg Previously, when node received METADATA_UPDATED msg, it just need to wakeup mddev->thread, then md_reload_sb will be called eventually. We taken the asynchronous way to avoid a deadlock issue, the deadlock issue could happen when one node is receiving the METADATA_UPDATED msg (wants reconfig_mutex) and trying to run the path: md_check_recovery -> mddev_trylock(hold reconfig_mutex) -> md_update_sb-metadata_update_start (want EX on token however token is got by the sending node) Since we will support resizing for clustered raid, and we need the metadata update handling to be synchronous so that the initiating node can detect failure, so we need to change the way for handling METADATA_UPDATED msg. But, we obviously need to avoid above deadlock with the sync way. To make this happen, we considered to not hold reconfig_mutex to call md_reload_sb, if some other thread has already taken reconfig_mutex and waiting for the 'token', then process_recvd_msg() can safely call md_reload_sb() without taking the mutex. This is because we can be certain that no other thread will take the mutex, and we also certain that the actions performed by md_reload_sb() won't interfere with anything that the other thread is in the middle of. To make this more concrete, we added a new cinfo->state bit MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD Which is set in lock_token() just before dlm_lock_sync() is called, and cleared just after. As lock_token() is always called with reconfig_mutex() held (the specific case is the resync_info_update which is distinguished well in previous patch), if process_recvd_msg() finds that the new bit is set, then the mutex must be held by some other thread, and it will keep waiting. So process_metadata_update() can call md_reload_sb() if either mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set. The tricky bit is what to do if neither of these apply. We need to wait. Fortunately mddev_unlock() always calls wake_up() on mddev->thread->wqueue. So we can get lock_token() to call wake_up() on that when it sets the bit. There are also some related changes inside this commit: 1. remove RELOAD_SB related codes since there are not valid anymore. 2. mddev is added into md_cluster_info then we can get mddev inside lock_token. 3. add new parameter for lock_token to distinguish reconfig_mutex is held or not. And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below: 1. set it before unregister thread, otherwise a deadlock could appear if stop a resyncing array. This is because md_unregister_thread(&cinfo->recv_thread) is blocked by recv_daemon -> process_recvd_msg -> process_metadata_update. To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is also need to be set before unregister thread. 2. set it in metadata_update_start to fix another deadlock. a. Node A sends METADATA_UPDATED msg (held Token lock). b. Node B wants to do resync, and is blocked since it can't get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is not set since the callchain (md_do_sync -> sync_request -> resync_info_update -> sendmsg -> lock_comm -> lock_token) doesn't hold reconfig_mutex. c. Node B trys to update sb (held reconfig_mutex), but stopped at wait_event() in metadata_update_start since we have set MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2). d. Then Node B receives METADATA_UPDATED msg from A, of course recv_daemon is blocked forever. Since metadata_update_start always calls lock_token with reconfig_mutex, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and lock_token don't need to set it twice unless lock_token is invoked from lock_comm. Finally, thanks to Neil for his great idea and help! Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
75df023f |
|
23-Feb-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: remove useless memset from gather_all_resync_info This memset is not needed. The lvb is already zeroed because it was recently allocated by lockres_init, which uses kzalloc(), and read_resync_info() doesn't need it to be zero anyway. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
9c8043f3 |
|
23-Feb-2017 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: free md_cluster_info if node leave cluster To avoid memory leak, we need to free the cinfo which is allocated when node join cluster. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
d6385db9 |
|
11-Aug-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: make resync lock also could be interruptted When one node is perform resync or recovery, other nodes can't get resync lock and could block for a while before it holds the lock, so we can't stop array immediately for this scenario. To make array could be stop quickly, we check MD_CLOSING in dlm_lock_sync_interruptible to make us can interrupt the lock request. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
7bcda714 |
|
11-Aug-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang When some node leaves cluster, then it's bitmap need to be synced by another node, so "md*_recover" thread is triggered for the purpose. However, with below steps. we can find tasks hang happened either in B or C. 1. Node A create a resyncing cluster raid1, assemble it in other two nodes (B and C). 2. stop array in B and C. 3. stop array in A. linux44:~ # ps aux|grep md|grep D root 5938 0.0 0.1 19852 1964 pts/0 D+ 14:52 0:00 mdadm -S md0 root 5939 0.0 0.0 0 0 ? D 14:52 0:00 [md0_recover] linux44:~ # cat /proc/5939/stack [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster] [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster] [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 linux44:~ # cat /proc/5938/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster] [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod] [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod] [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod] [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod] [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0 [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40 [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0 [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b The problem is caused by recover_bitmaps can't reliably abort when the thread is unregistered. So dlm_lock_sync_interruptible is introduced to detect the thread's situation to fix the problem. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
fccb60a4 |
|
11-Aug-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: convert the completion to wait queue Previously, we used completion to sync between require dlm lock and sync_ast, however we will have to expose completion.wait and completion.done in dlm_lock_sync_interruptible (introduced later), it is not a common usage for completion, so convert related things to wait queue. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
5f0aa21d |
|
11-Aug-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: protect md_find_rdev_nr_rcu with rcu lock We need to use rcu_read_lock/unlock to avoid potential race. Reported-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
e3f924d3 |
|
11-Aug-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: remove some unnecessary dlm_unlock_sync Since DLM_LKF_FORCEUNLOCK is used in lockres_free, we don't need to call dlm_unlock_sync before free lock resource. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
400cb454 |
|
11-Aug-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: use FORCEUNLOCK in lockres_free For dlm_unlock, we need to pass flag to dlm_unlock as the third parameter instead of set res->flags. Also, DLM_LKF_FORCEUNLOCK is more suitable for dlm_unlock since it works even the lock is on waiting or convert queue. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
0f6187db |
|
21-Aug-2016 |
Wei Yongjun <weiyongjun1@huawei.com> |
md-cluster: fix error return code in join() Fix to return error code -ENOMEM from the lockres_init() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
1fa9a1ad |
|
03-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: check the return value of process_recvd_msg We don't need to run the full path of recv_daemon if process_recvd_msg doesn't return 0. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
51e453ae |
|
04-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: gather resync infos and enable recv_thread after bitmap is ready The in-memory bitmap is not ready when node joins cluster, so it doesn't make sense to make gather_all_resync_info() called so earlier, we need to call it after the node's bitmap is setup. Also, recv_thread could be wake up after node joins cluster, but it could cause problem if node receives RESYNCING message without persionality since mddev->pers->quiesce is called in process_suspend_info. This commit introduces a new cluster interface load_bitmaps to fix above problems, load_bitmaps is called in bitmap_load where bitmap and persionality are ready, and load_bitmaps does the following tasks: 1. call gather_all_resync_info to load all the node's bitmap info. 2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread could be wake up, and wake up recv_thread if there is pending recv event. Then ack_bast only wakes up recv_thread after IN_CLUSTER bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is set. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
18c9ff7f |
|
02-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: sync bitmap when node received RESYNCING msg If the node received RESYNCING message which means another node will perform resync with the area, then we don't want to do it again in another node. Let's set RESYNC_MASK and clear NEEDED_MASK for the region from old-low to new-low which has finished syncing, and the region from old-hi to new-hi is about to syncing, bitmap_sync_with_cluste is introduced for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
1535212c |
|
02-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: fix locking when node joins cluster during message broadcast If a node joins the cluster while a message broadcast is under way, a lock issue could happen as follows. For a cluster which included two nodes, if node A is calling __sendmsg before up-convert CR to EX on ack, and node B released CR on ack. But if a new node C joins the cluster and it doesn't receive the message which A sent before, so it could hold CR on ack before A up-convert CR to EX on ack. So a node joining the cluster should get an EX lock on the "token" first to ensure no broadcast is ongoing, then release it after held CR on ack. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
5b0fb33e |
|
02-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: unregister thread if err happened The two threads need to be unregistered if a node can't join cluster successfully. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
eb315cd0 |
|
02-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: wake up thread to continue recovery In recovery case, we need to set MD_RECOVERY_NEEDED and wake up thread only if recover is not finished. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
41a9a0dc |
|
02-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: change resync lock from asynchronous to synchronous If multiple nodes choose to attempt do resync at the same time they need to be serialized so they don't duplicate effort. This serialization is done by locking the 'resync' DLM lock. Currently if a node cannot get the lock immediately it doesn't request notification when the lock becomes available (i.e. DLM_LKF_NOQUEUE is set), so it may not reliably find out when it is safe to try again. Rather than trying to arrange an async wake-up when the lock becomes available, switch to using synchronous locking - this is a lot easier to think about. As it is not permitted to block in the 'raid1d' thread, move the locking to the resync thread. So the rsync thread is forked immediately, but it blocks until the resync lock is available. Once the lock is locked it checks again if any resync action is needed. A particular symptom of the current problem is that a node can get stuck with "resync=pending" indefinitely. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
4ac7a65f |
|
22-Jan-2016 |
Shaohua Li <shli@fb.com> |
md-cluster: fix missing memory free There are several places we allocate dlm_lock_resource, but not free it. leave() need free a lock resource too (from Guoqing) Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Guoqing Jiang <gqjiang@suse.com> Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
e19508fa |
|
20-Dec-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY 1. fix unbalanced parentheses. 2. add more description about that MD_CLUSTER_SEND_LOCKED_ALREADY will be cleared after set it in add_new_disk. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
8b9277c8 |
|
20-Dec-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: Protect communication with mutexes Communication can happen through multiple threads. It is possible that one thread steps over another threads sequence. So, we use mutexes to protect both the send and receive sequences. Send communication is locked through state bit, MD_CLUSTER_SEND_LOCK. Communication is locked with bit manipulation in order to allow "lock and hold" for the add operation. In case of an add operation, if the lock is held, MD_CLUSTER_SEND_LOCKED_ALREADY is set. When md_update_sb() calls metadata_update_start(), it checks (in a single statement to avoid races), if the communication is already locked. If yes, it merely returns zero, else it locks the token lockresource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
15858fa5 |
|
20-Dec-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: Defer MD reloading to mddev->thread Reloading of superblock must be performed under reconfig_mutex. However, this cannot be done with md_reload_sb because it would deadlock with the message DLM lock. So, we defer it in md_check_recovery() which is executed by mddev->thread. This introduces a new flag, MD_RELOAD_SB, which if set, will reload the superblock. And good_device_nr is also added to 'struct mddev' which is used to get the num of the good device within cluster raid. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
f6a2dc64 |
|
20-Dec-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: append some actions when change bitmap from clustered to none For clustered raid, we need to do extra actions when change bitmap to none. 1. check if all the bitmap lock could be get or not, if yes then we can continue the change since cluster raid is only active in current node. Otherwise return fail and unlock the related bitmap locks 2. set nodes to 0 and then leave cluster environment. 3. release other nodes's bitmap lock. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
54a88392 |
|
20-Dec-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Fix the remove sequence with the new MD reload code The remove disk message does not need metadata_update_start(), but can be an independent message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
659b254f |
|
20-Dec-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: remove a disk asynchronously from cluster environment For cluster raid, if one disk couldn't be reach in one node, then other nodes would receive the REMOVE message for the disk. In receiving node, we can't call md_kick_rdev_from_array to remove the disk from array synchronously since the disk might still be busy in this node. So let's set a ClusterRemove flag on the disk, then let the thread to do the removal job eventually. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
ac277c6a |
|
20-Dec-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Avoid the resync ping-pong If a RESYNCING message with (0,0) has been sent before, do not send it again. This avoids a resync ping pong between the nodes. We read the bitmap lockresource's LVB to figure out the previous value of the RESYNCING message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
30661b49 |
|
18-Oct-2015 |
NeilBrown <neilb@suse.com> |
md-cluster: remove mddev arg from add_resync_info() The arg isn't used, so its presence is only confusing. Signed-off-by: NeilBrown <neilb@suse.com>
|
#
2e2a7cd9 |
|
18-Oct-2015 |
NeilBrown <neilb@suse.com> |
md-cluster: don't cast void pointers when assigning them. It is common practice in the kernel to leave out this case. It isn't needed and adds little if any value. Signed-off-by: NeilBrown <neilb@suse.com>
|
#
82381523 |
|
18-Oct-2015 |
NeilBrown <neilb@suse.com> |
md-cluster: discard unused sb_mutex. Signed-off-by: NeilBrown <neilb@suse.com>
|
#
cf97a348 |
|
16-Oct-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__ This patches fixes sparse warnings like incorrect type in assignment (different base types), cast to restricted __le64. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
ba2746b0 |
|
15-Oct-2015 |
NeilBrown <neilb@suse.com> |
md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as le32 As cmsg.raid_slot is le32, comparing for >0 is not meaningful. So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot when we know value is valid. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
86b57277 |
|
12-Oct-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: Add 'SUSE' as author for md-cluster.c Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
|
#
aee177ac |
|
12-Oct-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: zero cmsg before it was sent Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
|
#
256f5b24 |
|
12-Oct-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: make sure the node do not receive it's own msg During the past test, the node occasionally received the msg which is sent from itself, this case should not happen in theory, but it is better to avoid it in case something wrong happened. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
487cf914 |
|
12-Oct-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: remove unnecessary setting for slot Since slot will be set within _sendmsg, we can remove the redundant code in resync_info_update. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
|
#
faeff83f |
|
12-Oct-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: make other members of cluster_msg is handled by little endian funcs Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
|
#
d216711b |
|
12-Oct-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Do not printk() every received message The receive daemon prints kernel messages for every network message received. This would fill the kernel message log with unnecessary messages. Remove the pr_info() messages. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
dbb64f86 |
|
01-Oct-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Fix adding of new disk with new reload code Adding the disk worked incorrectly with the new reload code. Fix it: - No operation should be performed on rdev marked as Candidate - After a metadata update operation, kick disk if role is 0xfffe else clear Candidate bit and continue with the regular change check. - Saving the mode of the lock resource to check if token lock is already locked, because it can be called twice while adding a disk. However, unlock_comm() must be called only once. - add_new_disk() is called by the node initiating the --add operation. If it needs to be canceled, call add_new_disk_cancel(). The operation is completed by md_update_sb() which will write and unlock the communication. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
c186b128 |
|
30-Sep-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Perform resync/recovery under a DLM lock Resync or recovery must be performed by only one node at a time. A DLM lock resource, resync_lockres provides the mutual exclusion so that only one node performs the recovery/resync at a time. If a node is unable to get the resync_lockres, because recovery is being performed by another node, it set MD_RECOVER_NEEDED so as to schedule recovery in the future. Remove the debug message in resync_info_update() used during development. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
70bcecdb |
|
21-Aug-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Improve md_reload_sb to be less error prone md_reload_sb is too simplistic and it explicitly needs to determine the changes made by the writing node. However, there are multiple areas where a simple reload could fail. Instead, read the superblock of one of the "good" rdevs and update the necessary information: - read the superblock into a newly allocated page, by temporarily swapping out rdev->sb_page and calling ->load_super. - if that fails return - if it succeeds, call check_sb_changes 1. iterates over list of active devices and checks the matching dev_roles[] value. If that is 'faulty', the device must be marked as faulty - call md_error to mark the device as faulty. Make sure not to set CHANGE_DEVS and wakeup mddev->thread or else it would initiate a resync process, which is the responsibility of the "primary" node. - clear the Blocked bit - Call remove_and_add_spares() to hot remove the device. If the device is 'spare': - call remove_and_add_spares() to get the number of spares added in this operation. - Reduce mddev->degraded to mark the array as not degraded. 2. reset recovery_cp - read the rest of the rdevs to update recovery_offset. If recovery_offset is equal to MaxSector, call spare_active() to set it In_sync This required that recovery_offset be initialized to MaxSector, as opposed to zero so as to communicate the end of sync for a rdev. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
b8ca846e |
|
09-Oct-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Wake up suspended process When the suspended_area is deleted, the suspended processes must be woken up in order to complete their I/O. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
09995411 |
|
30-Sep-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster Previously, BITMAP_NEEDS_SYNC message is sent when the resyc aborts, but it could abort for different reasons, and not all of reasons require another node to take over the resync ownship. It is better make BITMAP_NEEDS_SYNC message only be sent when the node is leaving cluster with dirty bitmap. And we also need to ensure dlm connection is ok. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
c40f341f |
|
18-Aug-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: Use a small window for resync Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
9ed38ff5 |
|
13-Aug-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
md-cluster: complete all write requests before adding suspend_info process_suspend_info - which handles the RESYNCING request - must not reply until all writes which were initiated before the request arrived, have completed. As a by-product, all process_* functions now take mddev as their first arguement making it uniform. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
18b9f679 |
|
13-Aug-2015 |
NeilBrown <neilb@suse.com> |
md-cluster: remove inappropriate try_module_get from join() md_setup_cluster already calls try_module_get(), so this try_module_get isn't needed. Also, there is no matching module_put (except in error patch), so this leaves an unbalanced module count. Signed-off-by: NeilBrown <neilb@suse.com>
|
#
abb9b22a |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: Read the disk bitmap sb and check if it needs recovery In gather_all_resync_info, we need to read the disk bitmap sb and check if it needs recovery. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
eece075c |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: only call complete(&cinfo->completion) when node join cluster Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure complete(&cinfo->completion) is only be invoked when node join cluster. Otherwise node failure could also call the complete, and it doesn't make sense to do it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
6e6d9f2c |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: add missed lockres_free We also need to free the lock resource before goto out. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
b2b9bfff |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: remove the unused sb_lock The sb_lock is not used anywhere, so let's remove it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
9e3072e3 |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: init suspend_list and suspend_lock early in join If the node just join the cluster, and receive the msg from other nodes before init suspend_list, it will cause kernel crash due to NULL pointer dereference, so move the initializations early to fix the bug. md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3 BUG: unable to handle kernel NULL pointer dereference at (null) ... ... ... Call Trace: [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster] [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster] [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod] [<ffffffff810768c4>] kthread+0xb4/0xc0 [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0 ... ... ... RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster] Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
b5ef5678 |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: add the error check if failed to get dlm lock In complicated cluster environment, it is possible that the dlm lock couldn't be get/convert on purpose, the related err info is added for better debug potential issue. For lockres_free, if the lock is blocking by a lock request or conversion request, then dlm_unlock just put it back to grant queue, so need to ensure the lock is free finally. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
b83d51c0 |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: init completion within lockres_init We should init completion within lockres_init, otherwise completion could be initialized more than one time during it's life cycle. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
66099bb0 |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: fix deadlock issue on message lock There is problem with previous communication mechanism, and we got below deadlock scenario with cluster which has 3 nodes. Sender Receiver Receiver token(EX) message(EX) writes message downconverts message(CR) requests ack(EX) get message(CR) gets message(CR) reads message reads message requests EX on message requests EX on message To fix this problem, we do the following changes: 1. the sender downconverts MESSAGE to CW rather than CR. 2. and the receiver request PR lock not EX lock on message. And in case we failed to down-convert EX to CW on message, it is better to unlock message otherthan still hold the lock. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Lidong Zhong <ldzhong@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
dc737d7c |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: transfer the resync ownership to another node When node A stops an array while the array is doing a resync, we need to let another node B take over the resync task. To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC message to the cluster. And the node B which received that message will invoke __recover_slot to do resync. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
05cd0e51 |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: split recover_slot for future code reuse Make recover_slot as a wraper to __recover_slot, since the logic of __recover_slot can be reused for the condition when other nodes need to take over the resync job. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
b89f704a |
|
10-Jul-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: use %pU to print UUIDs Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
90382ed9 |
|
24-Jun-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Fix read-balancing during node failure During a node failure, We need to suspend read balancing so that the reads are directed to the first device and stale data is not read. Suspending writes is not required because these would be recorded and synced eventually. A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep(). area_resyncing() will respond true for the entire devices if this flag is set and the request type is READ. The flag is cleared in recover_done(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reported-By: David Teigland <teigland@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
97f6cd39 |
|
14-Apr-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.de> |
md-cluster: re-add capabilities When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
88bcfef7 |
|
14-Apr-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.de> |
md-cluster: remove capabilities This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
8c58f02e |
|
21-Apr-2015 |
Guoqing Jiang <gqjiang@suse.com> |
md-cluster: correct the num for comparison Since the node num of md-cluster is from zero, and cinfo->slot_number represents the slot num of dlm, no need to check for equality. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
09dd1af2 |
|
27-Feb-2015 |
kbuild test robot <fengguang.wu@intel.com> |
md/cluster: Communication Framework: fix semicolon.cocci warnings drivers/md/md-cluster.c:328:2-3: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
6dc69c9c |
|
27-Feb-2015 |
kbuild test robot <fengguang.wu@intel.com> |
md: recover_bitmaps() can be static drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static? Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
fa8259da |
|
02-Mar-2015 |
Goldwyn Rodrigues <rgoldwyn@suse.de> |
md: Fix stray --cluster-confirm crash A --cluster-confirm without an --add (by another node) can crash the kernel. Fix it by guarding it using a state. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
1aee41f6 |
|
29-Oct-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Add new disk to clustered array Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
589a1c49 |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Suspend writes in RAID1 if within range If there is a resync going on, all nodes must suspend writes to the range. This is recorded in the suspend_info/suspend_list. If there is an I/O within the ranges of any of the suspend_info, should_suspend will return 1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
e59721cc |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Resync start/Finish actions When a RESYNC_START message arrives, the node removes the entry with the current slot number and adds the range to the suspend_list. Simlarly, when a RESYNC_FINISHED message is received, node clears entry with respect to the bitmap number. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
965400eb |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Send RESYNCING while performing resync start/stop When a resync is initiated, RESYNCING message is sent to all active nodes with the range (lo,hi). When the resync is over, a RESYNCING message is sent with (0,0). A high sector value of zero indicates that the resync is over. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
1d7e3e96 |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Reload superblock if METADATA_UPDATED is received Re-reads the devices by invalidating the cache. Since we don't write to faulty devices, this is detected using events recorded in the devices. If it is old as compared to the mddev mark it is faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
293467aa |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
metadata_update sends message to other nodes - request to send a message - make changes to superblock - send messages telling everyone that the superblock has changed - other nodes all read the superblock - other nodes all ack the messages - updating node release the "I'm sending a message" resource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
601b515c |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Communication Framework: Sending functions The sending part is split in two functions to make sure atomicity of the operations, such as the MD superblock update. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
4664680c |
|
07-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Communication Framework: Receiving 1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has *processed* the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
4b26a08a |
|
06-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Perform resync for cluster node failure If bitmap_copy_slot returns hi>0, we need to perform resync. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
e94987db |
|
06-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Initiate recovery on node failure The DLM informs us in case of node failure with the DLM slot number. cluster_info->recovery_map sets the bit corresponding to the slot number and wakes up the recovery thread. The recovery thread: 1. Derives the slot number from the recovery_map 2. Locks the bitmap corresponding to the slot 3. Copies the set bits to the node-local bitmap Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
96ae923a |
|
05-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Gather on-going resync information of other nodes When a node joins, it does not know of other nodes performing resync. So, each node keeps the resync information in it's LVB. When a new node joins, it reads the LVB of each "online" bitmap. [TODO] The new node attempts to get the PW lock on other bitmap, if it is successful, it reads the bitmap and performs the resync (if required) on it's behalf. If the node does not get the PW, it requests CR and reads the LVB for the resync information. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
54519c5f |
|
05-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Lock bitmap while joining the cluster Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
b97e9257 |
|
06-Jun-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Use separate bitmaps for each nodes in the cluster On-disk format: 0 4k 8k 12k ------------------------------------------------------------------- | idle | md super | bm super [0] + bits | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | bm bits [3, contd] | | | Bitmap super has a field nodes, which defines the maximum number of nodes the device can use. While reading the bitmap super, if the cluster finds out that the number of nodes is > 0: 1. Requests the md-cluster module. 2. Calls md_cluster_ops->join(), which sets up clustering such as joining DLM lockspace. Since the first time, the first bitmap is read. After the call to the cluster_setup, the bitmap offset is adjusted and the superblock is re-read. This also ensures the bitmap is read the bitmap lock (when bitmap lock is introduced in later patches) Questions: 1. cluster name is repeated in all bitmap supers. Is that okay? Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
cf921cc1 |
|
29-Mar-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Add node recovery callbacks DLM offers callbacks when a node fails and the lock remastery is performed: 1. recover_prep: called when DLM discovers a node is down 2. recover_slot: called when DLM identifies the node and recovery can start 3. recover_done: called when all nodes have completed recover_slot recover_slot() and recover_done() are also called when the node joins initially in order to inform the node with its slot number. These slot numbers start from one, so we deduct one to make it start with zero which the cluster-md code uses. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
c4ce867f |
|
29-Mar-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Introduce md_cluster_info md_cluster_info stores the cluster information in the MD device. The join() is called when mddev detects it is a clustered device. The main responsibilities are: 1. Setup a DLM lockspace 2. Setup all initial locks such as super block locks and bitmap lock (will come later) The leave() clears up the lockspace and all the locks held. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
edb39c9d |
|
29-Mar-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Introduce md_cluster_operations to handle cluster functions This allows dynamic registering of cluster hooks. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
47741b7c |
|
07-Mar-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
DLM lock and unlock functions A dlm_lock_resource is a structure which contains all information required for locking using DLM. The init function allocates the lock and acquires the lock in NL mode. The unlock function converts the lock resource to NL mode. This is done to preserve LVB and for faster processing of locks. The lock resource is DLM unlocked only in the lockres_free function, which is the end of life of the lock resource. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
#
8e854e9c |
|
07-Mar-2014 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
Create a separate module for clustering support Tagged as EXPERIMENTAL for now. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|