Cross Reference: /linux-master/drivers/md/raid5-cache.c

History log of /linux-master/drivers/md/raid5-cache.c
Revision	Date	Author	Comments
# ad860670	25-Nov-2023	Yu Kuai <yukuai3@huawei.com>	md/raid5: remove rcu protection to access rdev from conf Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid456. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-5-yukuai1@huaweicloud.com
# 2b16a525	10-Oct-2023	Yu Kuai <yukuai3@huawei.com>	md: rename __mddev_suspend/resume() back to mddev_suspend/resume() Now that the old apis are removed, __mddev_suspend/resume() can be renamed to their original names. This is done by: sed -i "s/__mddev_suspend/mddev_suspend/g" .[ch] sed -i "s/__mddev_resume/mddev_resume/g" .[ch] Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com
# 1b172e0b	10-Oct-2023	Yu Kuai <yukuai3@huawei.com>	md/raid5-cache: use new apis to suspend array Convert to use new apis, the old apis will be removed eventually. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-9-yukuai1@huaweicloud.com
# 06a4d0d8	10-Oct-2023	Yu Kuai <yukuai3@huawei.com>	md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log' 'conf->log' is set with 'reconfig_mutex' grabbed, however, readers are not procted, hence protect it with READ_ONCE/WRITE_ONCE to prevent reading abnormal values. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-3-yukuai1@huaweicloud.com
# 0d0bd28c	08-Aug-2023	Yu Kuai <yukuai3@huawei.com>	md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is empty, and then submit 'flush_bio', however, r5l_log_flush_endio() is clearing the list first and then clear the bio, which will cause null-ptr-deref: T1: submit flush io raid5d handle_active_stripes r5l_flush_stripe_to_raid // list is empty // add 'io_end_ios' to the list bio_init submit_bio // io1 T2: io1 is done r5l_log_flush_endio list_splice_tail_init // clear the list T3: submit new flush io ... r5l_flush_stripe_to_raid // list is empty // add 'io_end_ios' to the list bio_init bio_uninit // clear bio->bi_blkg submit_bio // null-ptr-deref Fix this problem by clearing bio before clearing the list in r5l_log_flush_endio(). Fixes: 0dd00cba99c3 ("raid5-cache: fully initialize flush_bio when needed") Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org> Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
# 7eb8ff02	03-Aug-2023	Li Lingfeng <lilingfeng3@huawei.com>	md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
# a705b11b	08-Jul-2023	Yu Kuai <yukuai3@huawei.com>	md/raid5-cache: fix a deadlock in r5l_exit_log() Commit b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") introduce a new problem: // caller hold reconfig_mutex r5l_exit_log flush_work(&log->disable_writeback_work) r5c_disable_writeback_async wait_event /* * conf->log is not NULL, and mddev_trylock() * will fail, wait_event() can never pass. */ conf->log = NULL Fix this problem by setting 'config->log' to NULL before wake_up() as it used to be, so that wait_event() from r5c_disable_writeback_async() can exist. In the meantime, move forward md_unregister_thread() so that null-ptr-deref this commit fixed can still be fixed. Fixes: b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
# 44693154	22-May-2023	Yu Kuai <yukuai3@huawei.com>	md: protect md_thread with rcu Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
# b0a2f17c	31-May-2023	Johannes Thumshirn <johannes.thumshirn@wdc.com>	md: raid5-log: use __bio_add_page to add single page The raid5 log metadata submission code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked. Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed. This brings us a step closer to marking bio_add_page() as __must_check. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/832a810d6c9e71f88b0a39cb076a8c70e8bcb821.1685532726.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
# ad831a16	09-Nov-2022	Christoph Hellwig <hch@lst.de>	md/raid5: use bdev_write_cache instead of open coding it Use the bdev_write_cache instead of two equivalent open coded checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
# 9487a0f6	20-Oct-2022	Uros Bizjak <ubizjak@gmail.com>	raid5-cache: use try_cmpxchg in r5l_wake_reclaim Use try_cmpxchg instead of cmpxchg (ptr, old, new) == old in r5l_wake_reclaim. 86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Cc: Song Liu <song@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
# a251c17a	05-Oct-2022	Jason A. Donenfeld <Jason@zx2c4.com>	treewide: use get_random_u32() when possible The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
# 65b94b52	19-Sep-2022	Zhou nan <zhounan@nfschina.com>	md: Fix spelling mistake in comments of r5l_log Fix spelling of dones't in comments. Signed-off-by: Zhou nan <zhounan@nfschina.com> Signed-off-by: Song Liu <song@kernel.org>
# 2f2d51ef	11-Aug-2022	Logan Gunthorpe <logang@deltatee.com>	md/raid5: Cleanup prototype of raid5_get_active_stripe() Drop the three bools in the prototype of raid5_get_active_stripe() and replace them with a flags parameter. At the same time, drop the distinction with __raid5_get_active_stripe(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
# 12ba6676	17-Aug-2022	XU pengfei <xupengfei@nfschina.com>	md/raid5: Fix spelling mistakes in comments Fix spelling of 'waitting' in comments. Signed-off-by: XU pengfei <xupengfei@nfschina.com> Signed-off-by: Song Liu <song@kernel.org>
# 6f28c5c3	08-Jun-2022	Logan Gunthorpe <logang@deltatee.com>	md/raid5-cache: Annotate pslot with __rcu notation radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the slot returned and looked up to be marked with __rcu. Otherwise sparse warnings are generated: drivers/md/raid5-cache.c:2939:23: warning: incorrect type in assignment (different address spaces) drivers/md/raid5-cache.c:2939:23: expected void pslot drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# b13015af	08-Jun-2022	Logan Gunthorpe <logang@deltatee.com>	md/raid5-cache: Clear conf->log after finishing work A NULL pointer dereferlence on conf->log is seen randomly with the mdadm test 21raid5cache. Kasan reporst: BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140 Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086 Call Trace: dump_stack_lvl+0x5a/0x74 kasan_report.cold+0x5f/0x1a9 __asan_load8+0x69/0x90 r5l_reclaimable_space+0xf5/0x140 r5l_do_reclaim+0xf4/0x5e0 r5l_reclaim_thread+0x69/0x3b0 md_thread+0x1a2/0x2c0 kthread+0x177/0x1b0 ret_from_fork+0x22/0x30 This is caused by conf->log being cleared in r5l_exit_log() before stopping the reclaim thread. To fix this, clear conf->log after the reclaim_thread is unregistered and after flushing disable_writeback_work. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 7769085c	08-Jun-2022	Logan Gunthorpe <logang@deltatee.com>	md/raid5-cache: Drop RCU usage of conf->log The only place that uses RCU to access conf->log is in r5l_log_disk_error(). This function is mostly used in the IO path and once with mddev_lock() held in raid5_change_consistency_policy(). It is known that the IO will be suspended before the log is freed and r5l_log_exit() is called with the mddev_lock() held. This should mean that conf->log can not be freed while the function is being called, so the RCU protection is not necessary. Drop the rcu_read_lock() as well as the synchronize_rcu() and rcu_assign_pointer() usage. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 78ede6a0	08-Jun-2022	Logan Gunthorpe <logang@deltatee.com>	md/raid5-cache: Take mddev_lock in r5c_journal_mode_show() The mddev->lock spinlock doesn't protect against the removal of conf->log in r5l_exit_log() so conf->log may be freed before it is used. To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4ce4c73f	14-Jul-2022	Bart Van Assche <bvanassche@acm.org>	md/core: Combine two sync_page_io() arguments Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 913cce5a	12-May-2022	Christoph Hellwig <hch@lst.de>	md: remove most calls to bdevname Use the %pg format specifier to save on stack consumption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
# 44abff2c	14-Apr-2022	Christoph Hellwig <hch@lst.de>	block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 70200574	14-Apr-2022	Christoph Hellwig <hch@lst.de>	block: remove QUEUE_FLAG_DISCARD Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 89f94b64	28-Feb-2022	Christoph Hellwig <hch@lst.de>	raid5-cache: statically allocate the recovery ra bio There is no need to preallocate the bio and reset it when use. Just allocate it on-stack and use a bvec places next to the pages used for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
# 0dd00cba	28-Feb-2022	Christoph Hellwig <hch@lst.de>	raid5-cache: fully initialize flush_bio when needed Stop using bio_reset and just initialize the bio fully when needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
# a7c50c94	24-Jan-2022	Christoph Hellwig <hch@lst.de>	block: pass a block_device and opf to bio_reset Pass the block_device that we plan to use this bio for and the operation to bio_reset to optimize the assigment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 49add496	24-Jan-2022	Christoph Hellwig <hch@lst.de>	block: pass a block_device and opf to bio_init Pass the block_device that we plan to use this bio for and the operation to bio_init to optimize the assignment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 609be106	24-Jan-2022	Christoph Hellwig <hch@lst.de>	block: pass a block_device and opf to bio_alloc_bioset Pass the block_device and operation that we plan to use this bio for to bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a8affc03	10-Mar-2021	Christoph Hellwig <hch@lst.de>	block: rename BIO_MAX_PAGES to BIO_MAX_VECS Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 01b5d32a	27-Jul-2020	Guoqing Jiang <guoqing.jiang@cloud.ionos.com>	raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show Replace mddev_lock with spin_lock to align with other show methods in raid5_attrs. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
# c911c46c	18-Jul-2020	Yufen Yu <yuyufen@huawei.com>	md/raid456: convert macro STRIPE_* to RAID5_STRIPE_* Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT(). This patch is prepare for the following adjustable stripe_size. It will not change any existing functionality. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
# 52923083	15-Jul-2020	Damien Le Moal <damien.lemoal@wdc.com>	md: raid5-cache: Remove set but unused variable Remove the variable offset in r5c_tree_index() to avoid a "set but not used" compilation warning when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
# c9020e64	06-Jul-2020	Song Liu <songliubraving@fb.com>	md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes In recovery, if we process too much data, raid5-cache may set MD_SB_CHANGE_PENDING, which causes spinning in handle_stripe(). Fix this issue by clearing the bit before flushing data only stripes. This issue was initially discussed in [1]. [1] https://www.spinics.net/lists/raid/msg64409.html Signed-off-by: Song Liu <songliubraving@fb.com>
# 2025cf9e	29-May-2019	Thomas Gleixner <tglx@linutronix.de>	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms and conditions of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 263 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
# 483cbbed	27-Mar-2018	Alexei Naberezhnov <anaberezhnov@fb.com>	md/raid5: fix 'out of memory' during raid cache recovery This fixes the case when md array assembly fails because of raid cache recovery unable to allocate a stripe, despite attempts to replay stripes and increase cache size. This happens because stripes released by r5c_recovery_replay_stripes and raid5_set_cache_size don't become available for allocation immediately. Released stripes first are placed on conf->released_stripes list and require md thread to merge them on conf->inactive_list before they can be allocated. Patch allows final allocation attempt during cache recovery to wait for new stripes to become availabe for allocation. Cc: linux-raid@vger.kernel.org Cc: Shaohua Li <shli@kernel.org> Cc: linux-stable <stable@vger.kernel.org> # 4.10+ Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1") Signed-off-by: Alexei Naberezhnov <anaberezhnov@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com>
# 116d99ad	08-Oct-2018	Colin Ian King <colin.king@canonical.com>	md: remove redundant code that is no longer reachable And earlier commit removed the error label to two statements that are now never reachable. Since this code is now dead code, remove it. Detected by CoverityScan, CID#1462409 ("Structurally dead code") Fixes: d5d885fd514f ("md: introduce new personality funciton start()") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
# e64e4018	01-Aug-2018	Andy Shevchenko <andriy.shevchenko@linux.intel.com>	md: Avoid namespace collision with bitmap API bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
# ebc7709f	03-Jul-2018	Colin Ian King <colin.king@canonical.com>	md/r5cache: remove redundant pointer bio Pointer bio is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'bio' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
# afeee514	20-May-2018	Kent Overstreet <kent.overstreet@gmail.com>	md: convert to bioset_init()/mempool_init() Convert md to embedded bio sets. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 1532d9e8	27-Dec-2017	Tomasz Majchrzak <tomasz.majchrzak@intel.com>	raid5-ppl: PPL support for disks with write-back cache enabled In order to provide data consistency with PPL for disks with write-back cache enabled all data has to be flushed to disks before next PPL entry. The disks to be flushed are marked in the bitmap. It's modified under a mutex and it's only read after PPL io unit is submitted. A limitation of 64 disks in the array has been introduced to keep data structures and implementation simple. RAID5 arrays with so many disks are not likely due to high risk of multiple disks failure. Such restriction should not be a real life limitation. With write-back cache disabled next PPL entry is submitted when data write for current one completes. Data flush defers next log submission so trigger it when there are no stripes for handling found. As PPL assures all data is flushed to disk at request completion, just acknowledge flush request when PPL is enabled. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
# 92e6245d	19-Dec-2017	Song Liu <songliubraving@fb.com>	md/r5cache: print more info of log recovery Log recovery is critical for raid5 journal/cache. Printing information about each recovery by default will help the system admin monitor the status of the array. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# d5d885fd	19-Nov-2017	Song Liu <songliubraving@fb.com>	md: introduce new personality funciton start() In do_md_run(), md threads should not wake up until the array is fully initialized in md_run(). However, in raid5_run(), raid5-cache may wake up mddev->thread to flush stripes that need to be written back. This design doesn't break badly right now. But it could lead to bad bug in the future. This patch tries to resolve this problem by splitting start up work into two personality functions, run() and start(). Tasks that do not require the md threads should go into run(), while task that require the md threads go into start(). r5l_load_log() is moved to raid5_start(), so it is not called until the md threads are started in do_md_run(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# ff35f58e	19-Nov-2017	Song Liu <songliubraving@fb.com>	md/r5cache: move mddev_lock() out of r5c_journal_mode_set() r5c_journal_mode_set() is called by r5c_journal_mode_store() and raid_ctr() in dm-raid. We don't need mddev_lock() when calling from raid_ctr(). This patch fixes this by moves the mddev_lock() to r5c_journal_mode_store(). Cc: stable@vger.kernel.org (v4.13+) Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# efa4b77b	18-Oct-2017	Shaohua Li <shli@fb.com>	md: use lockdep_assert_held lockdep_assert_held is a better way to assert lock held, and it works for UP. Signed-off-by: Shaohua Li <shli@fb.com>
# b03e0ccb	18-Oct-2017	NeilBrown <neilb@suse.com>	md: remove special meaning of ->quiesce(.., 2) The '2' argument means "wake up anything that is waiting". This is an inelegant part of the design and was added to help support management of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in mddev_suspend/resume, that need is gone. These is still a couple of places where we call 'quiesce' with an argument of '2', but they can safely be changed to call ->quiesce(.., 1); ->quiesce(.., 0) which achieve the same result at the small cost of pausing IO briefly. This removes a small "optimization" from suspend_{hi,lo}_store, but it isn't clear that optimization served a useful purpose. The code now is a lot clearer. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 4d5324f7	18-Oct-2017	NeilBrown <neilb@suse.com>	md: always hold reconfig_mutex when calling mddev_suspend() Most often mddev_suspend() is called with reconfig_mutex held. Make this a requirement in preparation a subsequent patch. Also require reconfig_mutex to be held for mddev_resume(), partly for symmetry and partly to guarantee no races with incr/decr of mddev->suspend. Taking the mutex in r5c_disable_writeback_async() is a little tricky as this is called from a work queue via log->disable_writeback_work, and flush_work() is called on that while holding ->reconfig_mutex. If the work item hasn't run before flush_work() is called, the work function will not be able to get the mutex. So we use mddev_trylock() inside the wait_event() call, and have that abort when conf->log is set to NULL, which happens before flush_work() is called. We wait in mddev->sb_wait and ensure this is woken when any of the conditions change. This requires waking mddev->sb_wait in mddev_unlock(). This is only like to trigger extra wake_ups of threads that needn't be woken when metadata is being written, and that doesn't happen often enough that the cost would be noticeable. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 935fe098	10-Oct-2017	Mike Snitzer <snitzer@redhat.com>	md: rename some drivers/md/ files to have an "md-" prefix Motivated by the desire to illiminate the imprecise nature of DM-specific patches being unnecessarily sent to both the MD maintainer and mailing-list. Which is born out of the fact that DM files also reside in drivers/md/ Now all MD-specific files in drivers/md/ start with either "raid" or "md-" and the MAINTAINERS file has been updated accordingly. Shaohua: don't change module name Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
# a72cbf83	08-Aug-2017	Song Liu <songliubraving@fb.com>	md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show In r5c_journal_mode_show(), it is necessary to call mddev_lock() before accessing conf and conf->log. Otherwise, the conf->log may change (and become NULL). Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 74d46992	23-Aug-2017	Christoph Hellwig <hch@lst.de>	block: replace bi_bdev with a gendisk pointer and partitions index This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# a9501d74	03-Aug-2017	Song Liu <songliubraving@fb.com>	md/r5cache: fix io_unit handling in r5l_log_endio() In r5l_log_endio(), once log->io_list_lock is released, the io unit may be accessed (or even freed) by other threads. Current code doesn't handle the io_unit properly, which leads to potential race conditions. This patch solves this race condition by: 1. Add a pending_stripe count flush_payload. Multiple flush_payloads are counted as only one pending_stripe. Flag has_flush_payload is added to show whether the io unit has flush_payload; 2. In r5l_log_endio(), check flags has_null_flush and has_flush_payload with log->io_list_lock held. After the lock is released, this IO unit is only accessed when we know the pending_stripe counter cannot be zeroed by other threads. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# b44886c5	31-Jul-2017	Song Liu <songliubraving@fb.com>	md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_set In r5c_journal_mode_set(), it is necessary to call mddev_lock() before accessing conf and conf->log. Otherwise, the conf->log may change (and become NULL). Shaohua: fix unlock in failure cases Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 011067b0	17-Jun-2017	NeilBrown <neilb@suse.com>	blk: replace bioset_create_nobvec() with a flags arg to bioset_create() "flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 4e4cbee9	03-Jun-2017	Christoph Hellwig <hch@lst.de>	block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 5a8948f8	31-May-2017	Jan Kara <jack@suse.cz>	md: Make flush bios explicitely sync Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA\|PREFLUSH\|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: linux-raid@vger.kernel.org CC: Shaohua Li <shli@kernel.org> Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Shaohua Li <shli@fb.com>
# 5ddf0440	11-May-2017	Song Liu <songliubraving@fb.com>	md/r5cache: handle sync with data in write back cache Currently, sync of raid456 array cannot make progress when hitting data in writeback r5cache. This patch fixes this issue by flushing cached data of the stripe before processing the sync request. This is achived by: 1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is in write back cache; 2. In r5c_try_caching_write(), handle the stripe in sync with write through; 3. In do_release_stripe(), make stripe in sync write out and send it to the state machine. Shaohua: explictly set STRIPE_HANDLE after write out completed Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 70d466f7	11-May-2017	Song Liu <songliubraving@fb.com>	md/r5cache: gracefully handle journal device errors for writeback mode For the raid456 with writeback cache, when journal device failed during normal operation, it is still possible to persist all data, as all pending data is still in stripe cache. However, it is necessary to handle journal failure gracefully. During journal failures, the following logic handles the graceful shutdown of journal: 1. raid5_error() marks the device as Faulty and schedules async work log->disable_writeback_work; 2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is suspended, set to write through, and then resumed. mddev_suspend() flushes all cached stripes; 3. All cached stripes need to be flushed carefully to the RAID array. This patch fixes issues within the process above: 1. In r5c_update_on_rdev_error() schedule disable_writeback_work for journal failures; 2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING, since raid5_error() updates superblock. 3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0) to make progress during log_failed; 4. In delay_towrite(), if log failed only process data in the cache (skip new writes in dev->towrite); 5. In __get_priority_stripe(), process loprio_list during journal device failures. 6. In raid5_remove_disk(), wait for all cached stripes are flushed before calling log_exit(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# bb3338d3	08-May-2017	Song Liu <songliubraving@fb.com>	md/raid5-cache: in r5l_do_submit_io(), submit io->split_bio first In r5l_do_submit_io(), it is necessary to check io->split_bio before submit io->current_bio. This is because, endio of current_bio may free the whole IO unit, and thus change io->split_bio. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 78e470c2	22-Mar-2017	Heinz Mauelshagen <heinzm@redhat.com>	md: add raid4/5/6 journal mode switching API Commit 2ded370373a4 ("md/r5cache: State machine for raid5-cache write back mode") added support for "write-back" caching on the raid journal device. In order to allow the dm-raid target to switch between the available "write-through" and "write-back" modes, provide a new r5c_journal_mode_set() API. Use the new API in existing r5c_journal_mode_store() Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Shaohua Li <shli@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
# 1ad45a9b	24-Mar-2017	Jason Yan <yanaijie@huawei.com>	md/raid5-cache: fix payload endianness problem in raid5-cache The payload->header.type and payload->size are little-endian, so just convert them to the right byte order. Signed-off-by: Jason Yan <yanaijie@huawei.com> Cc: <stable@vger.kernel.org> #v4.10+ Signed-off-by: Shaohua Li <shli@fb.com>
# 016c76ac	14-Mar-2017	NeilBrown <neilb@suse.com>	md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter md/raid5 needs to keep track of how many stripe_heads are processing a bio so that it can delay calling bio_endio() until all stripe_heads have completed. It currently uses 16 bits of ->bi_phys_segments for this purpose. 16 bits is only enough for 256M requests, and it is possible for a single bio to be larger than this, which causes problems. Also, the bio struct contains a larger counter, __bi_remaining, which has a purpose very similar to the purpose of our counter. So stop using ->bi_phys_segments, and instead use __bi_remaining. This means we don't need to initialize the counter, as our caller initializes it to '1'. It also means we can call bio_endio() directly as it tests this counter internally. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# bd83d0a2	14-Mar-2017	NeilBrown <neilb@suse.com>	md/raid5: call bio_endio() directly rather than queueing for later. We currently gather bios that need to be returned into a bio_list and call bio_endio() on them all together. The original reason for this was to avoid making the calls while holding a spinlock. Locking has changed a lot since then, and that reason is no longer valid. So discard return_io() and various return_bi lists, and just call bio_endio() directly as needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 49728050	14-Mar-2017	NeilBrown <neilb@suse.com>	md/raid5: use md_write_start to count stripes, not bios We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requsts with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# ea17481f	09-Mar-2017	Song Liu <songliubraving@fb.com>	md/r5cache: generate R5LOG_PAYLOAD_FLUSH In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to log->current_io. Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in quiesce. Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload. However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH, which is simpler. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 2d4f4687	07-Mar-2017	Song Liu <songliubraving@fb.com>	md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery. Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush finish. When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity will be dropped from recovery. This will reduce the number of stripes to replay, and thus accelerate the recovery process. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# ff875738	09-Mar-2017	Artur Paszkiewicz <artur.paszkiewicz@intel.com>	raid5: separate header for log functions Move raid5-cache declarations from raid5.h to raid5-log.h, add inline wrappers for functions which will be shared with ppl and use them in raid5 core instead of direct calls to raid5-cache. Remove unused parameter from r5c_cache_data(), move two duplicated pr_debug() calls to r5l_init_log(). Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
# effe6ee7	07-Mar-2017	Song Liu <songliubraving@fb.com>	md/r5cache: improve recovery with read ahead page pool In r5cache recovery, the journal device is scanned page by page. Currently, we use sync_page_io() to read journal device. This is not efficient when we have to recovery many stripes from the journal. To improve the speed of recovery, this patch introduces a read ahead page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive pages are read in one IO. Then the recovery code read the journal from ra_pool. With ra_pool, r5l_recovery_ctx has become much bigger. Therefore, r5l_recovery_log() is refactored so r5l_recovery_ctx is not using stack space. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 84890c03	15-Feb-2017	Shaohua Li <shli@fb.com>	md/raid5-cache: bump flush stripe batch size Bump the flush stripe batch size to 2048. For my 12 disks raid array, the stripes takes: 12 * 4k * 2048 = 96MB This is still quite small. A hardware raid card generally has 1GB size, which we suggest the raid5-cache has similar cache size. The advantage of a big batch size is we can dispatch a lot of IO in the same time, then we can do some scheduling to make better IO pattern. Last patch prioritizes stripes, so we don't worry about a big flush stripe batch will starve normal stripes. Signed-off-by: Shaohua Li <shli@fb.com>
# e33fbb9c	10-Feb-2017	Shaohua Li <shli@fb.com>	md/raid5-cache: exclude reclaiming stripes in reclaim check stripes which are being reclaimed are still accounted into cached stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the stripes and does unnecessary stripe reclaim. In practice, I saw one stripe is reclaimed one time. This will cause bad IO pattern. Fixing this by excluding the reclaing stripes in the check. Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# e8fd52ee	10-Feb-2017	Shaohua Li <shli@fb.com>	md/raid5-cache: stripe reclaim only counts valid stripes When log space is tight, we try to reclaim stripes from log head. There are stripes which can't be reclaimed right now if some conditions are met. We skip such stripes but accidentally count them, which might cause no stripes are claimed. Fixing this by only counting valid stripes. Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 39b99586	24-Jan-2017	Song Liu <songliubraving@fb.com>	md/r5cache: improve journal device efficiency It is important to be able to flush all stripes in raid5-cache. Therefore, we need reserve some space on the journal device for these flushes. If flush operation includes pending writes to the stripe, we need to reserve (conf->raid_disk + 1) pages per stripe for the flush out. This reduces the efficiency of journal space. If we exclude these pending writes from flush operation, we only need (conf->max_degraded + 1) pages per stripe. With this patch, when log space is critical (R5C_LOG_CRITICAL=1), pending writes will be excluded from stripe flush out. Therefore, we can reduce reserved space for flush out and thus improve journal device efficiency. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 03b047f4	11-Jan-2017	Song Liu <songliubraving@fb.com>	md/r5cache: enable chunk_aligned_read with write back cache Chunk aligned read significantly reduces CPU usage of raid456. However, it is not safe to fully bypass the write back cache. This patch enables chunk aligned read with write back cache. For chunk aligned read, we track stripes in write back cache at a bigger granularity, "big_stripe". Each chunk may contain more than one stripe (for example, a 256kB chunk contains 64 4kB-page, so this chunk contain 64 stripes). For chunk_aligned_read, these stripes are grouped into one big_stripe, so we only need one lookup for the whole chunk. For each big_stripe, struct big_stripe_info tracks how many stripes of this big_stripe are in the write back cache. We count how many stripes of this big_stripe are in the write back cache. These counters are tracked in a radix tree (big_stripe_tree). r5c_tree_index() is used to calculate keys for the radix tree. chunk_aligned_read() calls r5c_big_stripe_cached() to look up big_stripe of each chunk in the tree. If this big_stripe is in the tree, chunk_aligned_read() aborts. This look up is protected by rcu_read_lock(). It is necessary to remember whether a stripe is counted in big_stripe_tree. Instead of adding new flag, we reuses existing flags: STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these two flags are set, the stripe is counted in big_stripe_tree. This requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to r5c_try_caching_write(); and moving clear_bit of STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to r5c_finish_stripe_write_out(). Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 2e38a37f	24-Jan-2017	Song Liu <songliubraving@fb.com>	md/r5cache: disable write back for degraded array write-back cache in degraded mode introduces corner cases to the array. Although we try to cover all these corner cases, it is safer to just disable write-back cache when the array is in degraded mode. In this patch, we disable writeback cache for degraded mode: 1. On device failure, if the array enters degraded mode, raid5_error() will submit async job r5c_disable_writeback_async to disable writeback; 2. In r5c_journal_mode_store(), it is invalid to enable writeback in degraded mode; 3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled in write-through mode. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# a85dd7b8	23-Jan-2017	Song Liu <songliubraving@fb.com>	md/r5cache: flush data only stripes in r5l_recovery_log() For safer operation, all arrays start in write-through mode, which has been better tested and is more mature. And actually the write-through/write-mode isn't persistent after array restarted, so we always start array in write-through mode. However, if recovery found data-only stripes before the shutdown (from previous write-back mode), it is not safe to start the array in write-through mode, as write-through mode can not handle stripes with data in write-back cache. To solve this problem, we flush all data-only stripes in r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with empty cache in write-through mode. This logic is implemented in r5c_recovery_flush_data_only_stripes(): 1. enable write back cache 2. flush all stripes 3. wake up conf->mddev->thread 4. wait for all stripes get flushed (reuse wait_for_quiescent) 5. disable write back cache The wait in 4 will be waked up in release_inactive_stripe_list() when conf->active_stripes reaches 0. It is safe to wake up mddev->thread here because all the resource required for the thread has been initialized. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 86aa1397	12-Jan-2017	Song Liu <songliubraving@fb.com>	md/r5cache: read data into orig_page for prexor of cached data With write back cache, we use orig_page to do prexor. This patch makes sure we read data into orig_page for it. Flag R5_OrigPageUPTDODATE is added to show whether orig_page has the latest data from raid disk. We introduce a helper function uptodate_for_rmw() to simplify the a couple conditions in handle_stripe_dirtying(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# d46d29f0	11-Jan-2017	Shaohua Li <shli@fb.com>	md/raid5-cache: delete meaningless code sector_t is unsigned long, it's never < 0 Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Shaohua Li <shli@fb.com>
# 99f17890	22-Dec-2016	Colin Ian King <colin.king@canonical.com>	md/r5cache: fix spelling mistake on "recoverying" Trivial fix to spelling mistake "recoverying" to "recovering" in pr_dbg message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
# d2250f10	14-Dec-2016	Song Liu <songliubraving@fb.com>	md/r5cache: assign conf->log before r5l_load_log() r5l_load_log() calls functions that requires a proper conf->log, for example, r5c_is_writeback(). Therefore, we should set conf->log before calling r5l_load_log(). If r5l_load_log() fails, conf->log is set back to NULL. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 3c66abba	14-Dec-2016	Song Liu <songliubraving@fb.com>	md/r5cache: simplify handling of sh->log_start in recovery We only need to update sh->log_start at the end of recovery, which is r5c_recovery_rewrite_data_only_stripes(), so it is not necessary to set it before that. In this patch, log_start is removed from r5c_recovery_alloc_stripe(). After updating all sh->log_start, rewrite_data_only_stripes() also updates log->next_checkpoints to the last sh->log_start. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 28ca833e	12-Dec-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: removes unnecessary write-through mode judgments The write-through mode has been returned in front of the function, do not need to do it again. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 2953079c	08-Dec-2016	Shaohua Li <shli@fb.com>	md: separate flags for superblock changes The mddev->flags are used for different purposes. There are a lot of places we check/change the flags without masking unrelated flags, we could check/change unrelated flags. These usage are most for superblock write, so spearate superblock related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 3c6edc66	07-Dec-2016	Song Liu <songliubraving@fb.com>	md/r5cache: after recovery, increase journal seq by 10000 Currently, we increase journal entry seq by 10 after recovery. However, this is not sufficient in the following case. After crash the journal looks like \| seq+0 \| +1 \| +2 \| +3 \| +4 \| +5 \| +6 \| +7 \| ... \| +11 \| +12 \| If +1 is not valid, we dropped all entries from +1 to +12; and write seq+10: \| seq+0 \| +10 \| +2 \| +3 \| +4 \| +5 \| +6 \| +7 \| ... \| +11 \| +12 \| However, if we write a big journal entry with seq+11, it will connect with some stale journal entry: \| seq+0 \| +10 \| +11 \| +12 \| To reduce the risk of this issue, we increase seq by 10000 instead. Shaohua: use 10000 instead of 1000. The risk should be very unlikely. The total stripe cache size is less than 2k typically, and several stripes can fit into one meta data block. So the total inflight meta data blocks would be quite small, which means the the total sequence number used should be quite small. The 10000 sequence number increase should be far more than safe. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 5c88f403	07-Dec-2016	Song Liu <songliubraving@fb.com>	md/raid5-cache: fix crc in rewrite_data_only_stripes() r5l_recovery_create_empty_meta_block() creates crc for the empty metablock. After the metablock is updated, we need clear the checksum before recalculate it. Shaohua: moved checksum calculation out of r5l_recovery_create_empty_meta_block. We should calculate it after all fields are updated. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# d30dfeb9	07-Dec-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: no recovery is required when create super-block When create the super-block information, We do not need to do this recovery stage, only need to initialize some variables. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 3d7e7e1d	04-Dec-2016	Zhengyuan Liu <liuzhengyuan@kylinos.cn>	md/r5cache: do r5c_update_log_state after log recovery We should update log state after we did a log recovery, current completion may get wrong log state since log->log_start wasn't initalized until we called r5l_recovery_log. At log recovery stage, no lock needed as there is no race conditon. next_checkpoint field will be initialized in r5l_recovery_log too. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# 43b96748	04-Dec-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: adjust the write position of the empty block if no data blocks When recovery is complete, we write an empty block and record his position first, then make the data-only stripes rewritten done, the location of the empty block as the last checkpoint position to write into the super block. And we should update last_checkpoint to this empty block position. ------------------------------------------------------------------ \| old log \| empty block \| data only stripes \| invalid log \| ------------------------------------------------------------------ ^ ^ ^ \| \|- log->last_checkpoint \|- log->log_start \| \|- log->last_cp_seq \|- log->next_checkpoint \|- log->seq=n \|- log->seq=10+n At the same time, if there is no data-only stripes, this scene may appear, \| meta1 \| meta2 \| meta3 \| meta 1 is valid, meta 2 is invalid. meta 3 could be valid. so we should The solution is we create a new meta in meta2 with its seq == meta1's seq + 10 and let superblock points to meta2. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# f687a33e	30-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0 With writeback cache, we define log space critical as free_space < 2 * reclaim_required_space So the deassert of R5C_LOG_CRITICAL could happen when 1. free_space increases 2. reclaim_required_space decreases Currently, run_no_space_stripes() is called when 1 happens, but not (always) when 2 happens. With this patch, run_no_space_stripes() is call when R5C_LOG_CRITICAL is cleared. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 1a0ec5c3	28-Nov-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly R5c_make_stripe_write_out has set this flag, do not need to set again. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# dbd22c8d	28-Nov-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log The next_cp_seq field is useless, remove it. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# bc8f167f	28-Nov-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: release the stripe_head at the appropriate location If we released the 'stripe_head' in r5c_recovery_flush_log, ctx->cached_list will both release the data-parity stripes and data-only stripes, which will become empty. And we also need to use the data-only stripes in r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite data-only stripes is done before releasing them. Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# fc833c2a	28-Nov-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: use ring add to prevent overflow 'write_pos' must be protected with 'r5l_ring_add', or it may overflow Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 9b69173e	28-Nov-2016	JackieLiu <liuyun01@kylinos.cn>	md/raid5-cache: remove unnecessary function parameters The function parameter 'recovery_list' is not used in body, we can delete it Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 462eb7d8	25-Nov-2016	Zhengyuan Liu <liuzhengyuan@kylinos.cn>	raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine would release the stripe later and add it into neither r5c_cached_full_stripes list or r5c_cached_partial_stripes list and set correct flag. Reviewed-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# f7b7bee7	25-Nov-2016	Zhengyuan Liu <liuzhengyuan@kylinos.cn>	raid5-cache: add another check conditon before replaying one stripe New stripe that was just allocated has no STRIPE_R5C_CACHING state too, add this check condition could avoid unnecessary replaying for empty stripe. r5l_recovery_replay_one_stripe would reset stripe for any case, delete it to make code more clean. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# d3014e21	24-Nov-2016	Dan Carpenter <dan.carpenter@oracle.com>	md/r5cache: enable IRQs on error path We need to re-enable the IRQs here before returning. Fixes: a39f7afde358 ("md/r5cache: write-out phase and reclaim support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com>
# d7bd398e	23-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: handle alloc_page failure RMW of r5c write back cache uses an extra page to store old data for prexor. handle_stripe_dirtying() allocates this page by calling alloc_page(). However, alloc_page() may fail. To handle alloc_page() failures, this patch adds an extra page to disk_info. When alloc_page fails, handle_stripe() trys to use these pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# ce1ccd07	21-Nov-2016	Shaohua Li <shli@fb.com>	raid5-cache: suspend reclaim thread instead of shutdown There is mechanism to suspend a kernel thread. Use it instead of playing create/destroy game. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>
# 3a83f467	22-Nov-2016	Ming Lei <tom.leiming@gmail.com>	block: bio: pass bvec table to bio_init() Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <axboe@fb.com>
# 3bddb7f8	18-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: handle FLUSH and FUA With raid5 cache, we committing data from journal device. When there is flush request, we need to flush journal device's cache. This was not needed in raid5 journal, because we will flush the journal before committing data to raid disks. This is similar to FUA, except that we also need flush journal for FUA. Otherwise, corruptions in earlier meta data will stop recovery from reaching FUA data. slightly changed the code by Shaohua Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 5aabf7c4	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: r5cache recovery: part 2 1. In previous patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In this patchpatch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# b4c625c6	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: r5cache recovery: part 1 Recovery of write-back cache has different logic to write-through only cache. Specifically, for write-back cache, the recovery need to scan through all active journal entries before flushing data out. Therefore, large portion of the recovery logic is rewritten here. To make the diffs cleaner, we split the rewrite as follows: 1. In this patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In next patch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions With cache feature, there are 2 different scenarios of recovery: 1. Data-Parity stripe: a stripe with complete parity in journal. 2. Data-Only stripe: a stripe with only data in journal (or partial parity). The code differentiate Data-Parity stripe from Data-Only stripe with flag STRIPE_R5C_CACHING. For Data-Parity stripes, we use the same procedure as raid5 journal, where all the data and parity are replayed to the RAID devices. For Data-Only strips, we need to finish complete calculate parity and finish the full reconstruct write or RMW write. For simplicity, in the recovery, we load the stripe to stripe cache. Once the array is started, the stripe cache state machine will handle these stripes through normal write path. r5c_recovery_flush_log contains the main procedure of recovery. The recovery code first scans through the journal and loads data to stripe cache. The code keeps tracks of all these stripes in a list (use sh->lru and ctx->cached_list), stripes in the list are organized in the order of its first appearance on the journal. During the scan, the recovery code assesses each stripe as Data-Parity or Data-Only. During scan, the array may run out of stripe cache. In these cases, the recovery code will also call raid5_set_cache_size to increase stripe cache size. If the array still runs out of stripe cache because there isn't enough memory, the array will not assemble. At the end of scan, the recovery code replays all Data-Parity stripes, and sets proper states for Data-Only stripes. The recovery code also increases seq number by 10 and rewrites all Data-Only stripes to journal. This is to avoid confusion after repeated crashes. More details is explained in raid5-cache.c before r5c_recovery_rewrite_data_only_stripes(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 9ed988f5	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: refactoring journal recovery code 1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block(); 2. pull the code that initialize r5l_meta_block from r5l_log_write_empty_meta_block() to a separate function r5l_recovery_create_empty_meta_block(), so that we can reuse this piece of code. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 2c7da14b	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: sysfs entry journal_mode With write cache, journal_mode is the knob to switch between write-back and write-through. Below is an example: root@virt-test:~/# cat /sys/block/md0/md/journal_mode [write-through] write-back root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode root@virt-test:~/# cat /sys/block/md0/md/journal_mode write-through [write-back] Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# a39f7afd	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: write-out phase and reclaim support There are two limited resources, stripe cache and journal disk space. For better performance, we priotize reclaim of full stripe writes. To free up more journal space, we free earliest data on the journal. In current implementation, reclaim happens when: 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim if there is no reclaim in the past 5 seconds. 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes, or cached stripes is enough for a full stripe (chunk size / 4k) (r5c_check_cached_full_stripe) 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage) 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data) r5c_do_reclaim() contains new logic of reclaim. For stripe cache: When stripe cache pressure is high (more than 3/4 stripes are cached, or there is empty inactive lists), flush all full stripe. If fewer than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes are flushed, flush some paritial stripes. When stripe cache pressure is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes. For log space: To avoid deadlock due to log space, we need to reserve enough space to flush cached data. The size of required log space depends on total number of cached stripes (stripe_in_journal_count). In current implementation, the writing-out phase automatically include pending data writes with parity writes (similar to write through case). Therefore, we need up to (conf->raid_disks + 1) pages for each cached stripe (1 page for meta data, raid_disks pages for all data and parity). r5c_log_required_to_flush_cache() calculates log space required to flush cache. In the following, we refer to the space calculated by r5c_log_required_to_flush_cache() as reclaim_required_space. Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log device is less than 2x of reclaim_required_space. r5c_cache keeps all data in cache (not fully committed to RAID) in a list (stripe_in_journal_list). These stripes are in the order of their first appearance on the journal. So the log tail (last_checkpoint) should point to the journal_start of the first item in the list. When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is set, the state machine only writes data that are already in the log device (in stripe_in_journal_list). This patch includes a fix to improve performance by Shaohua Li <shli@fb.com>. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 1e6d690b	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: caching phase of r5cache As described in previous patch, write back cache operates in two phases: caching and writing-out. The caching phase works as: 1. write data to journal (r5c_handle_stripe_dirtying, r5c_cache_data) 2. call bio_endio (r5c_handle_data_cached, r5c_return_dev_pending_writes). Then the writing-out phase is as: 1. Mark the stripe as write-out (r5c_make_stripe_write_out) 2. Calcualte parity (reconstruct or RMW) 3. Write parity (and maybe some other data) to journal device 4. Write data and parity to RAID disks This patch implements caching phase. The cache is integrated with stripe cache of raid456. It leverages code of r5l_log to write data to journal device. Writing-out phase of the cache is implemented in the next patch. With r5cache, write operation does not wait for parity calculation and write out, so the write latency is lower (1 write to journal device vs. read and then write to raid disks). Also, r5cache will reduce RAID overhead (multipile IO due to read-modify-write of parity) and provide more opportunities of full stripe writes. This patch adds 2 flags to stripe_head.state: - STRIPE_R5C_PARTIAL_STRIPE, - STRIPE_R5C_FULL_STRIPE, Instead of inactive_list, stripes with cached data are tracked in r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list. STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for stripes in these lists. Note: stripes in r5c_full/partial_stripe_list are not considered as "active". For RMW, the code allocates an extra page for each data block being updated. This is stored in r5dev->orig_page and the old data is read into it. Then the prexor calculation subtracts ->orig_page from the parity block, and the reconstruct calculation adds the ->page data back into the parity block. r5cache naturally excludes SkipCopy. When the array has write back cache, async_copy_data() will not skip copy. There are some known limitations of the cache implementation: 1. Write cache only covers full page writes (R5_OVERWRITE). Writes of smaller granularity are write through. 2. Only one log io (sh->log_io) for each stripe at anytime. Later writes for the same stripe have to wait. This can be improved by moving log_io to r5dev. 3. With writeback cache, read path must enter state machine, which is a significant bottleneck for some workloads. 4. There is no per stripe checkpoint (with r5l_payload_flush) in the log, so recovery code has to replay more than necessary data (sometimes all the log from last_checkpoint). This reduces availability of the array. This patch includes a fix proposed by ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 2ded3703	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: State machine for raid5-cache write back mode This patch adds state machine for raid5-cache. With log device, the raid456 array could operate in two different modes (r5c_journal_mode): - write-back (R5C_MODE_WRITE_BACK) - write-through (R5C_MODE_WRITE_THROUGH) Existing code of raid5-cache only has write-through mode. For write-back cache, it is necessary to extend the state machine. With write-back cache, every stripe could operate in two different phases: - caching - writing-out In caching phase, the stripe handles writes as: - write to journal - return IO In writing-out phase, the stripe behaviors as a stripe in write through mode R5C_MODE_WRITE_THROUGH. STRIPE_R5C_CACHING is added to sh->state to differentiate caching and writing-out phase. Please note: this is a "no-op" patch for raid5-cache write-through mode. The following detailed explanation is copied from the raid5-cache.c: /* * raid5 cache state machine * * With rhe RAID cache, each stripe works in two phases: * - caching phase * - writing-out phase * * These two phases are controlled by bit STRIPE_R5C_CACHING: * if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase * if STRIPE_R5C_CACHING == 1, the stripe is in caching phase * * When there is no journal, or the journal is in write-through mode, * the stripe is always in writing-out phase. * * For write-back journal, the stripe is sent to caching phase on write * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off * the write-out phase by clearing STRIPE_R5C_CACHING. * * Stripes in caching phase do not write the raid disks. Instead, all * writes are committed from the log device. Therefore, a stripe in * caching phase handles writes as: * - write to log device * - return IO * * Stripes in writing-out phase handle writes as: * - calculate parity * - write pending data and parity to journal * - write data and parity to raid disks * - return IO for pending writes */ Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# c757ec95	17-Nov-2016	Song Liu <songliubraving@fb.com>	md/r5cache: Check array size in r5l_init_log Currently, r5l_write_stripe checks meta size for each stripe write, which is not necessary. With this patch, r5l_init_log checks maximal meta size of the array, which is (r5l_meta_block + raid_disks x r5l_payload_data_parity). If this is too big to fit in one page, r5l_init_log aborts. With current meta data, r5l_log support raid_disks up to 203. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 354b445b	16-Nov-2016	Shaohua Li <shli@fb.com>	raid5-cache: fix lockdep warning lockdep reports warning of the rcu_dereference usage. Using normal rdev access pattern to avoid the warning. Signed-off-by: Shaohua Li <shli@fb.com>
# 3fd880af	02-Nov-2016	JackieLiu <liuyun01@kylinos.cn>	raid5-cache: restrict the use area of the log_offset variable We can calculate this offset by using ctx->meta_total_blocks, without passing in from the function Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# 70fd7614	01-Nov-2016	Christoph Hellwig <hch@lst.de>	block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
# 9a8b27fa	27-Oct-2016	Shaohua Li <shli@fb.com>	raid5-cache: correct condition for empty metadata write As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# 56056c2e	24-Oct-2016	Zhengyuan Liu <liuzhengyuan@kylinos.cn>	md/raid5: write an empty meta-block when creating log super-block If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# 28cd88e2	23-Oct-2016	Zhengyuan Liu <liuzhengyuan@kylinos.cn>	md/raid5: initialize next_checkpoint field before use No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
# 8e018c21	25-Aug-2016	Shaohua Li <shli@fb.com>	raid5-cache: fix a deadlock in superblock write There is a potential deadlock in superblock write. Discard could zero data, so before discard we must make sure superblock is updated to new log tail. Updating superblock (either directly call md_update_sb() or depend on md thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all IO finish, hence waitting for reclaim thread, while reclaim thread is calling this function and waitting for reconfig mutex. So there is a deadlock. We workaround this issue with a trylock. The downside of the solution is we could miss discard if we can't take reconfig mutex. But this should happen rarely (mainly in raid array stop), so miss discard shouldn't be a big problem. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# 1eff9d32	05-Aug-2016	Jens Axboe <axboe@fb.com>	block: rename bio bi_rw to bi_opf Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
# 28a8f0d3	05-Jun-2016	Mike Christie <mchristi@redhat.com>	block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 796a5cf0	05-Jun-2016	Mike Christie <mchristi@redhat.com>	md: use bio op accessors Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
# 4e49ea4a	05-Jun-2016	Mike Christie <mchristi@redhat.com>	block/fs/drivers: remove rw argument from submit_bio This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
# 85ad1d13	03-May-2016	Guoqing Jiang <gqjiang@suse.com>	md: set MD_CHANGE_PENDING in a atomic region Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
# c888a8f9	13-Apr-2016	Jens Axboe <axboe@fb.com>	block: kill off q->flush_flags Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries. Signed-off-by: Jens Axboe <axboe@fb.com>
# 16a43f6a	06-Jan-2016	Shaohua Li <shli@fb.com>	raid5-cache: handle journal hotadd in quiesce Handle journal hotadd in quiesce to avoid creating duplicated threads. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# a62ab49e	06-Jan-2016	Shaohua Li <shli@fb.com>	md: set MD_HAS_JOURNAL in correct places Set MD_HAS_JOURNAL when a array is loaded or journal is initialized. This is to avoid the flags set too early in journal disk hotadd. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 5036c390	20-Dec-2015	Christoph Hellwig <hch@lst.de>	raid5: allow r5l_io_unit allocations to fail And propagate the error up the stack so we can add the stripe to no_stripes_list and retry our log operation later. This avoids blocking raid5d due to reclaim, an it allows to get rid of the deadlock-prone GFP_NOFAIL allocation. shli: add missing mempool_destroy() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
# e8deb638	20-Dec-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: use a mempool for the metadata block We only have a limited number in flight, so use a page based mempool. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
# c38d29b3	20-Dec-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: use a bio_set This allows us to make guaranteed forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
# f6b6ec5c	20-Dec-2015	Shaohua Li <shli@fb.com>	raid5-cache: add journal hot add/remove support Add support for journal disk hot add/remove. Mostly trival checks in md part. The raid5 part is a little tricky. For hot-remove, we can't wait pending write as it's called from raid5d. The wait will cause deadlock. We simplily fail the hot-remove. A hot-remove retry can success eventually since if journal disk is faulty all pending write will be failed and finish. For hot-add, since an array supporting journal but without journal disk will be marked read-only, we are safe to hot add journal without stopping IO (should be read IO, while journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# ad66d445	20-Dec-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: free meta_page earlier Once the I/O completed we don't need the meta page anymore. As the iounits can live on for a long time this reduces memory pressure a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 3848c0bc	20-Dec-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: simplify r5l_move_io_unit_list It's only used for one kind of move, so make that explicit. Also clean up the code a bit by using list_for_each_safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 7dde2ad3	08-Oct-2015	Shaohua Li <shli@fb.com>	raid5-cache: start raid5 readonly if journal is missing If raid array is expected to have journal (eg, journal is set in MD superblock feature map) and the array is started without journal disk, start the array readonly. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 6e74a9cf	08-Oct-2015	Shaohua Li <shli@fb.com>	raid5-cache: IO error handling There are 3 places the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining are log write and flush. When the IO error happens, we mark log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too and corresponding daemon can choose setting raid array readonly for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 4b482044	08-Oct-2015	Shaohua Li <shli@fb.com>	raid5-cache: add trim support for log Since superblock is updated infrequently, we do a simple trim of log disk (a synchronous trim) Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 6143e2ce	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: use bio chaining Simplify the bio completion handler by using bio chaining and submitting bios as soon as they are full. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 2b8ef16e	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: small log->seq cleanup Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# c1b99198	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: new helper: r5_reserve_log_entry Factor out code to reserve log space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 51039cd0	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta This is the only user, and keeping all code initializing the io_unit structure together improves readbility. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 1e932a37	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: take rdev->data_offset into account early on Set up bi_sector properly when we allocate an bio instead of updating it at submission time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
# b349feb3	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: refactor bio allocation Split out a helper to allocate a bio for log writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 22581f58	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: clean up r5l_get_meta Remove the only partially used local 'io' variable to simplify the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 56fef7c6	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: simplify state machine when caches flushes are not needed For devices without a volatile write cache we don't need to send a FLUSH command to ensure writes are stable on disk, and thus can avoid the whole step of batching up bios for processing by the MD thread. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# d8858f43	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: factor out a helper to run all stripes for an I/O unit Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 04732f74	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: rename flushed_ios to finished_ios After this series we won't nessecarily have flushed the cache for these I/Os, so give the list a more neutral name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 17036461	05-Oct-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: free I/O units earlier There is no good reason to keep the I/O unit structures around after the stripe has been written back to the RAID array. The only information we need is the log sequence number, and the checkpoint offset of the highest successfull writeback. Store those in the log structure, and free the IO units from __r5l_stripe_write_finished. Besides simplifying the code this also avoid having to keep the allocation for the I/O unit around for a potentially long time as superblock updates that checkpoint the log do not happen very often. This also fixes the previously incorrect calculation of 'free' in r5l_do_reclaim as a side effect: previous if took the last unit which isn't checkpointed into account. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# e6c033f7	04-Oct-2015	Shaohua Li <shli@fb.com>	raid5-cache: move reclaim stop to quiesce Move reclaim stop to quiesce handling, where is safer for this stuff. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 253f9fd4	04-Sep-2015	Shaohua Li <shli@fb.com>	raid5-cache: don't delay stripe captured in log There is a case a stripe gets delayed forever. 1. a stripe finishes construction 2. a new bio hits the stripe 3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set since construction can't run for new bio (the stripe is locked since step 1) Without log, handle_stripe will call ops_run_io. After IO finishes, the stripe gets unlocked and the stripe will restart and run construction for the new bio. With log, ops_run_io need to run two times. If the DELAYED bit set, the stripe can't enter into the handle_list, so the second ops_run_io doesn't run, which leaves the stripe stalled. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 85f2f9a4	04-Sep-2015	Shaohua Li <shli@fb.com>	raid5-cache: check stripe finish out of order stripes could finish out of order. Hence r5l_move_io_unit_list() of __r5l_stripe_write_finished might not move any entry and leave stripe_end_ios list empty. This applies on top of http://marc.info/?l=linux-raid&m=144122700510667 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 828cbe98	02-Sep-2015	Shaohua Li <shli@fb.com>	raid5-cache: optimize FLUSH IO with log enabled With log enabled, bio is written to raid disks after the bio is settled down in log disk. The recovery guarantees we can recovery the bio data from log disk, so we we skip FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 509ffec7	02-Sep-2015	Christoph Hellwig <hch@lst.de>	raid5-cache: move functionality out of __r5l_set_io_unit_state Just keep __r5l_set_io_unit_state as a small set the state wrapper, and remove r5l_set_io_unit_state entirely after moving the real functionality to the two callers that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 0fd22b45	02-Sep-2015	Shaohua Li <shli@fb.com>	raid5-cache: fix a user-after-free bug r5l_compress_stripe_end_list() can free an io_unit. This breaks the assumption only reclaimer can free io_unit. We can add a reference count based io_unit free, but since only reclaim can wait io_unit becoming to STRIPE_END state, we use a simple global wait queue here. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# a8c34f91	02-Sep-2015	Shaohua Li <shli@fb.com>	raid5-cache: switching to state machine for log disk cache flush Before we write stripe data to raid disks, we must guarantee stripe data is settled down in log disk. To do this, we flush log disk cache and wait the flush finish. That wait introduces sleep time in raid5d thread and impact performance. This patch moves the log disk cache flush process to the stripe handling state machine, which can remove the wait in raid5d. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 5cb2fbd6	28-Oct-2015	Shaohua Li <shli@fb.com>	raid5-cache: use crc32c checksum crc32c has lower overhead with cpu acceleration. It's a shame I didn't use it in first post, sorry. This changes disk format, but we are still ok in current stage. V2: delete unnecessary type conversion as pointed out by Bart Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
# 355810d1	13-Aug-2015	Shaohua Li <shli@fb.com>	raid5: log recovery This is the log recovery support. The process is quite straightforward. We scan the log and read all valid meta/data/parity into memory. If a stripe's data/parity checksum is correct, the stripe will be recoveried. Otherwise, it's discarded and we don't scan the log further. The reclaim process guarantees stripe which starts to be flushed raid disks has completed data/parity and has correct checksum. To recovery a stripe, we just copy its data/parity to corresponding raid disks. The trick thing is superblock update after recovery. we can't let superblock point to last valid meta block. The log might look like: \| meta 1\| meta 2\| meta 3\| meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock points to meta 1, we write a new valid meta 2n. If crash happens again, new recovery will start from meta 1. Since meta 2n is valid, recovery will think meta 3 is valid, which is wrong. The solution is we create a new meta in meta2 with its seq == meta 1's seq + 10 and let superblock points to meta2. recovery will not think meta 3 is a valid meta, because its seq is wrong Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# 0576b1c6	13-Aug-2015	Shaohua Li <shli@fb.com>	raid5: log reclaim support This is the reclaim support for raid5 log. A stripe write will have following steps: 1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks 2. hijack ops_run_io. stripe data/parity is appending to log disk 3. flush log disk cache 4. ops_run_io run again and do normal operation. stripe data/parity is written in raid array disks. raid core can return io to upper layer. 5. flush cache of all raid array disks 6. update super block 7. log disk space used by the stripe can be reused In practice, several stripes consist of an io_unit and we will batch several io_unit in different steps, but the whole process doesn't change. It's possible io return just after data/parity hit log disk, but then read IO will need read from log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim run if there is specific reclaimable space (1/4 disk size or 10G) or we are out of space. Reclaim is just to free log disk spaces, it doesn't impact data consistency. The size based force reclaim is to make sure log isn't too big, so recovery doesn't scan log too much. Recovery make sure raid disks and log disk have the same data of a stripe. If crash happens before 4, recovery might/might not recovery stripe's data/parity depending on if data/parity and its checksum matches. In either case, this doesn't change the syntax of an IO write. After step 3, stripe is guaranteed recoverable, because stripe's data/parity is persistent in log disk. In some cases, log disk content and raid disks content of a stripe are the same, but recovery will still copy log disk content to raid disks, this doesn't impact data consistency. space reuse happens after superblock update and cache flush. There is one situation we want to avoid. A broken meta in the middle of a log causes recovery can't find meta at the head of log. If operations require meta at the head persistent in log, we must make sure meta before it persistent in log too. The case is stripe data/parity is in log and we start write stripe to raid disks (before step 4). stripe data/parity must be persistent in log before we do the write to raid disks. The solution is we restrictly maintain io_unit list order. In this case, we only write stripes of an io_unit to raid disks till the io_unit is the first one whose data/parity is in log. The io_unit list order is important for other cases too. For example, some io_unit are reclaimable and others not. They can be mixed in the list, we shouldn't reuse space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
# f6bed0ef	13-Aug-2015	Shaohua Li <shli@fb.com>	raid5: add basic stripe log This introduces a simple log for raid5. Data/parity writing to raid array first writes to the log, then write to raid array disks. If crash happens, we can recovery data from the log. This can speed up raid resync and fix write hole issue. The log structure is pretty simple. Data/meta data is stored in block unit, which is 4k generally. It has only one type of meta data block. The meta data block can track 3 types of data, stripe data, stripe parity and flush block. MD superblock will point to the last valid meta data block. Each meta data block has checksum/seq number, so recovery can scan the log correctly. We store a checksum of stripe data/parity to the metadata block, so meta data and stripe data/parity can be written to log disk together. otherwise, meta data write must wait till stripe data/parity is finished. For stripe data, meta data block will record stripe data sector and size. Currently the size is always 4k. This meta data record can be made simpler if we just fix write hole (eg, we can record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, meta data block will record stripe sector. It's size should be 4k (for raid5) or 8k (for raid6). We always store p parity first. This format should work for caching too. flush block indicates a stripe is in raid array disks. Fixing write hole doesn't need this type of meta data, it's for caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>