#
ad860670 |
|
25-Nov-2023 |
Yu Kuai <yukuai3@huawei.com> |
md/raid5: remove rcu protection to access rdev from conf Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid456. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-5-yukuai1@huaweicloud.com
|
#
86298d8b |
|
11-Sep-2023 |
Qi Zheng <zhengqi.arch@bytedance.com> |
md/raid5: dynamically allocate the md-raid5 shrinker In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the md-raid5 shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct r5conf. Link: https://lkml.kernel.org/r/20230911094444.68966-26-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Song Liu <song@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
44693154 |
|
22-May-2023 |
Yu Kuai <yukuai3@huawei.com> |
md: protect md_thread with rcu Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
|
#
2f088dfc |
|
12-May-2023 |
Kees Cook <keescook@chromium.org> |
md/raid5: Convert stripe_head's "dev" to flexible array member Replace old-style 1-element array of "dev" in struct stripe_head with modern C99 flexible array. In the future, we can additionally annotate it with the run-time size, found in the "disks" member. Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/lkml/20230522212114.gonna.589-kees@kernel.org/ --- It looks like this memory calculation: memory = conf->min_nr_stripes * (sizeof(struct stripe_head) + max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; ... was already buggy (i.e. it included the single "dev" bytes in the result). However, I'm not entirely sure if that is the right analysis, since "dev" is not related to struct bio nor PAGE_SIZE?
|
#
2f2d51ef |
|
11-Aug-2022 |
Logan Gunthorpe <logang@deltatee.com> |
md/raid5: Cleanup prototype of raid5_get_active_stripe() Drop the three bools in the prototype of raid5_get_active_stripe() and replace them with a flags parameter. At the same time, drop the distinction with __raid5_get_active_stripe(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
|
#
9892fa99 |
|
11-Aug-2022 |
Logan Gunthorpe <logang@deltatee.com> |
md/raid5: Drop extern on function declarations in raid5.h externs should not be used in function declarations, so clean those up. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
|
#
20313b1b |
|
27-Jul-2022 |
Logan Gunthorpe <logang@deltatee.com> |
md/raid5: Ensure batch_last is released before sleeping for quiesce A race condition exists where if raid5_quiesce() is called in the middle of a request that has set batch_last, it will deadlock. batch_last will hold a reference to a stripe when raid5_quiesce() is called. This will cause the next raid5_get_active_stripe() call to sleep waiting for the quiesce to finish, but the raid5_quiesce() thread will wait for active_stripes to go to zero which will never happen because request thread is waiting for the quiesce to stop. Fix this by creating a special __raid5_get_active_stripe() function which takes the request context and clears the last_batch before sleeping. While we're at it, change the arguments of raid5_get_active_stripe() to bools. Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch") Reported-by: David Sloan <David.Sloan@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
b0920ede |
|
07-Apr-2022 |
Logan Gunthorpe <logang@deltatee.com> |
md/raid5: Add __rcu annotation to struct disk_info rdev and replacement are protected in some circumstances with rcu_dereference and synchronize_rcu (in raid5_remove_disk()). However, they were not annotated with __rcu so a sparse warning is emitted for every rcu_dereference() call. Add the __rcu annotation and fix up the initialization with RCU_INIT_POINTER, all pointer modifications with rcu_assign_pointer(), a few cases where the pointer value is tested with rcu_access_pointer() and one case where READ_ONCE() is used instead of rcu_dereference(), a case in print_raid5_conf() that should have rcu_dereference() and rcu_read_[un]lock() calls. Additional sparse issues will be fixed up in further commits. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
|
#
3d9a644c |
|
07-Apr-2022 |
Logan Gunthorpe <logang@deltatee.com> |
md/raid5: Un-nest struct raid5_percpu definition Sparse reports many warnings of the form: drivers/md/raid5.c:1476:16: warning: dereference of noderef expression This is because all struct raid5_percpu definitions get marked as __percpu when really only the pointer in r5conf should have that annotation. Fix this by moving the defnition of raid5_precpu out of the definition of struct r5conf. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
|
#
770b1d21 |
|
15-Nov-2021 |
Davidlohr Bueso <dave@stgolabs.net> |
md/raid5: play nice with PREEMPT_RT raid_run_ops() relies on the implicitly disabled preemption for its percpu ops, although this is really about CPU locality. This breaks RT semantics as it can take regular (and thus sleeping) spinlocks, such as stripe_lock. Add a local_lock such that non-RT does not change and continues to be just map to preempt_disable/enable, but makes RT happy as the region will use a per-CPU spinlock and thus be preemptible and still guarantee CPU locality. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
046169f0 |
|
20-Aug-2020 |
Yufen Yu <yuyufen@huawei.com> |
md/raid5: let multiple devices of stripe_head share page In current implementation, grow_buffers() uses alloc_page() to allocate the buffers for each stripe_head, i.e. allocate a page for each dev[i] in stripe_head. After setting stripe_size as a configurable value by writing sysfs entry, it means that we always allocate 64K buffers, but just use 4K of them when stripe_size is 4K in 64KB arm64. To avoid wasting memory, we try to let multiple sh->dev share one real page. That means, multiple sh->dev[i].page will point to the only page with different offset. Example of 64K PAGE_SIZE and 4K stripe_size as following: 64K PAGE_SIZE +---+---+---+---+------------------------------+ | | | | | | | | | | +-+-+-+-+-+-+-+-+------------------------------+ ^ ^ ^ ^ | | | +----------------------------+ | | | | | | +-------------------+ | | | | | | +----------+ | | | | | | +-+ | | | | | | | +-----+-----+------+-----+------+-----+------+------+ sh | offset(0) | offset(4K) | offset(8K) | offset(12K) | + +-----------+------------+------------+-------------+ +----> dev[0].page dev[1].page dev[2].page dev[3].page A new 'pages' array will be added into stripe_head to record shared page used by this stripe_head. Allocate them when grow_buffers() and free them when shrink_buffers(). After trying to share page, the users of sh->dev[i].page need to take care of the related page offset: page of issued bio and page passed to xor compution functions. But thanks for previous different page offset supported. Here, we just need to set correct dev[i].offset. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
7aba13b7 |
|
20-Aug-2020 |
Yufen Yu <yuyufen@huawei.com> |
md/raid5: add a new member of offset into r5dev Add a new member of offset into struct r5dev. It indicates the offset of related dev[i].page. For now, since each device have a privated page, the value is always 0. Thus, we set offset as 0 when allcate page in grow_buffers() and resize_stripes(). To support following different page offset, we try to use the page offset rather than '0' directly for async_memcpy() and ops_run_io(). We try to support different page offset for xor compution functions in the following. To avoid repeatly allocate a new array each time, we add a memory region into scribble buffer to record offset. No functional change. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
0a87b25f |
|
20-Jul-2020 |
Ahmed S. Darwish <a.darwish@linutronix.de> |
raid5: Use sequence counter with associated spinlock A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Song Liu <song@kernel.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-20-a.darwish@linutronix.de
|
#
e2368582 |
|
18-Jul-2020 |
Yufen Yu <yuyufen@huawei.com> |
md/raid5: set default stripe_size as 4096 In RAID5, if issued bio size is bigger than stripe_size, it will be split in the unit of stripe_size and process them one by one. Even for size less then stripe_size, RAID5 also request data from disk at least of stripe_size. Nowdays, stripe_size is equal to the value of PAGE_SIZE. Since filesystem usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE as 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data while RAID5 issue IO at least stripe_size (64KB) each time. That will waste resource of disk bandwidth and compute xor. To avoding the waste, we want to make stripe_size configurable. This patch just set default stripe_size as 4096. User can also set the value bigger than 4KB for some special requirements, such as we know the issued io size is more than 4KB. To evaluate the new feature, we create raid5 device '/dev/md5' with 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE. 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default configure on /mnt directory. Then, trying to test it by dbench with command: dbench -D /mnt -t 1000 10. Result show as: 'stripe_size = 64KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9805011 0.021 64.728 Close 7202525 0.001 0.120 Rename 415213 0.051 44.681 Unlink 1980066 0.079 93.147 Deltree 240 1.793 6.516 Mkdir 120 0.004 0.007 Qpathinfo 8887512 0.007 37.114 Qfileinfo 1557262 0.001 0.030 Qfsinfo 1629582 0.012 0.152 Sfileinfo 798756 0.040 57.641 Find 3436004 0.019 57.782 WriteX 4887239 0.021 57.638 ReadX 15370483 0.005 37.818 LockX 31934 0.003 0.022 UnlockX 31933 0.001 0.021 Flush 687205 13.302 530.088 Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms ------------------------------------------------------- 'stripe_size = 4KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11999166 0.021 36.380 Close 8814128 0.001 0.122 Rename 508113 0.051 29.169 Unlink 2423242 0.070 38.141 Deltree 300 1.885 7.155 Mkdir 150 0.004 0.006 Qpathinfo 10875921 0.007 35.485 Qfileinfo 1905837 0.001 0.032 Qfsinfo 1994304 0.012 0.125 Sfileinfo 977450 0.029 26.489 Find 4204952 0.019 9.361 WriteX 5981890 0.019 27.804 ReadX 18809742 0.004 33.491 LockX 39074 0.003 0.025 UnlockX 39074 0.001 0.014 Flush 841022 10.712 458.848 Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms ------------------------------------------------------- It show that setting stripe_size as 4KB has higher thoughput, i.e. (376.777 vs 307.799) and has smaller latency than that setting as 64KB. 2) We try to evaluate IO throughput for /dev/md5 by fio with config: [4KB randwrite] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=4KB rw=randwrite [64KB write] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=1MB rw=write The result as follow: + + | stripe_size(64KB) | stripe_size(4KB) +----------------------------------------------------+ 4KB randwrite | 15MB/s | 100MB/s +----------------------------------------------------+ 1MB write | 1000MB/s | 700MB/s The result show that when size of io is bigger than 4KB (64KB), 64KB stripe_size has much higher IOPS. But for 4KB randwrite, that means, size of io issued to device are smaller, 4KB stripe_size have better performance. Normally, default value (4096) can get relatively good performance. But if each issued io is bigger than 4096, setting value more than 4096 may get better performance. Here, we just set default stripe_size as 4096, and we will try to support setting different stripe_size by sysfs interface in the following patch. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
c911c46c |
|
18-Jul-2020 |
Yufen Yu <yuyufen@huawei.com> |
md/raid456: convert macro STRIPE_* to RAID5_STRIPE_* Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT(). This patch is prepare for the following adjustable stripe_size. It will not change any existing functionality. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
067df25c |
|
11-Sep-2019 |
Guoqing Jiang <guoqing.jiang@cloud.ionos.com> |
raid5: use bio_end_sector in r5_next_bio Actually, we calculate bio's end sector here, so use the common way for the purpose. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
feb9bf98 |
|
11-Sep-2019 |
Guoqing Jiang <guoqing.jiang@cloud.ionos.com> |
raid5: remove STRIPE_OPS_REQ_PENDING This stripe state is not used anymore after commit 51acbcec6c42b24 ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted state. gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r drivers/md/raid5.c: (1 << STRIPE_OPS_REQ_PENDING) | drivers/md/raid5.h: STRIPE_OPS_REQ_PENDING, Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
|
#
b330e6a4 |
|
12-Mar-2019 |
Kent Overstreet <kent.overstreet@gmail.com> |
md: convert to kvmalloc The code really just wants a big flat buffer, so just do that. Link: http://lkml.kernel.org/r/20181217131929.11727-3-kent.overstreet@gmail.com Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Shaohua Li <shli@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
afeee514 |
|
20-May-2018 |
Kent Overstreet <kent.overstreet@gmail.com> |
md: convert to bioset_init()/mempool_init() Convert md to embedded bio sets. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
2cd259a7 |
|
19-Apr-2018 |
Mariusz Dabrowski <mariusz.dabrowski@intel.com> |
raid5: copy write hint from origin bio to stripe Store write hint from original bio in stripe head so it can be assigned to bio sent to each RAID device. Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
f2785b52 |
|
02-Feb-2018 |
NeilBrown <neilb@suse.com> |
md: document lifetime of internal rdev pointer. The rdev pointer kept in the local 'config' for each for raid1, raid10, raid4/5/6 has non-obvious lifetime rules. Sometimes RCU is needed, sometimes a lock, something nothing. Add documentation to explain this. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
|
#
b2441318 |
|
01-Nov-2017 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
dd7a8f5d |
|
04-Apr-2017 |
NeilBrown <neilb@suse.com> |
md/raid5: make chunk_aligned_read() split bios more cleanly. chunk_aligned_read() currently uses fs_bio_set - which is meant for filesystems to use - and loops if multiple splits are needed, which is not best practice. As this is only used for READ requests, not writes, it is unlikely to cause a problem. However it is best to be consistent in how we split bios, and to follow the pattern used in raid1/raid10. So create a private bioset, bio_split, and use it to perform a single split, submitting the remainder to generic_make_request() for later processing. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
78e470c2 |
|
22-Mar-2017 |
Heinz Mauelshagen <heinzm@redhat.com> |
md: add raid4/5/6 journal mode switching API Commit 2ded370373a4 ("md/r5cache: State machine for raid5-cache write back mode") added support for "write-back" caching on the raid journal device. In order to allow the dm-raid target to switch between the available "write-through" and "write-back" modes, provide a new r5c_journal_mode_set() API. Use the new API in existing r5c_journal_mode_store() Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Shaohua Li <shli@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
#
0472a42b |
|
14-Mar-2017 |
NeilBrown <neilb@suse.com> |
md/raid5: remove over-loading of ->bi_phys_segments. When a read request, which bypassed the cache, fails, we need to retry it through the cache. This involves attaching it to a sequence of stripe_heads, and it may not be possible to get all the stripe_heads we need at once. We do what we can, and record how far we got in ->bi_phys_segments so we can pick up again later. There is only ever one bio which may have a non-zero offset stored in ->bi_phys_segments, the one that is either active in the single thread which calls retry_aligned_read(), or is in conf->retry_read_aligned waiting for retry_aligned_read() to be called again. So we only need to store one offset value. This can be in a local variable passed between remove_bio_from_retry() and retry_aligned_read(), or in the r5conf structure next to the ->retry_read_aligned pointer. Storing it there allows the last usage of ->bi_phys_segments to be removed from md/raid5.c. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
016c76ac |
|
14-Mar-2017 |
NeilBrown <neilb@suse.com> |
md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter md/raid5 needs to keep track of how many stripe_heads are processing a bio so that it can delay calling bio_endio() until all stripe_heads have completed. It currently uses 16 bits of ->bi_phys_segments for this purpose. 16 bits is only enough for 256M requests, and it is possible for a single bio to be larger than this, which causes problems. Also, the bio struct contains a larger counter, __bi_remaining, which has a purpose very similar to the purpose of our counter. So stop using ->bi_phys_segments, and instead use __bi_remaining. This means we don't need to initialize the counter, as our caller initializes it to '1'. It also means we can call bio_endio() directly as it tests this counter internally. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
bd83d0a2 |
|
14-Mar-2017 |
NeilBrown <neilb@suse.com> |
md/raid5: call bio_endio() directly rather than queueing for later. We currently gather bios that need to be returned into a bio_list and call bio_endio() on them all together. The original reason for this was to avoid making the calls while holding a spinlock. Locking has changed a lot since then, and that reason is no longer valid. So discard return_io() and various return_bi lists, and just call bio_endio() directly as needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
16d997b7 |
|
14-Mar-2017 |
NeilBrown <neilb@suse.com> |
md/raid5: simplfy delaying of writes while metadata is updated. If a device fails during a write, we must ensure the failure is recorded in the metadata before the completion of the write is acknowleged. Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.") added code for this, but it was unnecessarily complicated. We already had similar functionality for handling updates to the bad-block-list, thanks to Commit de393cdea66c ("md: make it easier to wait for bad blocks to be acknowledged.") So revert most of the former commit, and instead avoid collecting completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush the metadata and retry the stripe_head. As this change can leave a stripe_head ready for handling immediately after handle_active_stripes() returns, we change raid5_do_work() to pause when MD_CHANGE_PENDING is set, so that it doesn't spin. We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set asynchronously. After analyse_stripe(), we have collected stable data about the state of devices, which will be used to make decisions. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
3418d036 |
|
09-Mar-2017 |
Artur Paszkiewicz <artur.paszkiewicz@intel.com> |
raid5-ppl: Partial Parity Log write logging implementation Implement the calculation of partial parity for a stripe and PPL write logging functionality. The description of PPL is added to the documentation. More details can be found in the comments in raid5-ppl.c. Attach a page for holding the partial parity data to stripe_head. Allocate it only if mddev has the MD_HAS_PPL flag set. Partial parity is the xor of not modified data chunks of a stripe and is calculated as follows: - reconstruct-write case: xor data from all not updated disks in a stripe - read-modify-write case: xor old data and parity from all updated disks in a stripe Implement it using the async_tx API and integrate into raid_run_ops(). It must be called when we still have access to old data, so do it when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is stored into sh->ppl_page. Partial parity is not meaningful for full stripe write and is not stored in the log or used for recovery, so don't attempt to calculate it when stripe has STRIPE_FULL_WRITE. Put the PPL metadata structures to md_p.h because userspace tools (mdadm) will also need to read/write PPL. Warn about using PPL with enabled disk volatile write-back cache for now. It can be removed once disk cache flushing before writing PPL is implemented. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
ff875738 |
|
09-Mar-2017 |
Artur Paszkiewicz <artur.paszkiewicz@intel.com> |
raid5: separate header for log functions Move raid5-cache declarations from raid5.h to raid5-log.h, add inline wrappers for functions which will be shared with ppl and use them in raid5 core instead of direct calls to raid5-cache. Remove unused parameter from r5c_cache_data(), move two duplicated pr_debug() calls to r5l_init_log(). Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
aaf9f12e |
|
03-Mar-2017 |
Shaohua Li <shli@fb.com> |
md/raid5: sort bios Previous patch (raid5: only dispatch IO from raid5d for harddisk raid) defers IO dispatching. The goal is to create better IO pattern. At that time, we don't sort the deffered IO and hope the block layer can do IO merge and sort. Now the raid5-cache writeback could create large amount of bios. And if we enable muti-thread for stripe handling, we can't control when to dispatch IO to raid disks. In a lot of time, we are dispatching IO which block layer can't do merge effectively. This patch moves further for the IO dispatching defer. We accumulate bios, but we don't dispatch all the bios after a threshold is met. This 'dispatch partial portion of bios' stragety allows bios coming in a large time window are sent to disks together. At the dispatching time, there is large chance the block layer can merge the bios. To make this more effective, we dispatch IO in ascending order. This increases request merge chance and reduces disk seek. Signed-off-by: Shaohua Li <shli@fb.com>
|
#
535ae4eb |
|
15-Feb-2017 |
Shaohua Li <shli@fb.com> |
md/raid5: prioritize stripes for writeback In raid5-cache writeback mode, we have two types of stripes to handle. - stripes which aren't cached yet - stripes which are cached and flushing out to raid disks Upperlayer is more sensistive to latency of the first type of stripes generally. But we only one handle list for all these stripes, where the two types of stripes are mixed together. When reclaim flushes a lot of stripes, the first type of stripes could be noticeably delayed. On the other hand, if the log space is tight, we'd like to handle the second type of stripes faster and free log space. This patch destinguishes the two types stripes. They are added into different handle list. When we try to get a stripe to handl, we prefer the first type of stripes unless log space is tight. This should have no impact for !writeback case. Signed-off-by: Shaohua Li <shli@fb.com>
|
#
e33fbb9c |
|
10-Feb-2017 |
Shaohua Li <shli@fb.com> |
md/raid5-cache: exclude reclaiming stripes in reclaim check stripes which are being reclaimed are still accounted into cached stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the stripes and does unnecessary stripe reclaim. In practice, I saw one stripe is reclaimed one time. This will cause bad IO pattern. Fixing this by excluding the reclaing stripes in the check. Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
03b047f4 |
|
11-Jan-2017 |
Song Liu <songliubraving@fb.com> |
md/r5cache: enable chunk_aligned_read with write back cache Chunk aligned read significantly reduces CPU usage of raid456. However, it is not safe to fully bypass the write back cache. This patch enables chunk aligned read with write back cache. For chunk aligned read, we track stripes in write back cache at a bigger granularity, "big_stripe". Each chunk may contain more than one stripe (for example, a 256kB chunk contains 64 4kB-page, so this chunk contain 64 stripes). For chunk_aligned_read, these stripes are grouped into one big_stripe, so we only need one lookup for the whole chunk. For each big_stripe, struct big_stripe_info tracks how many stripes of this big_stripe are in the write back cache. We count how many stripes of this big_stripe are in the write back cache. These counters are tracked in a radix tree (big_stripe_tree). r5c_tree_index() is used to calculate keys for the radix tree. chunk_aligned_read() calls r5c_big_stripe_cached() to look up big_stripe of each chunk in the tree. If this big_stripe is in the tree, chunk_aligned_read() aborts. This look up is protected by rcu_read_lock(). It is necessary to remember whether a stripe is counted in big_stripe_tree. Instead of adding new flag, we reuses existing flags: STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these two flags are set, the stripe is counted in big_stripe_tree. This requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to r5c_try_caching_write(); and moving clear_bit of STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to r5c_finish_stripe_write_out(). Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
765d704d |
|
04-Jan-2017 |
Shaohua Li <shli@fb.com> |
raid5: only dispatch IO from raid5d for harddisk raid We made raid5 stripe handling multi-thread before. It works well for SSD. But for harddisk, the multi-threading creates more disk seek, so not always improve performance. For several hard disks based raid5, multi-threading is required as raid5d becames a bottleneck especially for sequential write. To overcome the disk seek issue, we only dispatch IO from raid5d if the array is harddisk based. Other threads can still handle stripes, but can't dispatch IO. Idealy, we should control IO dispatching order according to IO position interrnally. Right now we still depend on block layer, which isn't very efficient sometimes though. My setup has 9 harddisks, each disk can do around 180M/s sequential write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential write. The test machine uses an ATOM CPU. I measure sequential write with large iodepth bandwidth to raid array: without patch: ~600M/s without patch and group_thread_cnt=4: 750M/s with patch and group_thread_cnt=4: 950M/s with patch, group_thread_cnt=4, skip_copy=1: 1150M/s We are pretty close to the maximum bandwidth in the large iodepth iodepth case. The performance gap of small iodepth sequential write between software raid and theory value is still very big though, because we don't have an efficient pipeline. Cc: NeilBrown <neilb@suse.com> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
2e38a37f |
|
24-Jan-2017 |
Song Liu <songliubraving@fb.com> |
md/r5cache: disable write back for degraded array write-back cache in degraded mode introduces corner cases to the array. Although we try to cover all these corner cases, it is safer to just disable write-back cache when the array is in degraded mode. In this patch, we disable writeback cache for degraded mode: 1. On device failure, if the array enters degraded mode, raid5_error() will submit async job r5c_disable_writeback_async to disable writeback; 2. In r5c_journal_mode_store(), it is invalid to enable writeback in degraded mode; 3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled in write-through mode. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
86aa1397 |
|
12-Jan-2017 |
Song Liu <songliubraving@fb.com> |
md/r5cache: read data into orig_page for prexor of cached data With write back cache, we use orig_page to do prexor. This patch makes sure we read data into orig_page for it. Flag R5_OrigPageUPTDODATE is added to show whether orig_page has the latest data from raid disk. We introduce a helper function uptodate_for_rmw() to simplify the a couple conditions in handle_stripe_dirtying(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
d7bd398e |
|
23-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: handle alloc_page failure RMW of r5c write back cache uses an extra page to store old data for prexor. handle_stripe_dirtying() allocates this page by calling alloc_page(). However, alloc_page() may fail. To handle alloc_page() failures, this patch adds an extra page to disk_info. When alloc_page fails, handle_stripe() trys to use these pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
3bddb7f8 |
|
18-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: handle FLUSH and FUA With raid5 cache, we committing data from journal device. When there is flush request, we need to flush journal device's cache. This was not needed in raid5 journal, because we will flush the journal before committing data to raid disks. This is similar to FUA, except that we also need flush journal for FUA. Otherwise, corruptions in earlier meta data will stop recovery from reaching FUA data. slightly changed the code by Shaohua Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
2c7da14b |
|
17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: sysfs entry journal_mode With write cache, journal_mode is the knob to switch between write-back and write-through. Below is an example: root@virt-test:~/# cat /sys/block/md0/md/journal_mode [write-through] write-back root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode root@virt-test:~/# cat /sys/block/md0/md/journal_mode write-through [write-back] Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
a39f7afd |
|
17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: write-out phase and reclaim support There are two limited resources, stripe cache and journal disk space. For better performance, we priotize reclaim of full stripe writes. To free up more journal space, we free earliest data on the journal. In current implementation, reclaim happens when: 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim if there is no reclaim in the past 5 seconds. 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes, or cached stripes is enough for a full stripe (chunk size / 4k) (r5c_check_cached_full_stripe) 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage) 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data) r5c_do_reclaim() contains new logic of reclaim. For stripe cache: When stripe cache pressure is high (more than 3/4 stripes are cached, or there is empty inactive lists), flush all full stripe. If fewer than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes are flushed, flush some paritial stripes. When stripe cache pressure is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes. For log space: To avoid deadlock due to log space, we need to reserve enough space to flush cached data. The size of required log space depends on total number of cached stripes (stripe_in_journal_count). In current implementation, the writing-out phase automatically include pending data writes with parity writes (similar to write through case). Therefore, we need up to (conf->raid_disks + 1) pages for each cached stripe (1 page for meta data, raid_disks pages for all data and parity). r5c_log_required_to_flush_cache() calculates log space required to flush cache. In the following, we refer to the space calculated by r5c_log_required_to_flush_cache() as reclaim_required_space. Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log device is less than 2x of reclaim_required_space. r5c_cache keeps all data in cache (not fully committed to RAID) in a list (stripe_in_journal_list). These stripes are in the order of their first appearance on the journal. So the log tail (last_checkpoint) should point to the journal_start of the first item in the list. When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is set, the state machine only writes data that are already in the log device (in stripe_in_journal_list). This patch includes a fix to improve performance by Shaohua Li <shli@fb.com>. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
1e6d690b |
|
17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: caching phase of r5cache As described in previous patch, write back cache operates in two phases: caching and writing-out. The caching phase works as: 1. write data to journal (r5c_handle_stripe_dirtying, r5c_cache_data) 2. call bio_endio (r5c_handle_data_cached, r5c_return_dev_pending_writes). Then the writing-out phase is as: 1. Mark the stripe as write-out (r5c_make_stripe_write_out) 2. Calcualte parity (reconstruct or RMW) 3. Write parity (and maybe some other data) to journal device 4. Write data and parity to RAID disks This patch implements caching phase. The cache is integrated with stripe cache of raid456. It leverages code of r5l_log to write data to journal device. Writing-out phase of the cache is implemented in the next patch. With r5cache, write operation does not wait for parity calculation and write out, so the write latency is lower (1 write to journal device vs. read and then write to raid disks). Also, r5cache will reduce RAID overhead (multipile IO due to read-modify-write of parity) and provide more opportunities of full stripe writes. This patch adds 2 flags to stripe_head.state: - STRIPE_R5C_PARTIAL_STRIPE, - STRIPE_R5C_FULL_STRIPE, Instead of inactive_list, stripes with cached data are tracked in r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list. STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for stripes in these lists. Note: stripes in r5c_full/partial_stripe_list are not considered as "active". For RMW, the code allocates an extra page for each data block being updated. This is stored in r5dev->orig_page and the old data is read into it. Then the prexor calculation subtracts ->orig_page from the parity block, and the reconstruct calculation adds the ->page data back into the parity block. r5cache naturally excludes SkipCopy. When the array has write back cache, async_copy_data() will not skip copy. There are some known limitations of the cache implementation: 1. Write cache only covers full page writes (R5_OVERWRITE). Writes of smaller granularity are write through. 2. Only one log io (sh->log_io) for each stripe at anytime. Later writes for the same stripe have to wait. This can be improved by moving log_io to r5dev. 3. With writeback cache, read path must enter state machine, which is a significant bottleneck for some workloads. 4. There is no per stripe checkpoint (with r5l_payload_flush) in the log, so recovery code has to replay more than necessary data (sometimes all the log from last_checkpoint). This reduces availability of the array. This patch includes a fix proposed by ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
2ded3703 |
|
17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: State machine for raid5-cache write back mode This patch adds state machine for raid5-cache. With log device, the raid456 array could operate in two different modes (r5c_journal_mode): - write-back (R5C_MODE_WRITE_BACK) - write-through (R5C_MODE_WRITE_THROUGH) Existing code of raid5-cache only has write-through mode. For write-back cache, it is necessary to extend the state machine. With write-back cache, every stripe could operate in two different phases: - caching - writing-out In caching phase, the stripe handles writes as: - write to journal - return IO In writing-out phase, the stripe behaviors as a stripe in write through mode R5C_MODE_WRITE_THROUGH. STRIPE_R5C_CACHING is added to sh->state to differentiate caching and writing-out phase. Please note: this is a "no-op" patch for raid5-cache write-through mode. The following detailed explanation is copied from the raid5-cache.c: /* * raid5 cache state machine * * With rhe RAID cache, each stripe works in two phases: * - caching phase * - writing-out phase * * These two phases are controlled by bit STRIPE_R5C_CACHING: * if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase * if STRIPE_R5C_CACHING == 1, the stripe is in caching phase * * When there is no journal, or the journal is in write-through mode, * the stripe is always in writing-out phase. * * For write-back journal, the stripe is sent to caching phase on write * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off * the write-out phase by clearing STRIPE_R5C_CACHING. * * Stripes in caching phase do not write the raid disks. Instead, all * writes are committed from the log device. Therefore, a stripe in * caching phase handles writes as: * - write to log device * - return IO * * Stripes in writing-out phase handle writes as: * - calculate parity * - write pending data and parity to journal * - write data and parity to raid disks * - return IO for pending writes */ Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
937621c3 |
|
17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: move some code to raid5.h Move some define and inline functions to raid5.h, so they can be used in raid5-cache.c Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
29c6d1bb |
|
18-Aug-2016 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
md/raid5: Convert to hotplug state machine Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160818125731.27256-10-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
6ab2a4b8 |
|
25-Feb-2016 |
Shaohua Li <shli@fb.com> |
RAID5: revert e9e4c377e2f563 to fix a livelock Revert commit e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe) The problem is raid5_get_active_stripe waits on conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes in this order: - release all stripes with hash 0 - raid5_get_active_stripe still sleeps since active_stripes > max_nr_stripes * 3 / 4 - release all stripes with hash other than 0. active_stripes becomes 0 - raid5_get_active_stripe still sleeps, since nobody wakes up wait_for_stripe[0] The system live locks. The problem is active_stripes isn't a per-hash count. Revert the patch makes the live lock go away. Cc: stable@vger.kernel.org (v4.2+) Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
27a353c0 |
|
24-Feb-2016 |
Shaohua Li <shli@fb.com> |
RAID5: check_reshape() shouldn't call mddev_suspend check_reshape() is called from raid5d thread. raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO finish but IO is handled in raid5d thread, we could easily deadlock here. This issue is introduced by 738a273 ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
6e74a9cf |
|
08-Oct-2015 |
Shaohua Li <shli@fb.com> |
raid5-cache: IO error handling There are 3 places the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining are log write and flush. When the IO error happens, we mark log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too and corresponding daemon can choose setting raid array readonly for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
e6c033f7 |
|
04-Oct-2015 |
Shaohua Li <shli@fb.com> |
raid5-cache: move reclaim stop to quiesce Move reclaim stop to quiesce handling, where is safer for this stuff. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
828cbe98 |
|
02-Sep-2015 |
Shaohua Li <shli@fb.com> |
raid5-cache: optimize FLUSH IO with log enabled With log enabled, bio is written to raid disks after the bio is settled down in log disk. The recovery guarantees we can recovery the bio data from log disk, so we we skip FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
0576b1c6 |
|
13-Aug-2015 |
Shaohua Li <shli@fb.com> |
raid5: log reclaim support This is the reclaim support for raid5 log. A stripe write will have following steps: 1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks 2. hijack ops_run_io. stripe data/parity is appending to log disk 3. flush log disk cache 4. ops_run_io run again and do normal operation. stripe data/parity is written in raid array disks. raid core can return io to upper layer. 5. flush cache of all raid array disks 6. update super block 7. log disk space used by the stripe can be reused In practice, several stripes consist of an io_unit and we will batch several io_unit in different steps, but the whole process doesn't change. It's possible io return just after data/parity hit log disk, but then read IO will need read from log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim run if there is specific reclaimable space (1/4 disk size or 10G) or we are out of space. Reclaim is just to free log disk spaces, it doesn't impact data consistency. The size based force reclaim is to make sure log isn't too big, so recovery doesn't scan log too much. Recovery make sure raid disks and log disk have the same data of a stripe. If crash happens before 4, recovery might/might not recovery stripe's data/parity depending on if data/parity and its checksum matches. In either case, this doesn't change the syntax of an IO write. After step 3, stripe is guaranteed recoverable, because stripe's data/parity is persistent in log disk. In some cases, log disk content and raid disks content of a stripe are the same, but recovery will still copy log disk content to raid disks, this doesn't impact data consistency. space reuse happens after superblock update and cache flush. There is one situation we want to avoid. A broken meta in the middle of a log causes recovery can't find meta at the head of log. If operations require meta at the head persistent in log, we must make sure meta before it persistent in log too. The case is stripe data/parity is in log and we start write stripe to raid disks (before step 4). stripe data/parity must be persistent in log before we do the write to raid disks. The solution is we restrictly maintain io_unit list order. In this case, we only write stripes of an io_unit to raid disks till the io_unit is the first one whose data/parity is in log. The io_unit list order is important for other cases too. For example, some io_unit are reclaimable and others not. They can be mixed in the list, we shouldn't reuse space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
f6bed0ef |
|
13-Aug-2015 |
Shaohua Li <shli@fb.com> |
raid5: add basic stripe log This introduces a simple log for raid5. Data/parity writing to raid array first writes to the log, then write to raid array disks. If crash happens, we can recovery data from the log. This can speed up raid resync and fix write hole issue. The log structure is pretty simple. Data/meta data is stored in block unit, which is 4k generally. It has only one type of meta data block. The meta data block can track 3 types of data, stripe data, stripe parity and flush block. MD superblock will point to the last valid meta data block. Each meta data block has checksum/seq number, so recovery can scan the log correctly. We store a checksum of stripe data/parity to the metadata block, so meta data and stripe data/parity can be written to log disk together. otherwise, meta data write must wait till stripe data/parity is finished. For stripe data, meta data block will record stripe data sector and size. Currently the size is always 4k. This meta data record can be made simpler if we just fix write hole (eg, we can record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, meta data block will record stripe sector. It's size should be 4k (for raid5) or 8k (for raid6). We always store p parity first. This format should work for caching too. flush block indicates a stripe is in raid array disks. Fixing write hole doesn't need this type of meta data, it's for caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
b70abcb2 |
|
13-Aug-2015 |
Shaohua Li <shli@fb.com> |
raid5: add a new state for stripe log handling When a stripe finishes construction, we write the stripe to raid in ops_run_io normally. With log, we do a bunch of other operations before the stripe is written to raid. Mainly write the stripe to log disk, flush disk cache and so on. The operations are still driven by raid5d and run in the stripe state machine. We introduce a new state for such stripe (trapped into log). The stripe is in this state from the time it first enters ops_run_io (finish construction) to the time it is written to raid. Since we know the state is only for log, we bypass other check/operation in handle_stripe. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
6d036f7d |
|
13-Aug-2015 |
Shaohua Li <shli@fb.com> |
raid5: export some functions Next several patches use some raid5 functions, rename them with raid5 prefix and export out. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
c3cce6cd |
|
13-Aug-2015 |
NeilBrown <neilb@suse.com> |
md/raid5: ensure device failure recorded before write request returns. When a write to one of the devices of a RAID5/6 fails, the failure is recorded in the metadata of the other devices so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that completed when MD_CHANGE_PENDING is set to only be processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
|
#
34a6f80e |
|
13-Aug-2015 |
NeilBrown <neilb@suse.com> |
md/raid5: use bio_list for the list of bios to return. This will make it easier to splice two lists together which will be needed in future patch. Signed-off-by: NeilBrown <neilb@suse.com>
|
#
2d5b569b |
|
05-Jul-2015 |
NeilBrown <neilb@suse.com> |
md/raid5: avoid races when changing cache size. Cache size can grow or shrink due to various pressures at any time. So when we resize the cache as part of a 'grow' operation (i.e. change the size to allow more devices) we need to blocks that automatic growing/shrinking. So introduce a mutex. auto grow/shrink uses mutex_trylock() and just doesn't bother if there is a blockage. Resizing the whole cache holds the mutex to ensure that the correct number of new stripes is allocated. This bug can result in some stripes not being freed when an array is stopped. This leads to the kmem_cache not being freed and a subsequent array can try to use the same kmem_cache and get confused. Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2) Signed-off-by: NeilBrown <neilb@suse.com>
|
#
e9e4c377 |
|
08-May-2015 |
Yuanhan Liu <yuanhan.liu@linux.intel.com> |
md/raid5: per hash value and exclusive wait_for_stripe I noticed heavy spin lock contention at get_active_stripe() with fsmark multiple thread write workloads. Here is how this hot contention comes from. We have limited stripes, and it's a multiple thread write workload. Hence, those stripes will be taken soon, which puts later processes to sleep for waiting free stripes. When enough stripes(>= 1/4 total stripes) are released, all process are woken, trying to get the lock. But there is one only being able to get this lock for each hash lock, making other processes spinning out there for acquiring the lock. Thus, it's effectiveless to wakeup all processes and let them battle for a lock that permits one to access only each time. Instead, we could make it be a exclusive wake up: wake up one process only. That avoids the heavy spin lock contention naturally. To do the exclusive wake up, we've to split wait_for_stripe into multiple wait queues, to make it per hash value, just like the hash lock. Here are some test results I have got with this patch applied(all test run 3 times): `fsmark.files_per_sec' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 25.600 ±0.0 92.700 ±2.5 262.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 25.600 ±0.0 77.800 ±0.6 203.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 32.000 ±0.0 93.800 ±1.7 193.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 32.000 ±0.0 81.233 ±1.7 153.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 48.800 ±14.5 99.667 ±2.0 104.2% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 6.400 ±0.0 12.800 ±0.0 100.0% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 63.133 ±8.2 82.800 ±0.7 31.2% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 245.067 ±0.7 306.567 ±7.9 25.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose 17.533 ±0.3 21.000 ±0.8 19.8% ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose 188.167 ±1.9 215.033 ±3.1 14.3% ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync 254.500 ±1.8 290.733 ±2.4 14.2% ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync `time.system_time' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 7235.603 ±1.2 185.163 ±1.9 -97.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 7666.883 ±2.9 202.750 ±1.0 -97.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 14567.893 ±0.7 421.230 ±0.4 -97.1% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 3697.667 ±14.0 148.190 ±1.7 -96.0% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 5572.867 ±3.8 310.717 ±1.4 -94.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 5565.050 ±0.5 313.277 ±1.5 -94.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 2420.707 ±17.1 171.043 ±2.7 -92.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 3743.300 ±4.6 379.827 ±3.5 -89.9% ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose 3308.687 ±6.3 363.050 ±2.0 -89.0% ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose Where, 1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark 1t, 64t: where 't' means thread 4M: means the single file size, corresponding to the '-s' option of fsmark 40G, 30G, 120G: means the total test size 4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means the size of one ramdisk. So, it would be 48G in total. And we made a raid on those ramdisk As you can see, though there are no much performance gain for hard disk workload, the system time is dropped heavily, up to 97%. And as expected, the performance increased a lot, up to 260%, for fast device(ram disk). v2: use bits instead of array to note down wait queue need to wake up. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
b1b46486 |
|
08-May-2015 |
Yuanhan Liu <yuanhan.liu@linux.intel.com> |
md/raid5: split wait_for_stripe and introduce wait_for_quiescent I noticed heavy spin lock contention at get_active_stripe(), introduced at being wake up stage, where a bunch of processes try to re-hold the spin lock again. After giving some thoughts on this issue, I found the lock could be relieved(and even avoided) if we turn the wait_for_stripe to per waitqueue for each lock hash and make the wake up exclusive: wake up one process each time, which avoids the lock contention naturally. Before go hacking with wait_for_stripe, I found it actually has 2 usages: for the array to enter or leave the quiescent state, and also to wait for an available stripe in each of the hash lists. So this patch splits the first usage off into a separate wait_queue, wait_for_quiescent, and the next patch will turn the second usage into one waitqueue for each hash value, and make it exclusive, to relieve the lock contention. v2: wake_up(wait_for_quiescent) when (active_stripes == 0) Commit log refactor suggestion from Neil. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
1b956f7a |
|
20-May-2015 |
NeilBrown <neilb@suse.de> |
md/raid5: be more selective about distributing flags across batch. When a batch of stripes is broken up, we keep some of the flags that were per-stripe, and copy other flags from the head to all others. This only happens while a stripe is being handled, so many of the flags are irrelevant. The "SYNC_FLAGS" (which I've renamed to make it clear there are several) and STRIPE_DEGRADED are set per-stripe and so need to be preserved. STRIPE_INSYNC is the only flag that is set on the head that needs to be propagated to all others. For safety, add a WARN_ON if others are set, except: STRIPE_HANDLE - this is safe and per-stripe and we are going to set in several cases anyway STRIPE_INSYNC STRIPE_IO_STARTED - this is just a hint and doesn't hurt. STRIPE_ON_PLUG_LIST STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched stripe to be on one of these lists, but it can happen as can be safely ignored. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
d0852df5 |
|
26-May-2015 |
NeilBrown <neilb@suse.de> |
md/raid5: close race between STRIPE_BIT_DELAY and batching. When we add a write to a stripe we need to make sure the bitmap bit is set. While doing that the stripe is not locked so it could be added to a batch after which further changes to STRIPE_BIT_DELAY and ->bm_seq are ineffective. So we need to hold off adding to a stripe until bitmap_startwrite has completed at least once, and we need to avoid further changes to STRIPE_BIT_DELAY once the stripe has been added to a batch. If a bitmap_startwrite() completes after the stripe was added to a batch, it will not have set the bit, only incremented a counter, so no extra delay of the stripe is needed. Reported-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
edbe83ab |
|
25-Feb-2015 |
NeilBrown <neilb@suse.de> |
md/raid5: allow the stripe_cache to grow and shrink. The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache as the cost is greater than just having to read more data, it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset memory balance as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
5423399a |
|
25-Feb-2015 |
NeilBrown <neilb@suse.de> |
md/raid5: change ->inactive_blocked to a bit-flag. This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
d06f191f |
|
14-Dec-2014 |
Markus Stockhausen <stockhausen@collogia.de> |
md/raid5: introduce configuration option rmw_level Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control. rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be benefical for slow CPUs with hardware syndrome support and fast SSDs. rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default. rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice but it can be benefical otherwise. E.g. RAID4 with fast dedicated parity disk/SSD. The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
584acdd4 |
|
14-Dec-2014 |
Markus Stockhausen <stockhausen@collogia.de> |
md/raid5: activate raid6 rmw feature Glue it altogehter. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed. 1) Enable xor_syndrome() in the async layer. 2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity. 3) Take care of rmw run in ops_run_reconstruct6(). Again process only the changed pages to get syndrome back into sync. 4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call the xor_syndrome() correspondingly. 5) Adapt the several places where we ignored Q handling up to now. Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4) bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0 skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0 4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s 8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s 16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s 32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s 64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s 128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s 256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s 512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
dabc4ec6 |
|
14-Dec-2014 |
shli@kernel.org <shli@kernel.org> |
raid5: handle expansion/resync case with stripe batching expansion/resync can grab a stripe when the stripe is in batch list. Since all stripes in batch list must be in the same state, we can't allow some stripes run into expansion/resync. So we delay expansion/resync for stripe in batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
72ac7330 |
|
14-Dec-2014 |
shli@kernel.org <shli@kernel.org> |
raid5: handle io error of batch list If io error happens in any stripe of a batch list, the batch list will be split, then normal process will run for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
59fc630b |
|
14-Dec-2014 |
shli@kernel.org <shli@kernel.org> |
RAID5: batch adjacent full stripe write stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k unit. Idealy we should use big size for adjacent full stripe writes. Bigger stripe cache size means less stripes runing in the state machine so can reduce cpu overhead. And also bigger size can cause bigger IO size dispatched to under layer disks. With below patch, we will automatically batch adjacent full stripe write together. Such stripes will be added to the batch list. Only the first stripe of the list will be put to handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes running in handle_stripe() and we send IO of whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe write and can't cross chunk boundary to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list be in exactly the same state in the life circly. I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSD. This patch improves around 30% performance and IO size to under layer disk is exactly 32k. I also run a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
7a87f434 |
|
14-Dec-2014 |
shli@kernel.org <shli@kernel.org> |
raid5: track overwrite disk count Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
da41ba65 |
|
14-Dec-2014 |
shli@kernel.org <shli@kernel.org> |
raid5: add a new flag to track if a stripe can be batched A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
46d5b785 |
|
14-Dec-2014 |
shli@kernel.org <shli@kernel.org> |
raid5: use flex_array for scribble data Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
5c675f83 |
|
14-Dec-2014 |
NeilBrown <neilb@suse.de> |
md: make ->congested robust against personality changes. There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
f72ffdd6 |
|
29-Sep-2014 |
NeilBrown <neilb@suse.de> |
md: remove unwanted white space from md.c My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
d592a996 |
|
21-May-2014 |
Shaohua Li <shli@kernel.org> |
raid5: add an option to avoid copy data from bio to stripe cache The stripe cache has two goals: 1. cache data, so next time if data can be found in stripe cache, disk access can be avoided. 2. stable data. data is copied from bio to stripe cache and calculated parity. data written to disk is from stripe cache, so if upper layer changes bio data, data written to disk isn't impacted. In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES can guarantee 2 too. For 1, it's not common too. block plug mechanism will dispatch a bunch of sequentail small requests together. And since I'm using SSD, I'm using small chunk size. It's rare case stripe cache is really useful. So I'd like to avoid the copy from bio to stripe cache and it's very helpful for performance. In my 1M randwrite tests, avoid the copy can increase the performance more than 30%. Of course, this shouldn't be enabled by default. It's reported enabling BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to control it. Neilb: changed BUG_ON to WARN_ON Removed some assignments from raid5_build_block which are now not needed. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
4bda556a |
|
13-Nov-2013 |
Shaohua Li <shli@kernel.org> |
raid5: relieve lock contention in get_active_stripe() track empty inactive list count, so md_raid5_congested() can use it to make decision. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
566c09c5 |
|
13-Nov-2013 |
Shaohua Li <shli@kernel.org> |
raid5: relieve lock contention in get_active_stripe() get_active_stripe() is the last place we have lock contention. It has two paths. One is stripe isn't found and new stripe is allocated, the other is stripe is found. The first path basically calls __find_stripe and init_stripe. It accesses conf->generation, conf->previous_raid_disks, conf->raid_disks, conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded, conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except stripe_hashtbl and inactive_list, other fields are changed very rarely. With this patch, we split inactive_list and add new hash locks. Each free stripe belongs to a specific inactive list. Which inactive list is determined by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a lock_hash assigned. Stripe's inactive list is protected by a hash lock, which is determined by it's lock_hash too. The lock_hash is derivied from current stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl list too. The goal of the new hash locks introduced is we can only use the new locks in the first path of get_active_stripe(). Since we have several hash locks, lock contention is relieved significantly. The first path of get_active_stripe() accesses other fields, since they are changed rarely, changing them now need take conf->device_lock and all hash locks. For a slow path, this isn't a problem. If we need lock device_lock and hash lock, we always lock hash lock first. The tricky part is release_stripe and friends. We need take device_lock first. Neil's suggestion is we put inactive stripes to a temporary list and readd it to inactive_list after device_lock is released. In this way, we add stripes to temporary list with device_lock hold and remove stripes from the list with hash lock hold. So we don't allow concurrent access to the temporary list, which means we need allocate temporary list for all participants of release_stripe. One downside is free stripes are maintained in their inactive list, they can't across between the lists. By default, we have total 256 stripes and 8 lists, so each list will have 32 stripes. It's possible one list has free stripe but other list hasn't. The chance should be rare because stripes allocation are even distributed. And we can always allocate more stripes for cache, several mega bytes memory isn't a big deal. This completely removes the lock contention of the first path of get_active_stripe(). It slows down the second code path a little bit though because we now need takes two locks, but since the hash lock isn't contended, the overhead should be quite small (several atomic instructions). The second path of get_active_stripe() (basically sequential write or big request size randwrite) still has lock contentions. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
aa5e5dc2 |
|
17-Sep-2013 |
Michael Opdenacker <michael.opdenacker@free-electrons.com> |
treewide: fix "distingush" typo Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
bfc90cb0 |
|
29-Aug-2013 |
Shaohua Li <shli@kernel.org> |
raid5: only wakeup necessary threads If there are not enough stripes to handle, we'd better not always queue all available work_structs. If one worker can only handle small or even none stripes, it will impact request merge and create lock contention. With this patch, the number of work_struct running will depend on pending stripes number. Note: some statistics info used in the patch are accessed without locking protection. This should doesn't matter, we just try best to avoid queue unnecessary work_struct. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
c46501b2 |
|
26-Aug-2013 |
NeilBrown <neilb@suse.de> |
md/raid5: use seqcount to protect access to shape in make_request. make_request() access various shape parameters (raid_disks, chunk_size etc) which might be changed by raid5_start_reshape(). If the later is called at and awkward time during the form, the wrong stripe_head might be used. So introduce a 'seqcount' and after finding a stripe_head make sure there is no reason to expect that we got the wrong one. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
851c30c9 |
|
28-Aug-2013 |
Shaohua Li <shli@kernel.org> |
raid5: offload stripe handle to workqueue This is another attempt to create multiple threads to handle raid5 stripes. This time I use workqueue. raid5 handles request (especially write) in stripe unit. A stripe is page size aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a state machine for the corresponding stripe, which includes reading some disks of the stripe, calculating parity, and writing some disks of the stripe. The state machine is running in raid5d thread currently. Since there is only one thread, it doesn't scale well for high speed storage. An obvious solution is multi-threading. To get better performance, we have some requirements: a. locality. stripe corresponding to request submitted from one cpu is better handled in thread in local cpu or local node. local cpu is preferred but some times could be a bottleneck, for example, parity calculation is too heavy. local node running has wide adaptability. b. configurablity. Different setup of raid5 array might need diffent configuration. Especially the thread number. More threads don't always mean better performance because of lock contentions. My original implementation is creating some kernel threads. There are interfaces to control which cpu's stripe each thread should handle. And userspace can set affinity of the threads. This provides biggest flexibility and configurability. But it's hard to use and apparently a new thread pool implementation is disfavor. Recent workqueue improvement is quite promising. unbound workqueue will be bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to do affinity setting. For example, we can only include one HT sibling in affinity. Since work is non-reentrant by default, and we can control running thread number by limiting dispatched work_struct number. In this patch, I created several stripe worker group. A group is a numa node. stripes from cpus of one node will be added to a group list. Workqueue thread of one node will only handle stripes of worker group of the node. In this way, stripe handling has numa node locality. And as I said, we can control thread number by limiting dispatched work_struct number. The work_struct callback function handles several stripes in one run. A typical work queue usage is to run one unit in each work_struct. In raid5 case, the unit is a stripe. But we can't do that: a. Though handling a stripe doesn't need lock because of reference accounting and stripe isn't in any list, queuing a work_struct for each stripe will make workqueue lock contended very heavily. b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we might dispatch request. If each work_struct only handles one stripe, such block plug is meaningless. This implementation can't do very fine grained configuration. But the numa binding is most popular usage model, should be enough for most workloads. Note: since we have only one stripe queue, switching to multi-thread might decrease request size dispatching down to low level layer. The impact depends on thread number, raid configuration and workload. So multi-thread raid5 might not be proper for all setups. Changes V1 -> V2: 1. remove WQ_NON_REENTRANT 2. disabling multi-threading by default 3. Add more descriptions in changelog Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
773ca82f |
|
27-Aug-2013 |
Shaohua Li <shli@kernel.org> |
raid5: make release_stripe lockless release_stripe still has big lock contention. We just add the stripe to a llist without taking device_lock. We let the raid5d thread to do the real stripe release, which must hold device_lock anyway. In this way, release_stripe doesn't hold any locks. The side effect is the released stripes order is changed. But sounds not a big deal, stripes are never handled in order. And I thought block layer can already do nice request merge, which means order isn't that important. I kept the unplug release batch, which is unnecessary with this patch from lock contention avoid point of view, and actually if we delete it, the stripe_head release_list and lru can share storage. But the unplug release batch is also helpful for request merge. We probably can delay wakeup raid5d till unplug, but I'm still afraid of the case which raid5d is running. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
f94c0b66 |
|
21-Jul-2013 |
NeilBrown <neilb@suse.de> |
md/raid5: fix interaction of 'replace' and 'recovery'. If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over. This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.replacing is set as well. In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record if the replacement has happened. For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test. The latter ensure that now IO has been flagged but not started. The former checks if any parity calculation has been flagged by not started. We must wait for both of these to complete before triggering the 'replace'. Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished. Finally if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning. This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2 (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix. Cc: stable@vger.kernel.org (3.3+) Reported-by: qindehua <13691222965@163.com> Tested-by: qindehua <qindehua@163.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
238f5908 |
|
11-Mar-2013 |
Paul Bolle <pebolle@tiscali.nl> |
md: remove CONFIG_MULTICORE_RAID456 entirely Once instance of this Kconfig macro remained after commit 51acbcec6c42b24482bac18e42befc822524535d ("md: remove CONFIG_MULTICORE_RAID456"). Remove that one too. And, while we're at it, also remove it from the defconfig files that carry it. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
f8dfcffd |
|
11-Mar-2013 |
NeilBrown <neilb@suse.de> |
md/raid5: ensure sync and DISCARD don't happen at the same time. A number of problems can occur due to races between resync/recovery and discard. - if sync_request calls handle_stripe() while a discard is happening on the stripe, it might call handle_stripe_clean_event before all of the individual discard requests have completed (so some devices are still locked, but not all). Since commit ca64cae96037de16e4af92678814f5d4bf0c1c65 md/raid5: Make sure we clear R5_Discard when discard is finished. this will cause R5_Discard to be cleared for the parity device, so handle_stripe_clean_event() will not be called when the other devices do become unlocked, so their ->written will not be cleared. This ultimately leads to a WARN_ON in init_stripe and a lock-up. - If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward time for resync, it can lead to s->uptodate being less than disks in handle_parity_checks5(), which triggers a BUG (because it is one). So: - keep R5_Discard on the parity device until all other devices have completed their discard request - make sure we don't try to have a 'discard' and a 'sync' action at the same time. This involves a new stripe flag to we know when a 'discard' is happening, and the use of R5_Overlap on the parity disk so when a discard is wanted while a sync is active, so we know to wake up the discard at the appropriate time. Discard support for RAID5 was added in 3.7, so this is suitable for any -stable kernel since 3.7. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
620125f2 |
|
10-Oct-2012 |
Shaohua Li <shli@kernel.org> |
MD: raid5 trim support Discard for raid4/5/6 has limitation. If discard request size is small, we do discard for one disk, but we need calculate parity and write parity disk. To correctly calculate parity, zero_after_discard must be guaranteed. Even it's true, we need do discard for one disk but write another disks, which makes the parity disks wear out fast. This doesn't make sense. So an efficient discard for raid4/5/6 should discard all data disks and parity disks, which requires the write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size is smaller than chunk_size, such pattern is almost impossible in practice. So in this patch, I only handle the case that A's size equals to chunk_size. That is discard request should be aligned to stripe size and its size is multiple of stripe size. Since we can only handle request with specific alignment and size (or part of the request fitting stripes), we can't guarantee zero_after_discard even zero_after_discard is true in low level drives. The block layer doesn't send down correctly aligned requests even correct discard alignment is set, so I must filter out. For raid4/5/6 parity calculation, if data is 0, parity is 0. So if zero_after_discard is true for all disks, data is consistent after discard. Otherwise, data might be lost. Let's consider a scenario: discard a stripe, write data to one disk and write parity disk. The stripe could be still inconsistent till then depending on using data from other data disks or parity disks to calculate new parity. If the disk is broken, we can't restore it. So in this patch, we only enable discard support if all disks have zero_after_discard. If discard fails in one disk, we face the similar inconsistent issue above. The patch will make discard follow the same path as normal write request. If discard fails, a resync will be scheduled to make the data consistent. This isn't good to have extra writes, but data consistency is important. If a subsequent read/write request hits raid5 cache of a discarded stripe, the discarded dev page should have zero filled, so the data is consistent. This patch will always zero dev page for discarded request stripe. This isn't optimal because discard request doesn't need such payload. Next patch will avoid it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
8811b596 |
|
01-Aug-2012 |
Shaohua Li <shli@kernel.org> |
raid5: make_request use batch stripe release make_request() does stripe release for every stripe and the stripe usually has count 1, which makes previous release_stripe() optimization not work. In my test, this release_stripe() becomes the heaviest pleace to take conf->device_lock after previous patches applied. Below patch makes stripe release batch. All the stripes will be released in unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe lru. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
3f9e7c14 |
|
30-Jul-2012 |
majianpeng <majianpeng@gmail.com> |
raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer Because bios will merge at block-layer,so bios-error may caused by other bio which be merged into to the same request. Using this flag,it will find exactly error-sector and not do redundant operation like re-write and re-read. V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block layer. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
b17459c0 |
|
19-Jul-2012 |
Shaohua Li <shli@kernel.org> |
raid5: add a per-stripe lock Add a per-stripe lock to protect stripe specific data. The purpose is to reduce lock contention of conf->device_lock. stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio list of the stripe is always serialized by this lock, so adding bio to the lists (add_stripe_bio()) and removing bio from the lists (like ops_run_biofill()) not race. If bio in ->read, ->written ... list are not shared by multiple stripes, we don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will protect them. If the bio are shared, there are two protections: 1. bi_phys_segments acts as a reference count 2. traverse the list uses r5_next_bio, which makes traverse never access bio not belonging to the stripe Let's have an example: | stripe1 | stripe2 | stripe3 | ...bio1......|bio2|bio3|....bio4..... stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2 because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and stripe3 will end_bio bio4. before add_stripe_bio() addes a bio to a stripe, we already increament the bio bi_phys_segments, so don't worry other stripes release the bio. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
bc0934f0 |
|
21-May-2012 |
Shaohua Li <shli@kernel.org> |
raid5: support sync request REQ_SYNC is ignored in current raid5 code. Block layer does use it to do policy, for example ioscheduler. This patch adds it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
b5254dd5 |
|
20-May-2012 |
NeilBrown <neilb@suse.de> |
md/raid5: allow for change in data_offset while managing a reshape. The important issue here is incorporating the different in data_offset into calculations concerning when we might need to over-write data that is still thought to be valid. To this end we find the minimum offset difference across all devices and add that where appropriate. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
9a3e1101 |
|
22-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: detect and handle replacements during recovery. During recovery we want to write to the replacement but not the original. So we have two new flags - R5_NeedReplace if this stripe has a replacement that needs to be written at some stage - R5_WantReplace if NeedReplace, and the data is available, and a 'sync' has been requested on this stripe. We also distinguish between 'sync and replace' which need to read all other devices, and 'replace' which only needs to read the devices being replaced. Note that during resync we always write to any replacement device. It might not need to be written to, but as we don't read to compare, we have to write to be sure. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
977df362 |
|
22-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: writes should get directed to replacement as well as original. When writing, we need to submit two writes, one to the original, and one to the replacement - if there is a replacement. If the write to the replacement results in a write error, we just fail the device. We only try to record write errors to the original. When writing for recovery, we shouldn't write to the original. This will be addressed in a subsequent patch that generally addresses recovery. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
ede7ee8b |
|
22-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: raid5.h cleanup Remove some #defines that are no longer used, and replace some others with an enum. And remove an unused field. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
671488cc |
|
22-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: allow each slot to have an extra replacement device Just enhance data structures to record a second device per slot to be used as a 'replacement' device, replacing the original. We also have a second bio in each slot in each stripe_head. This will only be used when writing to the array - we need to write to both the original and the replacement at the same time, so will need two bios. For now, only try using the replacement drive for aligned-reads. In this case, we prefer the replacement if it has been recovered far enough, otherwise use the original. This includes a small enhancement. Previously we would only do aligned reads if the target device was fully recovered. Now we also do them if it has recovered far enough. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
d1688a6d |
|
10-Oct-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: typedef removal: raid5_conf_t -> struct r5conf Signed-off-by: NeilBrown <neilb@suse.de>
|
#
2b8bf345 |
|
10-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: remove typedefs: mdk_thread_t -> struct md_thread Signed-off-by: NeilBrown <neilb@suse.de>
|
#
fd01b88c |
|
10-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: remove typedefs: mddev_t -> struct mddev Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
|
#
3cb03002 |
|
10-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: removing typedefs: mdk_rdev_t -> struct md_rdev The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
b84db560 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Clear bad blocks on successful write. On a successful write to a known bad block, flag the sh so that raid5d can remove the known bad block from the list. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
bc2607f3 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: write errors should be recorded as bad blocks if possible. When a write error is detected, don't mark the device as failed immediately but rather record the fact for handle_stripe to deal with. Handle_stripe then attempts to record a bad block. Only if that fails does the device get marked as faulty. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
7f0da59b |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: use bad-block log to improve handling of uncorrectable read errors. If we get an uncorrectable read error - record a bad block rather than failing the device. And if these errors (which may be due to known bad blocks) cause recovery to be impossible, record a bad block on the recovering devices, or abort the recovery. As we might abort a recovery without failing a device we need to teach RAID5 about recovery_disabled handling. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
c5709ef6 |
|
25-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: add some more fields to stripe_head_state Adding these three fields will allow more common code to be moved to handle_stripe() struct field rearrangement by Namhyung Kim. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
#
f2b3b44d |
|
25-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: unify stripe_head_state and r6_state 'struct stripe_head_state' stores state about the 'current' stripe that is passed around while handling the stripe. For RAID6 there is an extension structure: r6_state, which is also passed around. There is no value in keeping these separate, so move the fields from the latter into the former. This means that all code now needs to treat s->failed_num as an small array, but this is a small cost. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
#
c4c1663b |
|
25-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: replace sh->lock with an 'active' flag. sh->lock is now mainly used to ensure that two threads aren't running in the locked part of handle_stripe[56] at the same time. That can more neatly be achieved with an 'active' flag which we set while running handle_stripe. If we find the flag is set, we simply requeue the stripe for later by setting STRIPE_HANDLE. For safety we take ->device_lock while examining the state of the stripe and creating a summary in 'stripe_head_state / r6_state'. This possibly isn't needed but as shared fields like ->toread, ->towrite are checked it is safer for now at least. We leave the label after the old 'unlock' called "unlock" because it will disappear in a few patches, so renaming seems pointless. This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE later, but that is not a problem. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
#
83206d66 |
|
25-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Remove use of sh->lock in sync_request This is the start of a series of patches to remove sh->lock. sync_request takes sh->lock before setting STRIPE_SYNCING to ensure there is no race with testing it in handle_stripe[56]. Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early in handle_stripe[56] (after getting the same lock) and perform the same set/clear operations if it was set. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
#
482c0834 |
|
18-Apr-2011 |
NeilBrown <neilb@suse.de> |
md - remove old plugging code. md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches with restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
7eaceacc |
|
10-Mar-2011 |
Jens Axboe <jaxboe@fusionio.com> |
block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
e9c7469b |
|
03-Sep-2010 |
Tejun Heo <tj@kernel.org> |
md: implment REQ_FLUSH/FUA support This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe resconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
9f7c2220 |
|
25-Jul-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: export raid5 unplugging interface. Also remove remaining accesses to ->queue and ->gendisk when ->queue is NULL (As it is in a DM target). Signed-off-by: NeilBrown <neilb@suse.de>
|
#
2ac87401 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: add simple plugging infrastructure. md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
11d8a6e3 |
|
25-Jul-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: export is_congested test the dm module will need this for dm-raid45. Also only access ->queue->backing_dev_info->congested_fn if ->queue actually exists. It won't in a dm target. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
f4be6b43 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk We will shortly allow md devices with no gendisk (they are attached to a dm-target instead). That will cause mdname() to return 'mdX'. There is one place where mdname really needs to be unique: when creating the name for a slab cache. So in that case, if there is no gendisk, you the address of the mddev formatted in HEX to provide a unique name. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
c41d4ac4 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: factor out code for changing size of stripe cache. Separate the actual 'change' code from the sysfs interface so that it can eventually be called internally. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
a29d8b8e |
|
01-Feb-2010 |
Tejun Heo <tj@kernel.org> |
percpu: add __percpu sparse annotations to what's left Add __percpu sparse annotations to places which didn't make it in one of the previous patches. All converions are trivial. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Neil Brown <neilb@suse.de>
|
#
e4424fee |
|
15-Oct-2009 |
NeilBrown <neilb@suse.de> |
md: fix problems with RAID6 calculations for DDF. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
417b8d4a |
|
15-Oct-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid456: downlevel multicore operations to raid_run_ops The percpu conversion allowed a straightforward handoff of stripe processing to the async subsytem that initially showed some modest gains (+4%). However, this model is too simplistic and leads to stripes bouncing between raid5d and the async thread pool for every invocation of handle_stripe(). As reported by Holger this can fall into a pathological situation severely impacting throughput (6x performance loss). By downleveling the parallelism to raid_run_ops the pathological stripe_head bouncing is eliminated. This version still exhibits an average 11% throughput loss for: mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6 echo 1024 > /sys/block/md0/md/stripe_cache_size dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 ...but the results are at least stable and can be used as a base for further multicore experimentation. Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
ac6b53b6 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: asynchronous raid6 operations [ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
ad283ea4 |
|
29-Aug-2009 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: add sum check flags Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
d6f38f31 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid5,6: add percpu scribble region for buffer lists Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
36d1c647 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: move the spare page to a percpu allocation In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
09c9e5fa |
|
17-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md: convert conf->chunk_size and conf->prev_chunk to sectors. This kills some more shifts. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
070ec55d |
|
16-Jun-2009 |
NeilBrown <neilb@suse.de> |
md: remove mddev_to_conf "helper" macro Having a macro just to cast a void* isn't really helpful. I would must rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
c8f517c4 |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5 revise rules for when to update metadata during reshape We currently update the metadata : 1/ every 3Megabytes 2/ When the place we will write new-layout data to is recorded in the metadata as still containing old-layout data. Rule one exists to avoid having to re-do too much reshaping in the face of a crash/restart. So it should really be time based rather than size based. So change it to "every 10 seconds". Rule two turns out to be too harsh when restriping an array 'in-place', as in that case the metadata much be updates for every stripe. For the in-place update, it can only possibly be safe from a crash if some user-space program data a backup of every e.g. few hundred stripes before allowing them to be reshaped. In that case, the constant metadata update is pointless. So only update the metadata if the new metadata will report that the end of the 'old-layout' data is beyond where we are currently writing 'new-layout' data. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
e183eaed |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: prepare for allowing reshape to change layout Add prev_algo to raid5_conf_t along the same lines as prev_chunk and previous_raid_disks. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
784052ec |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: prepare for allowing reshape to change chunksize. Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to remember what the chunk size was before the reshape that is currently underway. This seems like duplication with "chunk_size" and "new_chunk" in mddev_t, and to some extent it is, but there are differences. The values in mddev_t are always defined and often the same. The prev* values are only defined if a reshape is underway. Also (and more significantly) the raid5_conf_t values will be changed at the same time (inside an appropriate lock) that the reshape is started by setting reshape_position. In contrast, the new_chunk value is set when the sysfs file is written which could be well before the reshape starts. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
86b42c71 |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape. During a raid5 reshape, we have some stripes in the cache that are 'before' the reshape (and are still to be processed) and some that are 'after'. They are currently differentiated by having different ->disks values as the only reshape current supported involves changing the number of disks. However we will soon support reshapes that do not change the number of disks (chunk parity or chunk size). So make the difference more explicit with a 'generation' number. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
fef9c61f |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: change reshape-progress measurement to cope with reshaping backwards. When reducing the number of devices in a raid4/5/6, the reshape process has to start at the end of the array and work down to the beginning. So we need to handle expand_progress and expand_lo differently. This patch renames "expand_progress" and "expand_lo" to avoid the implication that anything is getting bigger (expand->reshape) and every place they are used, we make sure that they are used the right way depending on whether delta_disks is positive or negative. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
34e04e87 |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: drop qd_idx from r6_state We now have this value in stripe_head so we don't need to duplicate it. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
f701d589 |
|
30-Mar-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: move raid6 data processing to raid6_pq.ko Move the raid6 data processing routines into a standalone module (raid6_pq) to prepare them to be called from async_tx wrappers and other non-md drivers/modules. This precludes a circular dependency of raid456 needing the async modules for data processing while those modules in turn depend on raid456 for the base level synchronous raid6 routines. To support this move: 1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h 2/ The raid6_call, recovery calls, and table symbols are exported 3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to compile Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
#
91adb564 |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: refactor raid5 "run" .. so that the code to create the private data structures is separate. This will help with future code to change the level of an active array. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
67cc2b81 |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: finish support for DDF/raid6 DDF requires RAID6 calculations over different devices in a different order. For md/raid6, we calculate over just the data devices, starting immediately after the 'Q' block. For ddf/raid6 we calculate over all devices, using zeros in place of the P and Q blocks. This requires unfortunately complex loops... Signed-off-by: NeilBrown <neilb@suse.de>
|
#
99c0fb5f |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: Add support for new layouts for raid5 and raid6. DDF uses different layouts for P and Q blocks than current md/raid6 so add those that are missing. Also add support for RAID6 layouts that are identical to various raid5 layouts with the simple addition of one device to hold all of the 'Q' blocks. Finally add 'raid5' layouts to match raid4. These last to will allow online level conversion. Note that this does not provide correct support for DDF/raid6 yet as the order in which data blocks are summed to produce the Q block is significant and different between current md code and DDF requirements. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
d0dabf7e |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid6: remove expectation that Q device is immediately after P device. Code currently assumes that the devices in a raid6 stripe are 0 1 ... N-1 P Q in some rotated order. We will shortly add new layouts in which this strict pattern is broken. So remove this expectation. We still assume that the data disks are roughly in-order. However P and Q can be inserted anywhere within that order. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
bff61975 |
|
30-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: move lots of #include lines out of .h files and into .c This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>
|
#
ef740c37 |
|
30-Mar-2009 |
Christoph Hellwig <hch@lst.de> |
md: move headers out of include/linux/raid/ Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>
|