#
2b712e3b |
|
25-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused included headers With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
078b8b90 |
|
19-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: merge ordered work callbacks in btrfs_work into one There are two callbacks defined in btrfs_work but only two actually make use of them, otherwise there are NULLs. We can get rid of the freeing callback making it a special case of the normal work. This reduces the size of btrfs_work by 8 bytes, final layout: struct btrfs_work { btrfs_func_t func; /* 0 8 */ btrfs_ordered_func_t ordered_func; /* 8 8 */ struct work_struct normal_work; /* 16 32 */ struct list_head ordered_list; /* 48 16 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btrfs_workqueue * wq; /* 64 8 */ long unsigned int flags; /* 72 8 */ /* size: 80, cachelines: 2, members: 6 */ /* last cacheline: 16 bytes */ }; This in turn reduces size of other structures (on a release config): - async_chunk 160 -> 152 - async_submit_bio 152 -> 144 - btrfs_async_delayed_work 104 -> 96 - btrfs_caching_control 176 -> 168 - btrfs_delalloc_work 144 -> 136 - btrfs_fs_info 3608 -> 3600 - btrfs_ordered_extent 440 -> 424 - btrfs_writepage_fixup 104 -> 96 Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5335f437 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: include trace header in where necessary If we no longer include the tracepoints from ctree.h we fail to compile because we have the dependency in some of the header files and source files. Add the include where we have these dependencies to allow us to remove the include from ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
58e814fc |
|
25-May-2023 |
Tejun Heo <tj@kernel.org> |
btrfs: use alloc_ordered_workqueue() to create ordered workqueues BACKGROUND ========== When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. They may get executed in any order and simultaneously. When fully serialized execution - one by one in the queueing order - is needed, an ordered workqueue should be used which can be created with alloc_ordered_workqueue(). However, alloc_ordered_workqueue() was a later addition. Before it, an ordered workqueue could be obtained by creating an UNBOUND workqueue with @max_active==1. This originally was an implementation side-effect which was broken by 4c16bd327c74 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered"). Because there were users that depended on the ordered execution, 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") made workqueue allocation path to implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered workqueues. While this has worked okay, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. This patch series audits all call sites that create an UNBOUND workqueue w/ @max_active==1 and converts them to alloc_ordered_workqueue() as necessary. BTRFS ===== * fs_info->scrub_workers initialized in scrub_workers_get() was setting @max_active to 1 when @is_dev_replace is set and it seems that the workqueue actually needs to be ordered if @is_dev_replace. Update the code so that alloc_ordered_workqueue() is used if @is_dev_replace. * fs_info->discard_ctl.discard_workers initialized in btrfs_init_workqueues() was directly using alloc_workqueue() w/ @max_active==1. Converted to alloc_ordered_workqueue(). * fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue, which are allocated with btrfs_alloc_workqueue(). btrfs_workqueue implements automatic @max_active adjustment which is disabled when the specified max limit is below a certain threshold, so calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered workqueue whose @max_active won't be changed as the auto-tuning is disabled. This is rather brittle in that nothing clearly indicates that the two workqueues should be ordered or btrfs_alloc_workqueue() must disable auto-tuning when @limit_active==1. This patch factors out the common btrfs_workqueue init code into btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue(). The two workqueues are converted to use the new ordered allocation interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a31b4a43 |
|
17-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue Just let the one caller that wants optional WQ_HIGHPRI handling allocate a separate btrfs_workqueue for that. This allows to rename struct __btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and separate allocation for all btrfs_workqueue users and generally simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
45da9c17 |
|
02-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: fix memory ordering between normal and ordered work functions Ordered work functions aren't guaranteed to be handled by the same thread which executed the normal work functions. The only way execution between normal/ordered functions is synchronized is via the WORK_DONE_BIT, unfortunately the used bitops don't guarantee any ordering whatsoever. This manifested as seemingly inexplicable crashes on ARM64, where async_chunk::inode is seen as non-null in async_cow_submit which causes submit_compressed_extents to be called and crash occurs because async_chunk::inode suddenly became NULL. The call trace was similar to: pc : submit_compressed_extents+0x38/0x3d0 lr : async_cow_submit+0x50/0xd0 sp : ffff800015d4bc20 <registers omitted for brevity> Call trace: submit_compressed_extents+0x38/0x3d0 async_cow_submit+0x50/0xd0 run_ordered_work+0xc8/0x280 btrfs_work_helper+0x98/0x250 process_one_work+0x1f0/0x4ac worker_thread+0x188/0x504 kthread+0x110/0x114 ret_from_fork+0x10/0x18 Fix this by adding respective barrier calls which ensure that all accesses preceding setting of WORK_DONE_BIT are strictly ordered before setting the flag. At the same time add a read barrier after reading of WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads would be strictly ordered after reading the bit. This in turn ensures are all accesses before WORK_DONE_BIT are going to be strictly ordered before any access that can occur in ordered_func. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: 08a9ff326418 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue") CC: stable@vger.kernel.org # 4.4+ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f0cc2cd7 |
|
28-Feb-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix crash during unmount due to race with delayed inode workers During unmount we can have a job from the delayed inode items work queue still running, that can lead to at least two bad things: 1) A crash, because the worker can try to create a transaction just after the fs roots were freed; 2) A transaction leak, because the worker can create a transaction before the fs roots are freed and just after we committed the last transaction and after we stopped the transaction kthread. A stack trace example of the crash: [79011.691214] kernel BUG at lib/radix-tree.c:982! [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2 (...) [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs] [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170 (...) [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293 [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0 [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0 [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000 [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001 [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000 [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0 [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [79011.715448] Call Trace: [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs] [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs] [79011.717925] start_transaction+0xdd/0x5c0 [btrfs] [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs] [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs] [79011.720773] process_one_work+0x26d/0x6a0 [79011.721497] worker_thread+0x4f/0x3e0 [79011.722153] ? process_one_work+0x6a0/0x6a0 [79011.722901] kthread+0x103/0x140 [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70 [79011.724379] ret_from_fork+0x3a/0x50 (...) The following diagram shows a sequence of steps that lead to the crash during ummount of the filesystem: CPU 1 CPU 2 CPU 3 btrfs_punch_hole() btrfs_btree_balance_dirty() btrfs_balance_delayed_items() --> sees fs_info->delayed_root->items with value 200, which is greater than BTRFS_DELAYED_BACKGROUND (128) and smaller than BTRFS_DELAYED_WRITEBACK (512) btrfs_wq_run_delayed_node() --> queues a job for fs_info->delayed_workers to run btrfs_async_run_delayed_root() btrfs_async_run_delayed_root() --> job queued by CPU 1 --> starts picking and running delayed nodes from the prepare_list list close_ctree() btrfs_delete_unused_bgs() btrfs_commit_super() btrfs_join_transaction() --> gets transaction N btrfs_commit_transaction(N) --> set transaction state to TRANTS_STATE_COMMIT_START btrfs_first_prepared_delayed_node() --> picks delayed node X through the prepared_list list btrfs_run_delayed_items() btrfs_first_delayed_node() --> also picks delayed node X but through the node_list list __btrfs_commit_inode_delayed_items() --> runs all delayed items from this node and drops the node's item count to 0 through call to btrfs_release_delayed_inode() --> finishes running any remaining delayed nodes --> finishes transaction commit --> stops cleaner and transaction threads btrfs_free_fs_roots() --> frees all roots and removes them from the radix tree fs_info->fs_roots_radix btrfs_join_transaction() start_transaction() btrfs_record_root_in_trans() record_root_in_trans() radix_tree_tag_set() --> crashes because the root is not in the radix tree anymore If the worker is able to call btrfs_join_transaction() before the unmount task frees the fs roots, we end up leaking a transaction and all its resources, since after the call to btrfs_commit_super() and stopping the transaction kthread, we don't expect to have any transaction open anymore. When this situation happens the worker has a delayed node that has no more items to run, since the task calling btrfs_run_delayed_items(), which is doing a transaction commit, picks the same node and runs all its items first. We can not wait for the worker to complete when running delayed items through btrfs_run_delayed_items(), because we call that function in several phases of a transaction commit, and that could cause a deadlock because the worker calls btrfs_join_transaction() and the task doing the transaction commit may have already set the transaction state to TRANS_STATE_COMMIT_DOING. Also it's not possible to get into a situation where only some of the items of a delayed node are added to the fs/subvolume tree in the current transaction and the remaining ones in the next transaction, because when running the items of a delayed inode we lock its mutex, effectively waiting for the worker if the worker is running the items of the delayed node already. Since this can only cause issues when unmounting a filesystem, fix it in a simple way by waiting for any jobs on the delayed workers queue before calling btrfs_commit_supper() at close_ctree(). This works because at this point no one can call btrfs_btree_balance_dirty() or btrfs_balance_delayed_items(), and if we end up waiting for any worker to complete, btrfs_commit_super() will commit the transaction created by the worker. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e1f60a65 |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add __pure attribute to functions The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure . The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c9eb55db |
|
16-Sep-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of pointless wtag variable in async-thread.c Commit ac0c7cf8be00 ("btrfs: fix crash when tracepoint arguments are freed by wq callbacks") added a void pointer, wtag, which is passed into trace_btrfs_all_work_done() instead of the freed work item. This is silly for a few reasons: 1. The freed work item still has the same address. 2. work is still in scope after it's freed, so assigning wtag doesn't stop anyone from using it. 3. The tracepoint has always taken a void * argument, so assigning wtag doesn't actually make things any more type-safe. (Note that the original bug in commit bc074524e123 ("btrfs: prefix fsid to all trace events") was that the void * was implicitly casted when it was passed to btrfs_work_owner() in the trace point itself). Instead, let's add some clearer warnings as comments. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0cac0ec |
|
16-Sep-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of unique workqueue helper functions Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c495dcd6 |
|
16-Sep-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: don't prematurely free work in run_ordered_work() We hit the following very strange deadlock on a system with Btrfs on a loop device backed by another Btrfs filesystem: 1. The top (loop device) filesystem queues an async_cow work item from cow_file_range_async(). We'll call this work X. 2. Worker thread A starts work X (normal_work_helper()). 3. Worker thread A executes the ordered work for the top filesystem (run_ordered_work()). 4. Worker thread A finishes the ordered work for work X and frees X (work->ordered_free()). 5. Worker thread A executes another ordered work and gets blocked on I/O to the bottom filesystem (still in run_ordered_work()). 6. Meanwhile, the bottom filesystem allocates and queues an async_cow work item which happens to be the recently-freed X. 7. The workqueue code sees that X is already being executed by worker thread A, so it schedules X to be executed _after_ worker thread A finishes (see the find_worker_executing_work() call in process_one_work()). Now, the top filesystem is waiting for I/O on the bottom filesystem, but the bottom filesystem is waiting for the top filesystem to finish, so we deadlock. This happens because we are breaking the workqueue assumption that a work item cannot be recycled while it still depends on other work. Fix it by waiting to free the work item until we are done with all of the related ordered work. P.S.: One might ask why the workqueue code doesn't try to detect a recycled work item. It actually does try by checking whether the work item has the same work function (find_worker_executing_work()), but in our case the function is the same. This is the only key that the workqueue code has available to compare, short of adding an additional, layer-violating "custom key". Considering that we're the only ones that have ever hit this, we should just play by the rules. Unfortunately, we haven't been able to create a minimal reproducer other than our full container setup using a compress-force=zstd filesystem on top of another compress-force=zstd filesystem. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f64ce7b8 |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: async-thread: convert defines to enums Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ce3ded10 |
|
17-Jan-2019 |
David Sterba <dsterba@suse.com> |
btrfs: simplify workqueue name when allocating The workqueue name is constructed from a format string but the prefix does not need to be set by %s. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1d7c514 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- sources Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6939f667 |
|
13-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix confusing worker helper info in stacktrace We've seen the following backtrace stack in ftrace or dmesg log, kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent kworker/u16:10-4244 [000] 241942.480956: kernel_stack: <stack trace> => finish_ordered_fn (ffffffffa0384475) => btrfs_scrubparity_helper (ffffffffa03ca577) <-----"incorrect" => btrfs_freespace_write_helper (ffffffffa03ca98e) <-----"correct" => process_one_work (ffffffff81117b2f) => worker_thread (ffffffff81118c2a) => kthread (ffffffff81121de0) => ret_from_fork (ffffffff81d7087a) btrfs_freespace_write_helper is actually calling normal_worker_helper instead of btrfs_scrubparity_helper, so somehow kernel has parsed the incorrect function address while unwinding the stack, btrfs_scrubparity_helper really shouldn't be shown up. It's caused by compiler doing inline for our helper function, adding a noinline tag can fix that. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use noinline_for_stack ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a35b637 |
|
28-Jun-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: constify tracepoint arguments Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac0c7cf8 |
|
06-Jan-2017 |
David Sterba <dsterba@suse.com> |
btrfs: fix crash when tracepoint arguments are freed by wq callbacks Enabling btrfs tracepoints leads to instant crash, as reported. The wq callbacks could free the memory and the tracepoints started to dereference the members to get to fs_info. The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2 removed the tracepoints but we could preserve them by passing only the required data in a safe way. Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events") CC: stable@vger.kernel.org # 4.8+ Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2939e1a8 |
|
12-Dec-2016 |
Maxim Patlasov <mpatlasov@virtuozzo.com> |
btrfs: limit async_work allocation and worker func duration Problem statement: unprivileged user who has read-write access to more than one btrfs subvolume may easily consume all kernel memory (eventually triggering oom-killer). Reproducer (./mkrmdir below essentially loops over mkdir/rmdir): [root@kteam1 ~]# cat prep.sh DEV=/dev/sdb mkfs.btrfs -f $DEV mount $DEV /mnt for i in `seq 1 16` do mkdir /mnt/$i btrfs subvolume create /mnt/SV_$i ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2` mount -t btrfs -o subvolid=$ID $DEV /mnt/$i chmod a+rwx /mnt/$i done [root@kteam1 ~]# sh prep.sh [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : slabdata 317 317 0 kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 : slabdata 312261 312261 0 kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 0 : slabdata 757086 757086 0 kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 0 : slabdata 1336070 1336070 0 The huge numbers above come from insane number of async_work-s allocated and queued by btrfs_wq_run_delayed_node. The problem is caused by btrfs_wq_run_delayed_node() queuing more and more works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The worker func (btrfs_async_run_delayed_root) processes at least BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery works as expected while the list is almost empty. As soon as it is getting bigger, worker func starts to process more than one item at a time, it takes longer, and the chances to have async_works queued more than needed is getting higher. The problem above is worsened by another flaw of delayed-inode implementation: if async_work was queued in a throttling branch (number of items >= BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that the func occupies CPU infinitely (up to 30sec in my experiments): while the func is trying to drain the list, the user activity may add more and more items to the list. The patch fixes both problems in straightforward way: refuse queuing too many works in btrfs_wq_run_delayed_node and bail out of worker func if at least BTRFS_DELAYED_WRITEBACK items are processed. Changed in v2: remove support of thresh == NO_THRESHOLD. Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Chris Mason <clm@fb.com> Cc: stable@vger.kernel.org # v3.15+
|
#
cb001095 |
|
09-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: plumb fs_info into btrfs_work In order to provide an fsid for trace events, we'll need a btrfs_fs_info pointer. The most lightweight way to do that for btrfs_work structures is to associate it with the __btrfs_workqueue structure. Each queued btrfs_work structure has a workqueue associated with it, so that's a natural fit. It's a privately defined structures, so we add accessors to retrieve the fs_info pointer. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0a95b851 |
|
21-Jan-2016 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: async-thread: Fix a use-after-free error for trace Parameter of trace_btrfs_work_queued() can be freed in its workqueue. So no one use use that pointer after queue_work(). Fix the user-after-free bug by move the trace line before queue_work(). Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
61dd5ae6 |
|
01-Dec-2015 |
David Sterba <dsterba@suse.com> |
btrfs: use GFP_KERNEL for allocations of workqueues We don't have to use GFP_NOFS to allocate workqueue structures, this is done from mount context or potentially scrub start context, safe to fail in both cases. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c6dd6ea5 |
|
19-Aug-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: async_thread: Fix workqueue 'max_active' value when initializing At initializing time, for threshold-able workqueue, it's max_active of kernel workqueue should be 1 and grow if it hits threshold. But due to the bad naming, there is both 'max_active' for kernel workqueue and btrfs workqueue. So wrong value is given at workqueue initialization. This patch fixes it, and to avoid further misunderstanding, change the member name of btrfs_workqueue to 'current_active' and 'limit_active'. Also corresponding comment is added for readability. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
20b2e302 |
|
04-Jun-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() lockdep report following warning in test: [25176.843958] ================================= [25176.844519] [ INFO: inconsistent lock state ] [25176.845047] 4.1.0-rc3 #22 Tainted: G W [25176.845591] --------------------------------- [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes: [25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs] [25176.847838] {SOFTIRQ-ON-W} state was registered at: [25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10 [25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0 [25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410 [25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs] [25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs] [25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs] [25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs] [25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs] [25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs] [25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570 [25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80 [25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f [25176.854935] irq event stamp: 51506 [25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0 [25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120 [25176.857746] other info that might help us debug this: [25176.858845] Possible unsafe locking scenario: [25176.859981] CPU0 [25176.860537] ---- [25176.861059] lock(&wr_ctx->wr_lock); [25176.861705] <Interrupt> [25176.862272] lock(&wr_ctx->wr_lock); [25176.862881] *** DEADLOCK *** Reason: Above warning is caused by: Interrupt -> bio_endio() -> ... -> scrub_put_ctx() -> scrub_free_ctx() *1 -> ... -> mutex_lock(&wr_ctx->wr_lock); scrub_put_ctx() is allowed to be called in end_bio interrupt, but in code design, it will never call scrub_free_ctx(sctx) in interrupe context(above *1), because btrfs_scrub_dev() get one additional reference of sctx->refs, which makes scrub_free_ctx() only called withine btrfs_scrub_dev(). Now the code runs out of our wish, because free sequence in scrub_pending_bio_dec() have a gap. Current code: -----------------------------------+----------------------------------- scrub_pending_bio_dec() | btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); | wake_up(&sctx->list_wait); | | scrub_put_ctx() | -> atomic_dec_and_test(&sctx->refs) scrub_put_ctx(sctx); | -> atomic_dec_and_test(&sctx->refs)| -> scrub_free_ctx() | -----------------------------------+----------------------------------- We expected: -----------------------------------+----------------------------------- scrub_pending_bio_dec() | btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); | wake_up(&sctx->list_wait); | scrub_put_ctx(sctx); | -> atomic_dec_and_test(&sctx->refs)| | scrub_put_ctx() | -> atomic_dec_and_test(&sctx->refs) | -> scrub_free_ctx() -----------------------------------+----------------------------------- Fix: Move scrub_pending_bio_dec() to a workqueue, to avoid this function run in interrupt context. Tested by check tracelog in debug. Changelog v1->v2: Use workqueue instead of adjust function call sequence in v1, because v1 will introduce a bug pointed out by: Filipe David Manana <fdmanana@gmail.com> Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6f011058 |
|
16-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: use correct type for workqueue flags Through all the local wrappers to alloc_workqueue, __alloc_workqueue_key takes an unsigned int. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
5d99a998 |
|
29-Sep-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove unlikely from NULL checks Unlikely is implicit for NULL checks of pointers. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
8b110e39 |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: implement repair function when direct read fails This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9e0af237 |
|
15-Aug-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix task hang under heavy compressed write This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
800ee224 |
|
07-Apr-2014 |
Sergei Trofimovich <slyfox@gentoo.org> |
btrfs: fix crash in remount(thread_pool=) case Reproducer: mount /dev/ubda /mnt mount -oremount,thread_pool=42 /mnt Gives a crash: ? btrfs_workqueue_set_max+0x0/0x70 btrfs_resize_thread_pool+0xe3/0xf0 ? sync_filesystem+0x0/0xc0 ? btrfs_resize_thread_pool+0x0/0xf0 btrfs_remount+0x1d2/0x570 ? kern_path+0x0/0x80 do_remount_sb+0xd9/0x1c0 do_mount+0x26a/0xbf0 ? kfree+0x0/0x1b0 SyS_mount+0xc4/0x110 It's a call btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size); with fs_info->scrub_wr_completion_workers = NULL; as scrub wqs get created only on user's demand. Patch skips not-created-yet workqueues. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> CC: Qu Wenruo <quwenruo@cn.fujitsu.com> CC: Chris Mason <clm@fb.com> CC: Josef Bacik <jbacik@fb.com> CC: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com>
|
#
c3a46891 |
|
12-Mar-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Add trace for btrfs_workqueue alloc/destroy Since most of the btrfs_workqueue is printed as pointer address, for easier analysis, add trace for btrfs_workqueue alloc/destroy. So it is possible to determine the workqueue that a given work belongs to(by comparing the wq pointer address with alloc trace event). Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ef66af10 |
|
11-Mar-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: add missing kfree in btrfs_destroy_workqueue Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
52483bc2 |
|
05-Mar-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Add ftrace for btrfs_workqueue Add ftrace for btrfs_workqueue for further workqueue tunning. This patch needs to applied after the workqueue replace patchset. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
6db8914f |
|
05-Mar-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the btrfs_workqueue related function type The new btrfs_workqueue still use open-coded function defition, this patch will change them into btrfs_func_t type which is much the same as kernel workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
d458b054 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the "_struct" suffix in btrfs_workequeue Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
a046e9c8 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the old btrfs_worker. Since all the btrfs_worker is replaced with the newly created btrfs_workqueue, the old codes can be easily remove. Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
0bd9289c |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Add threshold workqueue based on kernel workqueue The original btrfs_workers has thresholding functions to dynamically create or destroy kthreads. Though there is no such function in kernel workqueue because the worker is not created manually, we can still use the workqueue_set_max_active to simulated the behavior, mainly to achieve a better HDD performance by setting a high threshold on submit_workers. (Sadly, no resource can be saved) So in this patch, extra workqueue pending counters are introduced to dynamically change the max active of each btrfs_workqueue_struct, hoping to restore the behavior of the original thresholding function. Also, workqueue_set_max_active use a mutex to protect workqueue_struct, which is not meant to be called too frequently, so a new interval mechanism is applied, that will only call workqueue_set_max_active after a count of work is queued. Hoping to balance both the random and sequence performance on HDD. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
1ca08976 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Add high priority workqueue support for btrfs_workqueue_struct Add high priority function to btrfs_workqueue. This is implemented by embedding a btrfs_workqueue into a btrfs_workqueue and use some helper functions to differ the normal priority wq and high priority wq. So the high priority wq is completely independent from the normal workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
08a9ff32 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue Use kernel workqueue to implement a new btrfs_workqueue_struct, which has the ordering execution feature like the btrfs_worker. The func is executed in a concurrency way, and the ordred_func/ordered_free is executed in the sequence them are queued after the corresponding func is done. The new btrfs_workqueue works much like the original one, one workqueue for normal work and a list for ordered work. When a work is queued, ordered work will be added to the list and helper function will be queued into the workqueue. The helper function will execute a normal work and then check and execute as many ordered work as possible in the sequence they were queued. At this patch, high priority work queue or thresholding is not added yet. The high priority feature and thresholding will be added in the following patches. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
51b98eff |
|
08-Feb-2014 |
Stanislaw Gruszka <stf_xl@wp.pl> |
btrfs: always choose work from prio_head first In case we do not refill, we can overwrite cur pointer from prio_head by one from not prioritized head, what looks as something that was not intended. This change make we always take works from prio_head first until it's not empty. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
ba69994a |
|
03-Nov-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix __btrfs_start_workers retval __btrfs_start_workers returns 0 in case it raced with btrfs_stop_workers and lost the race. This is wrong because worker in this case is not allowed to start and is in fact destroyed. Return -EINVAL instead. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
67871254 |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Fix checkpatch.pl warning of spacing issues Fix spacing issues detected via checkpatch.pl in accordance with the kernel style guidelines. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
964fb15a |
|
02-Oct-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: eliminate races in worker stopping code The current implementation of worker threads in Btrfs has races in worker stopping code, which cause all kinds of panics and lockups when running btrfs/011 xfstest in a loop. The problem is that btrfs_stop_workers is unsynchronized with respect to check_idle_worker, check_busy_worker and __btrfs_start_workers. E.g., check_idle_worker race flow: btrfs_stop_workers(): check_idle_worker(aworker): - grabs the lock - splices the idle list into the working list - removes the first worker from the working list - releases the lock to wait for its kthread's completion - grabs the lock - if aworker is on the working list, moves aworker from the working list to the idle list - releases the lock - grabs the lock - puts the worker - removes the second worker from the working list ...... btrfs_stop_workers returns, aworker is on the idle list FS is umounted, memory is freed ...... aworker is waken up, fireworks ensue With this applied, I wasn't able to trigger the problem in 48 hours, whereas previously I could reliably reproduce at least one of these races within an hour. Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
e9fbcb42 |
|
25-Jul-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: call the ordered free operation without any locks held Each ordered operation has a free callback, and this was called with the worker spinlock held. Josef made the free callback also call iput, which we can't do with the spinlock. This drops the spinlock for the free operation and grabs it again before moving through the rest of the list. We'll circle back around to this and find a cleaner way that doesn't bounce the lock around so much. Signed-off-by: Chris Mason <chris.mason@fusionio.com> cc: stable@kernel.org
|
#
143bede5 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: return void in functions without error conditions Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
8d532b2a |
|
23-Dec-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix worker lock misuse in find_worker Dan Carpenter noticed that we were doing a double unlock on the worker lock, and sometimes picking a worker thread without the lock held. This fixes both errors. Signed-off-by: Chris Mason <chris.mason@oracle.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
|
#
0dc3b84a |
|
18-Nov-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix num_workers_starting bug and other bugs in async thread Al pointed out we have some random problems with the way we account for num_workers_starting in the async thread stuff. First of all we need to make sure to decrement num_workers_starting if we fail to start the worker, so make __btrfs_start_workers do this. Also fix __btrfs_start_workers so that it doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we failed to create a worker. Also check_pending_worker_creates needs to call __btrfs_start_work in it's work function since it already increments num_workers_starting. People only start one worker at a time, so get rid of the num_workers argument everywhere, and make btrfs_queue_worker a void since it will always succeed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
8f3b65a3 |
|
15-Dec-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a cond_resched() into the worker loop If we have a constant stream of end_io completions or crc work, we can hit softlockup messages from the async helper threads. This adds a cond_resched() into the loop to avoid them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a0acae0e |
|
21-Nov-2011 |
Tejun Heo <tj@kernel.org> |
freezer: unexport refrigerator() and update try_to_freeze() slightly There is no reason to export two functions for entering the refrigerator. Calling refrigerator() instead of try_to_freeze() doesn't save anything noticeable or removes any race condition. * Rename refrigerator() to __refrigerator() and make it return bool indicating whether it scheduled out for freezing. * Update try_to_freeze() to return bool and relay the return value of __refrigerator() if freezing(). * Convert all refrigerator() users to try_to_freeze(). * Update documentation accordingly. * While at it, add might_sleep() to try_to_freeze(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Christoph Hellwig <hch@infradead.org>
|
#
ed3b3d314 |
|
25-May-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't walk around with task->state != TASK_RUNNING Yan Zheng noticed two places we were doing a lot of work without task->state set to TASK_RUNNING. This sets the state properly after we get ready to sleep but decide not to. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
61d92c32 |
|
02-Oct-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix deadlock on async thread startup The btrfs async worker threads are used for a wide variety of things, including processing bio end_io functions. This means that when the endio threads aren't running, the rest of the FS isn't able to do the final processing required to clear PageWriteback. The endio threads also try to exit as they become idle and start more as the work piles up. The problem is that starting more threads means kthreadd may need to allocate ram, and that allocation may wait until the global number of writeback pages on the system is below a certain limit. The result of that throttling is that end IO threads wait on kthreadd, who is waiting on IO to end, which will never happen. This commit fixes the deadlock by handing off thread startup to a dedicated thread. It also fixes a bug where the on-demand thread creation was creating far too many threads because it didn't take into account threads being started by other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6e74057c |
|
15-Sep-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix async thread shutdown race It was possible for an async worker thread to be selected to receive a new work item, but exit before the work item was actually placed into that thread's work list. This commit fixes the race by incrementing the num_pending counter earlier, and making sure to check the number of pending work items before a thread exits. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
627e421a |
|
15-Sep-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix worker thread double spin_lock_irq The exit-on-idle code for async worker threads was incorrectly calling spin_lock_irq with interrupts already off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3e99d8eb |
|
15-Sep-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix async worker startup race After a new worker thread starts, it is placed into the list of idle threads. But, this may race with a check for idle done by the worker thread itself, resulting in a double list_add operation. This fix adds a check to make sure the idle thread addition is done properly. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4f878e84 |
|
07-Aug-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: reduce worker thread spin_lock_irq hold times This changes the btrfs worker threads to batch work items into a local list. It allows us to pull work items in large chunks and significantly reduces the number of times we need to take the worker thread spinlock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4e3f9c50 |
|
05-Aug-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: keep irqs on more often in the worker threads The btrfs worker thread spinlock was being used both for the queueing of IO and for the processing of ordered events. The ordered events never happen from end_io handlers, and so they don't need to use the _irq version of spinlocks. This adds a dedicated lock to the ordered lists so they don't have to run with irqs off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9042846b |
|
04-Aug-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow worker threads to exit when idle The Btrfs worker threads don't currently die off after they have been idle for a while, leading to a lot of threads sitting around doing nothing for each mount. Also, they are unable to start atomically (from end_io hanlders). This commit reworks the worker threads so they can be started from end_io handlers (just setting a flag that asks for a thread to be added at a later date) and so they can exit if they have been idle for a long time. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
29c5e8ce |
|
22-Jul-2009 |
Julia Lawall <julia@diku.dk> |
Btrfs: convert nested spin_lock_irqsave to spin_lock If spin_lock_irqsave is called twice in a row with the same second argument, the interrupt state at the point of the second call overwrites the value saved by the first call. Indeed, the second call does not need to save the interrupt state, so it is changed to a simple spin_lock. Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9b627e9b |
|
02-Jul-2009 |
Jiri Slaby <jirislaby@kernel.org> |
Btrfs: fix use after free in btrfs_start_workers fail path worker memory is already freed on one fail path in btrfs_start_workers, but is still dereferenced. Switch the dereference and kfree. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fd0fb038 |
|
10-Jun-2009 |
Shin Hong <hongshin@gmail.com> |
Btrfs: init worker struct fields before kthread-run This patch fixes a bug which may result race condition between btrfs_start_workers() and worker_loop(). btrfs_start_workers() executed in a parent thread writes on workers->worker and worker_loop() in a child thread reads workers->worker. However, there is no synchronization enforcing the order of two operations. This patch makes btrfs_start_workers() fill workers->worker before it starts a child thread with worker_loop() Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d313d7a3 |
|
20-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a priority queue to the async thread helpers Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a higher priority. But, the checksumming helper threads prevent it from being fully effective. There are two problems. First, a big queue of pending checksumming will delay the synchronous IO behind other lower priority writes. Second, the checksumming uses an ordered async work queue. The ordering makes sure that IOs are sent to the block layer in the same order they are sent to the checksumming threads. Usually this gives us less seeky IO. But, when we start mixing IO priorities, the lower priority IO can delay the higher priority IO. This patch solves both problems by adding a high priority list to the async helper threads, and a new btrfs_set_work_high_prio(), which is used to make put a new async work item onto the higher priority list. The ordering is still done on high priority IO, but all of the high priority bios are ordered separately from the low priority bios. This ordering is purely an IO optimization, it is not involved in data or metadata integrity. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2e966ed2 |
|
02-Apr-2009 |
Jim Owens <jowens@hp.com> |
Btrfs: remove unused ftrace include Signed-off-by: jim owens <jowens@hp.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b5555f77 |
|
02-Apr-2009 |
Amit Gud <amitgud@gmail.com> |
Btrfs: fix race in worker_loop Need to check kthread_should_stop after schedule_timeout() before calling schedule(). This causes threads to sleep with potentially no one to wake them up causing mount(2) to hang in btrfs_stop_workers waiting for threads to stop. Signed-off-by: Amit Gud <gud@ksu.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b51912c9 |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: async threads should try harder to find work Tracing shows the delay between when an async thread goes to sleep and when more work is added is often very short. This commit adds a little bit of delay and extra checking to the code right before we schedule out. It allows more work to be added to the worker without requiring notifications from other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a6837051 |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Catch missed bios in the async bio submission thread The async bio submission thread was missing some bios that were added after it had decided there was no work left to do. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7eaebe7d |
|
21-Jan-2009 |
Huang Weiyi <weiyi.huang@gmail.com> |
Btrfs: removed unused #include <version.h>'s Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d397712b |
|
05-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0df49b91 |
|
12-Nov-2008 |
yanhai zhu <zhu.yanhai@gmail.com> |
Btrfs: Check kthread_should_stop() before schedule() in worker_loop In worker_loop(), the func should check whether it has been requested to stop before it decides to schedule out. Otherwise if the stop request(also the last wake_up()) sent by btrfs_stop_workers() happens when worker_loop() running after the "while" judgement and before schedule(), woker_loop() will schedule away and never be woken up, which will also cause btrfs_stop_workers() wait forever. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4a69a410 |
|
06-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add ordered async work queues Btrfs uses kernel threads to create async work queues for cpu intensive operations such as checksumming and decompression. These work well, but they make it difficult to keep IO order intact. A single writepages call from pdflush or fsync will turn into a number of bios, and each bio is checksummed in parallel. Once the checksum is computed, the bio is sent down to the disk, and since we don't control the order in which the parallel operations happen, they might go down to the disk in almost any order. The code deals with this somewhat by having deep work queues for a single kernel thread, making it very likely that a single thread will process all the bios for a single inode. This patch introduces an explicitly ordered work queue. As work structs are placed into the queue they are put onto the tail of a list. They have three callbacks: ->func (cpu intensive processing here) ->ordered_func (order sensitive processing here) ->ordered_free (free the work struct, all processing is done) The work struct has three callbacks. The func callback does the cpu intensive work, and when it completes the work struct is marked as done. Every time a work struct completes, the list is checked to see if the head is marked as done. If so the ordered_func callback is used to do the order sensitive processing and the ordered_free callback is used to do any cleanup. Then we loop back and check the head of the list again. This patch also changes the checksumming code to use the ordered workqueues. One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
75ccf47d |
|
30-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix multi-device code to use raid policies set by mkfs When reading in block groups, a global mask of the available raid policies should be adjusted based on the types of block groups found on disk. This global mask is then used to decide which raid policy to use for new block groups. The recent allocator changes dropped the call that updated the global mask, making all the block groups allocated at run time single striped onto a single drive. This also fixes the async worker threads to set any thread that uses the requeue mechanism as busy. This allows us to avoid blocking on get_request_wait for the async bio submission threads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d352ac68 |
|
29-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add and improve comments This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b1f55b0 |
|
24-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Remove Btrfs compat code for older kernels Btrfs had compatibility code for kernels back to 2.6.18. These have been removed, and will be maintained in a separate backport git tree from now on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
53863232 |
|
15-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Lower contention on the csum mutex This takes the csum mutex deeper in the call chain and releases it more often. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4854ddd0 |
|
15-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Wait for kernel threads to make progress during async submission Before this change, btrfs would use a bdi congestion function to make sure there weren't too many pending async checksum work items. This change makes the process creating async work items wait instead, leading to fewer congestion returns from the bdi. This improves pdflush background_writeout scanning. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5443be45 |
|
15-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Give all the worker threads descriptive names Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
61b49440 |
|
31-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix streaming read performance with checksumming on Large streaming reads make for large bios, which means each entry on the list async work queues represents a large amount of data. IO congestion throttling on the device was kicking in before the async worker threads decided a single thread was busy and needed some help. The end result was that a streaming read would result in a single CPU running at 100% instead of balancing the work off to other CPUs. This patch also changes the pre-IO checksum lookup done by reads to work on a per-bio basis instead of a per-page. This results in many extra btree lookups on large streaming reads. Doing the checksum lookup right before bio submit allows us to reuse searches while processing adjacent offsets. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3bf10418 |
|
30-Jul-2008 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: async-thread: fix possible memory leak When kthread_run() returns failure, this worker hasn't been added to the list, so btrfs_stop_workers() won't free it. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
35d8ba66 |
|
11-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Worker thread optimizations This changes the worker thread pool to maintain a list of idle threads, avoiding a complex search for a good thread to wake up. Threads have two states: idle - we try to reuse the last thread used in hopes of improving the batching ratios busy - each time a new work item is added to a busy task, the task is rotated to the end of the line. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d05e5a4d |
|
11-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add backport for the kthread work on kernels older than 2.6.20 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8b712842 |
|
11-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add async worker threads for pre and post IO checksumming Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|