#
a8b70c7f |
|
21-Feb-2024 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: don't skip block groups with 100% zone unusable

Commit f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon") changed the behaviour of deleting unused block groups on zoned filesystems. Starting with this commit, we're using btrfs_space_info_used() to calculate the number of used bytes in a space_info. But btrfs_space_info_used() also accounts btrfs_space_info::bytes_zone_unusable as used bytes. So if a block group is 100% zone_unusable it is skipped from the deletion step.

In order not to skip fully zone_unusable block groups, also check if the block group has bytes left that can be used on a zoned filesystem.

Fixes: f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
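A minimal sketch of the shape of the fixed check, with field names inferred from the description above rather than quoted from the patch:

	u64 used = btrfs_space_info_used(space_info, true);

	/*
	 * Space may be needed soon, but if every byte of the block group
	 * is zone_unusable there is nothing left to allocate from it on
	 * a zoned filesystem, so its deletion must not be skipped.
	 */
	if (space_info->total_bytes - block_group->length < used &&
	    block_group->zone_unusable < block_group->length) {
		/* keep the block group and reconsider it later */
	}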
|
#
7ec28f83 |
|
29-Feb-2024 |
Lijuan Li <lilijuan@iscas.ac.cn> |
btrfs: mark btrfs_put_caching_control() static

btrfs_put_caching_control() is only used in block-group.c, so mark it static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97ec3320 |
|
19-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: handle block group lookup error when it's being removed

The unlikely case of a lookup error in btrfs_remove_block_group() can be handled properly; in its caller this leads to a transaction abort. We can't do anything else, as a block group must have been loaded first.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
edebd19a |
|
25-Jan-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add comment about list_is_singular() use at btrfs_delete_unused_bgs()

At btrfs_delete_unused_bgs(), the use of the list_is_singular() check on a block group may not be immediately obvious. It is there to prevent losing raid profile information for a block group type (data, metadata or system), as that information is removed from fs_info->avail_[data|metadata|system]_alloc_bits when the last block group of a given type is deleted. So deleting the block group would later result in creating block groups of that type with a single profile (because fs_info->avail_*_alloc_bits would have a value of 0).

This check was added in commit aefbe9a633b5 ("btrfs: Fix lost-data-profile caused by auto removing bg").

So add a comment mentioning the need for the check.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
44a6c343 |
|
12-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: return errors from unpin_extent_range()

Handle the lookup failure of the block group to unpin: this is a logic error as the block group must exist at this point. If not, something else must have freed it, like clean_pinned_extents() would do without locking the unused_bg_unpin_mutex.

Push the errors to the callers; proper handling will be done in followup patches.

Signed-off-by: David Sterba <dsterba@suse.com>
|
#
12c5128f |
|
25-Jan-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add new unused block groups to the list of unused block groups

Space reservations for metadata are, most of the time, pessimistic as we reserve space for the worst possible cases - where tree heights are at the maximum possible height (8), we need to COW every extent buffer in a tree path, need to split extent buffers, etc.

For data, we generally reserve the exact amount of space we are going to allocate. The exception here is when using compression, in which case we reserve space matching the uncompressed size, as the compression only happens at writeback time and in the worst possible case we need that amount of space in case the data is not compressible.

This means that when there's no available space in the corresponding space_info object, we may need to allocate a new block group, and then that block group might not be used after all. In this case the block group is never added to the list of unused block groups and ends up never being deleted - except if we unmount and mount the fs again, as when reading block groups from disk we add unused ones to the list of unused block groups (fs_info->unused_bgs). Otherwise a block group is only added to the list of unused block groups when we deallocate the last extent from it, so if no extent is ever allocated, the block group is kept around forever.

This also means that if we have a bunch of tasks reserving space in parallel we can end up allocating many block groups that end up never being used or are kept around for too long without being used, which has the potential to result in ENOSPC failures in case, for example, we over allocate too many metadata block groups and then end up in a state without enough unallocated space to allocate a new data block group.

This is more likely to happen with metadata reservations as of kernel 6.7, namely since commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction"), because we started to always reserve space for delayed references when starting a transaction handle for a non-zero number of items, and also to try to reserve space to fill the gap between the delayed block reserve's reserved space and its size.

So to avoid this, when finishing the creation of a new block group, add the block group to the list of unused block groups if it's still unused at that time. This way the next time the cleaner kthread runs, it will delete the block group if it's still unused and not needed to satisfy existing space reservations.

Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
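A hedged sketch of where the new step could sit, at the end of block group creation; the helper name comes from the "add and use helper to check if block group is used" entry further down in this log, and the exact code is an assumption:

	/* Roughly, at the end of btrfs_create_pending_block_groups(): */
	spin_lock(&block_group->lock);
	if (!btrfs_is_block_group_used(block_group)) {
		spin_unlock(&block_group->lock);
		/* Let the cleaner kthread reconsider it on its next run. */
		btrfs_mark_bg_unused(block_group);
	} else {
		spin_unlock(&block_group->lock);
	}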
|
#
f4a9f219 |
|
25-Jan-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not delete unused block group if it may be used soon

Before deleting a block group that is in the list of unused block groups (fs_info->unused_bgs), we check if the block group became used before deleting it, as extents from it may have been allocated after it was added to the list.

However even if the block group was not yet used, there may be tasks that have only reserved space and have not yet allocated extents, and they might be relying on the availability of the unused block group in order to allocate extents. The reservation works first by increasing the "bytes_may_use" field of the corresponding space_info object (which may first require flushing delayed items, allocating a new block group, etc), and only later a task does the actual allocation of extents. For metadata we usually don't end up using all reserved space, as we are pessimistic and typically account for the worst cases (need to COW every single node in a path of a tree at maximum possible height, etc). For data we usually reserve the exact amount of space we're going to allocate later, except when using compression, where we always reserve space based on the uncompressed size, as compression is only triggered when writeback starts, so we don't know in advance how much space we'll actually need, or if the data is compressible.

So don't delete an unused block group if the total size of its space_info object minus the block group's size is less than the sum of used space and space that may be used (space_info->bytes_may_use), as that means we have tasks that reserved space and may need to allocate extents from the block group. In this case, besides skipping the deletion, re-add the block group to the list of unused block groups so that it may be reconsidered later, in case the tasks that reserved space end up not needing to allocate extents from it.

Allowing the deletion of the block group while we have reserved space can result in tasks failing to allocate metadata extents (-ENOSPC) while under a transaction handle, resulting in a transaction abort, or failure during writeback for the case of data extents.

CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1693d544 |
|
25-Jan-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helper to check if block group is used

Add a helper function to determine if a block group is being used and make use of it at btrfs_delete_unused_bgs(). This helper will also be used in future code changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
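A plausible shape for the helper, going by the description above; treat the exact fields as assumptions:

	static inline bool btrfs_is_block_group_used(const struct btrfs_block_group *bg)
	{
		lockdep_assert_held(&bg->lock);

		/* Used means allocated, reserved, or pinned bytes exist. */
		return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
	}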
|
#
eefaf0a1 |
|
05-Dec-2023 |
David Sterba <dsterba@suse.com> |
btrfs: fix typos found by codespell

Signed-off-by: David Sterba <dsterba@suse.com>
|
#
71fca47b |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove stripe size local variable from insert_dev_extents()

There's no need for a local variable to store the stripe size at insert_dev_extents(); we can just take it from the chunk map, as it's only used once and typing 'map->stripe_size' is not much more verbose than simply typing 'stripe_size'. So remove the local variable.

This was added before the recent addition of a dedicated structure for chunk mappings because the stripe size was encoded in the 'orig_block_len' field of an extent_map structure, so the use of the local variable made things more readable.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7dc66abb |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use a dedicated data structure for chunk maps

Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;

2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for an extent_map structure and another one for a map_lookup structure, so more potential for an allocation failure and more complicated code to manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the respective extent map structure: the 'start' field to have the logical start address of the chunk, the 'len' field to have the chunk's size, and the 'orig_block_len' field to contain the chunk's stripe size. Besides wasting memory, it's also odd and not intuitive at all to have the stripe size in a field named 'orig_block_len'. We are also using 'block_len' of the extent_map structure to contain the chunk size, so we have 2 fields for the same value, 'len' and 'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit EXTENT_FLAG_FS_MAPPING on its flags and then make its member named 'map_lookup' point to the associated map_lookup structure. This means that for an extent map associated to an inode extent, we are not using this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split, so it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent chunk mappings; this is basically the existing map_lookup structure with some extra fields:

1) 'start' to contain the chunk logical address;

2) 'chunk_len' to contain the chunk's length;

3) 'stripe_size' for the stripe size;

4) 'rb_node' for insertion into a rb tree;

5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't waste memory for them with unused/unnecessary fields from an extent_map. We also save 8 bytes from the extent_map structure by removing the 'map_lookup' pointer, so the size of struct extent_map is reduced from 144 bytes down to 136 bytes, and we can now have 30 extent maps per 4K page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
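Putting the listed fields together, the new structure plausibly looks something like this; the trailing members carried over from map_lookup are assumed, not quoted from the patch:

	struct btrfs_chunk_map {
		struct rb_node rb_node;	/* insertion into the mapping tree */
		refcount_t refs;	/* reference counting */
		u64 start;		/* chunk logical address */
		u64 chunk_len;		/* chunk length */
		u64 stripe_size;
		/* ... fields carried over from the old map_lookup ... */
		u64 type;
		int num_stripes;
		int sub_stripes;
		struct btrfs_io_stripe stripes[];
	};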
|
#
3128b548 |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: split assert into two different asserts when removing block group

When starting a transaction to remove a block group we have one ASSERT that checks we found an extent map and that the extent map's start offset matches the desired chunk offset. In case one of the conditions fails, we get a stack trace that points to the respective line of code; however we can't tell which condition failed: either there's no extent map or we got one with an unexpected start offset. To make such an issue easier to debug and analyse, split the assertion into two, one for each condition.

This was actually triggered during development of another upcoming change.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
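The before/after shape is simple; a sketch, with 'em' and 'chunk_offset' as assumed names:

	/* Before: one assertion, ambiguous when it trips. */
	ASSERT(em && em->start == chunk_offset);

	/* After: two assertions, each failure names its own condition. */
	ASSERT(em != NULL);
	ASSERT(em->start == chunk_offset);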
|
#
9ef17228 |
|
28-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop reserving excessive space for block group item insertions

Space for block group item insertions, necessary after allocating a new block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_insert_metadata_size(), since we only need to insert a block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_insert_metadata_size() we will need to reserve 2 times less space when using the free space tree, putting less pressure on space reservation.

So use helpers to reserve and release space for block group item insertions that use btrfs_calc_insert_metadata_size() for calculation of the space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f66e0209 |
|
28-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop reserving excessive space for block group item updates

Space for block group item updates, necessary after allocating or deallocating an extent from a block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_metadata_size(), since we only need to update an existing block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_metadata_size() we will need to reserve 4 times less space when using the free space tree and 2 times less space when not using it, putting less pressure on space reservation.

So use helpers to reserve and release space for block group item updates that use btrfs_calc_metadata_size() for calculation of the space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
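The difference between the last two entries boils down to which size helper feeds the reservation; a hedged illustration:

	/* Insertion of a new block group item: full insert cost. */
	u64 insert_bytes = btrfs_calc_insert_metadata_size(fs_info, 1);

	/*
	 * Update of an existing block group item: plain metadata cost,
	 * smaller because no item insertion has to be assumed.
	 */
	u64 update_bytes = btrfs_calc_metadata_size(fs_info, 1);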
|
#
8b9d0322 |
|
22-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove redundant root argument from btrfs_update_inode()

The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
078b8b90 |
|
19-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: merge ordered work callbacks in btrfs_work into one

There are two ordered work callbacks defined in btrfs_work, but only two users actually make use of both; otherwise there are NULLs. We can get rid of the freeing callback, making it a special case of the normal work. This reduces the size of btrfs_work by 8 bytes, final layout:

  struct btrfs_work {
          btrfs_func_t               func;                 /*     0     8 */
          btrfs_ordered_func_t       ordered_func;         /*     8     8 */
          struct work_struct         normal_work;          /*    16    32 */
          struct list_head           ordered_list;         /*    48    16 */
          /* --- cacheline 1 boundary (64 bytes) --- */
          struct btrfs_workqueue *   wq;                   /*    64     8 */
          long unsigned int          flags;                /*    72     8 */

          /* size: 80, cachelines: 2, members: 6 */
          /* last cacheline: 16 bytes */
  };

This in turn reduces the size of other structures (on a release config):

- async_chunk               160 -> 152
- async_submit_bio          152 -> 144
- btrfs_async_delayed_work  104 ->  96
- btrfs_caching_control     176 -> 168
- btrfs_delalloc_work       144 -> 136
- btrfs_fs_info            3608 -> 3600
- btrfs_ordered_extent      440 -> 424
- btrfs_writepage_fixup     104 ->  96

Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4d20c1de |
|
12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove pointless loop from btrfs_update_block_group()

When an extent is allocated or freed, we call btrfs_update_block_group() to update its block group and space info. An extent always belongs to a single block group; it can never span multiple block groups. So the loop we have at btrfs_update_block_group() is pointless, as it always has a single iteration. The loop was added in the very early days, 2007, when the block group code was added in commit 9078a3e1e4e4 ("Btrfs: start of block group code"), but even back then it seemed pointless.

So remove the loop and assert that the block group containing the start offset of the extent also contains the whole extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
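A sketch of what the replacement assertion could look like; variable names are assumed:

	block_group = btrfs_lookup_block_group(info, bytenr);
	ASSERT(block_group != NULL);
	/* The whole [bytenr, bytenr + num_bytes) range must fit. */
	ASSERT(block_group->start <= bytenr &&
	       block_group->start + block_group->length >= bytenr + num_bytes);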
|
#
50564b65 |
|
12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty

When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such a mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty(), which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
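A hedged sketch of the described behaviour; the real function does more, this only shows the new abort path:

	void btrfs_mark_buffer_dirty(struct btrfs_trans_handle *trans,
				     struct extent_buffer *buf)
	{
		/* NULL is only expected from test code using dummy buffers. */
		if (trans &&
		    unlikely(trans->transid != btrfs_header_generation(buf))) {
			btrfs_abort_transaction(trans, -EUCLEAN);
			return;
		}
		/* ... mark the buffer dirty as before ... */
	}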
|
#
adb86dbe |
|
08-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop doing excessive space reservation for csum deletion

Currently when reserving space for deleting the csum items for a data extent, when adding or updating a delayed ref head, we determine how many leaves of csum items we can have and then pass that number to the helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating space for all tree modifications we need when running delayed references, however the amount of space it computes is excessive for deleting csum items because:

1) It uses btrfs_calc_insert_metadata_size() which is excessive because we only need to delete csum items from the csum tree, we don't need to insert any items, so btrfs_calc_metadata_size() is all we need (as it computes space needed to delete an item);

2) If the free space tree is enabled, it doubles the amount of space, which is pointless for csum deletion since we don't need to touch the free space tree or any other tree other than the csum tree.

So improve on this by tracking how many csum deletions we have and using a new helper to calculate space for csum deletions (just a wrapper around btrfs_calc_metadata_size() with a comment). This reduces the amount of space we need to reserve for csum deletions by a factor of 4, and it helps reduce the number of times we have to block space reservations and have the reclaim task enter the space flushing algorithm (flush delayed items, flush delayed refs, etc) in order to satisfy tickets.

For example this results in a total time decrease when unlinking (or truncating) files with many extents, as we end up having to block on space metadata reservations less often. Example test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/test

  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV

  # Use compression to quickly create files with a lot of extents
  # (each with a size of 128K).
  mount -o compress=lzo $DEV $MNT

  # 100G gives at least 983040 extents with a size of 128K.
  xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar

  # Flush all delalloc and clear all metadata from memory.
  umount $MNT
  mount -o compress=lzo $DEV $MNT

  start=$(date +%s%N)
  rm -f $MNT/foobar
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "rm took $dur milliseconds"

  umount $MNT

Before this change rm took: 7504 milliseconds
After this change rm took:  6574 milliseconds  (-12.4%)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
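The "wrapper with a comment" described above plausibly looks like this; the name and placement are assumptions:

	/*
	 * Csum deletions only touch the csum tree: no insertions and no
	 * free space tree updates, so plain metadata size is enough.
	 */
	static inline u64 btrfs_calc_delayed_ref_csum_bytes(const struct btrfs_fs_info *fs_info,
							    int num_csum_items)
	{
		return btrfs_calc_metadata_size(fs_info, num_csum_items);
	}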
|
#
8a526c44 |
|
08-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: allow to run delayed refs by bytes to be released instead of count

When running delayed references, through btrfs_run_delayed_refs(), we can specify how many to run, run all existing delayed references, or keep running delayed references while we can find any. This is controlled with the value of the 'count' argument, where a value of 0 means to run all delayed references that exist by the time btrfs_run_delayed_refs() is called, (unsigned long)-1 means to keep running delayed references while we are able to find any, and any other value means to run that exact number of delayed references.

Typically a specific value other than 0 or -1 is used when flushing space to try to release a certain amount of bytes for a ticket. In this case we just simply calculate how many delayed reference heads correspond to a specific amount of bytes, with calc_delayed_refs_nr(). However that only takes into account the space reserved for the reference heads themselves, and does not account for the space reserved for deleting checksums from the csum tree (see add_delayed_ref_head() and update_existing_head_ref()) in case we are going to delete a data extent. This means we may end up running more delayed references than necessary in case we process delayed references for deleting a data extent.

So change the logic of btrfs_run_delayed_refs() to take a bytes argument to specify how many bytes of delayed references to run/release, using the special values of 0 to mean all existing delayed references and U64_MAX (or (u64)-1) to keep running delayed references while we can find any. This prevents running more delayed references than necessary when we have delayed references for deleting data extents, but it also makes the upcoming changes/patches simpler and is preparatory work for them.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
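In summary, the interface changes roughly like this; the exact prototype is assumed from the description:

	/*
	 * min_bytes == 0:       run all delayed refs existing at call time
	 * min_bytes == U64_MAX: keep running while any delayed ref is found
	 * otherwise:            run enough refs to release about min_bytes
	 *                       of reserved space
	 */
	int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
				   u64 min_bytes);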
|
#
2d6cd791 |
|
03-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between finishing block group creation and its item update

Commit 675dfe1223a6 ("btrfs: fix block group item corruption after inserting new block group") fixed one race that resulted in not persisting a block group's item when its "used" bytes field decreases to zero. However there's another race that can happen in a much shorter time window that results in the same problem. The following sequence of steps explains how it can happen:

1) Task A creates a metadata block group X; its "used" and "commit_used" fields are initialized to 0;

2) Two extents are allocated from block group X, so its "used" field is updated to 32K, and its "commit_used" field remains as 0;

3) Transaction commit starts, by some task B, and it enters btrfs_start_dirty_block_groups(). There it tries to update the block group item for block group X, which currently has its "used" field with a value of 32K and its "commit_used" field with a value of 0. However that fails since the block group item was not yet inserted, so at update_block_group_item(), the btrfs_search_slot() call returns 1, and then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...

4) The block group item is inserted by task A, when for example btrfs_create_pending_block_groups() is called when releasing its transaction handle. This results in insert_block_group_item() inserting the block group item in the extent tree (or block group tree), with a "used" field having a value of 32K and setting "commit_used", in struct btrfs_block_group, to the same value (32K);

5) Task B jumps to the 'fail' label and then resets the "commit_used" field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was returned from update_block_group_item(), we add the block group again to the list of dirty block groups, so that we will try again in the critical section of the transaction commit when calling btrfs_write_dirty_block_groups();

6) Later the two extents from block group X are freed, so its "used" field becomes 0;

7) If no more extents are allocated from block group X before we get into btrfs_write_dirty_block_groups(), then when we call update_block_group_item() again for block group X, we will not update the block group item to reflect that it has 0 bytes used, because the "used" and "commit_used" fields in struct btrfs_block_group have the same value, a value of 0.

As a result after committing the transaction we have an empty block group with its block group item having a 32K value for its "used" field. This will trigger errors from fsck ("btrfs check" command) and, after mounting the fs again, the cleaner kthread will not automatically delete the empty block group, since its "used" field is not 0. Possibly there are other issues due to this inconsistency.

When this issue happens, the error reported by fsck is like this:

  [1/7] checking root items
  [2/7] checking extents
  block group [1104150528 1073741824] used 39796736 but extent items used 0
  ERROR: errors found in extent allocation tree or chunk allocation
  (...)

So fix this by not resetting the "commit_used" field of a block group when we don't find the block group item at update_block_group_item().

Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
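A hedged sketch of the fix in update_block_group_item()'s error path; label and variable names are assumed:

	fail:
		btrfs_release_path(path);
		/*
		 * -ENOENT means the item wasn't inserted yet: the task
		 * inserting it sets commit_used itself, so don't reset it
		 * here and lose that value.
		 */
		if (ret < 0 && ret != -ENOENT) {
			spin_lock(&cache->lock);
			cache->commit_used = old_commit_used;
			spin_unlock(&cache->lock);
		}
		return ret;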
|
#
5a7d107e |
|
07-Aug-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: don't activate non-DATA BG on allocation

Now that a non-DATA block group is activated at write time, don't activate it at allocation time.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
13bb483d |
|
07-Aug-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: activate metadata block group on write time

In the current implementation, block groups are activated at reservation time to ensure that all reserved bytes can be written to an active metadata block group. However, this approach has proven to be less efficient, as it activates block groups more frequently than necessary, putting pressure on the active zone resource and leading to potential issues such as early ENOSPC or hung_task. Another drawback of the current method is that it hampers metadata over-commit, and necessitates additional flush operations and block group allocations, resulting in decreased overall performance.

To address these issues, this commit introduces write-time activation of metadata and system block groups. This involves reserving at least one active block group specifically for metadata and system block groups. Since metadata write-out is always allocated sequentially, when we need to write to a non-active block group, we can wait for the ongoing IOs to complete, activate a new block group, and then proceed with writing to the new block group.

Fixes: b09315139136 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
98b5a8fd |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move btrfs_free_excluded_extents() into block-group.c

The function btrfs_free_excluded_extents() is only used by block-group.c, so move it into block-group.c and make it static. Also remove unnecessary variables that are used only once.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b1c8f527 |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: open code trivial btrfs_add_excluded_extent()

The code for btrfs_add_excluded_extent() is trivial, it's just a set_extent_bit() call. However it's defined in extent-tree.c but it is only used (twice) in block-group.c. So open code it in block-group.c, reducing the need to export a trivial function.

Also, since each caller of btrfs_add_excluded_extent() is prepared to deal with errors, stop ignoring errors from the set_extent_bit() call.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5860f82 |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make find_first_extent_bit() return a boolean

Currently find_first_extent_bit() returns 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
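The resulting prototype would be along these lines; the argument list is assumed:

	/* true if a range with the given bits was found, false otherwise */
	bool find_first_extent_bit(struct extent_io_tree *tree, u64 start,
				   u64 *start_ret, u64 *end_ret, u32 bits,
				   struct extent_state **cached_state);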
|
#
3b9f0995 |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename add_new_free_space() to btrfs_add_new_free_space()

Since add_new_free_space() is exported and used outside block-group.c, rename it to include the 'btrfs_' prefix.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
28f60894 |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update documentation for add_new_free_space()

The documentation for add_new_free_space() is stale and no longer correct:

1) It's no longer used only when caching a block group. It's also called when creating a block group (btrfs_make_block_group()), when reading a block group at mount time (read_one_block_group()) and when reading the free space tree for a block group (typically the first time we attempt to allocate from the block group);

2) It has nothing to do with pinned extents. It only deals with the excluded extents io tree, which is used to track the locations of super blocks in order to make sure we never add the location of a super block to the free space cache of a block group.

So update the documentation and also add a description of the arguments and return values.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc1f91b9 |
|
21-Jul-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: wait for actual caching progress during allocation

Recently we've been having mysterious hangs while running generic/475 on the CI system. This turned out to be something like this:

  Task 1
  dmsetup suspend --nolockfs
  -> __dm_suspend
   -> dm_wait_for_completion
    -> dm_wait_for_bios_completion
     -> Unable to complete because of IO's on a plug in Task 2

  Task 2
  wb_workfn
  -> wb_writeback
   -> blk_start_plug
    -> writeback_sb_inodes
     -> Infinite loop unable to make an allocation

  Task 3
  cache_block_group
  -> read_extent_buffer_pages
   -> Waiting for IO to complete that can't be submitted because Task 1
      suspended the DM device

The problem here is that we need Task 2 to be scheduled completely for the blk plug to flush. Normally this would happen: we normally wait for the block group caching to finish (Task 3), and this schedule would result in the block plug flushing. However, if there's enough free space available from the current caching to satisfy the allocation, we won't actually wait for the caching to complete. This check however just checks that we have enough space, not that we can make the allocation. In this particular case we were trying to allocate 9MiB, and we had 10MiB of free space, but we didn't have 9MiB of contiguous space to allocate, and thus the allocation failed and we looped.

We specifically don't cycle through the FFE loop until we stop finding cached block groups, because we don't want to allocate new block groups just because we're caching, so we short circuit the normal loop once we hit LOOP_CACHING_WAIT and we found a caching block group. This is normally fine, except in this particular case where the caching thread can't make progress because the DM device has been suspended.

Fix this by not only waiting for free space to be >= the amount of space we want to allocate, but also requiring that we make some progress in caching from the time we start waiting. This will keep us from busy looping when the caching is taking a while but still theoretically has enough space for us to allocate from, and it fixes this particular case by forcing us to actually sleep and wait for forward progress, which will flush the plug. With this fix we're no longer hanging with generic/475.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
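A hedged sketch of the strengthened wait; the progress counter and helper names are assumptions based on the description:

	void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache,
						   u64 num_bytes)
	{
		struct btrfs_caching_control *caching_ctl;
		u64 progress;

		caching_ctl = btrfs_get_caching_control(cache);
		if (!caching_ctl)
			return;

		/* Remember where caching stood when we started waiting. */
		progress = atomic64_read(&caching_ctl->progress);

		/*
		 * Enough free space alone is not sufficient (it may not be
		 * contiguous); also require that caching moved forward,
		 * which forces an actual sleep and lets the plug flush.
		 */
		wait_event(caching_ctl->wait, btrfs_block_group_done(cache) ||
			   (cache->free_space_ctl->free_space >= num_bytes &&
			    atomic64_read(&caching_ctl->progress) != progress));

		btrfs_put_caching_control(caching_ctl);
	}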
|
#
d8ccbd21 |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove BUG_ON()'s in add_new_free_space()

At add_new_free_space() we have these BUG_ON()'s that are there to deal with any failure to add free space to the in memory free space cache. Such failures are mostly -ENOMEM, which should be very rare. However there's no need to have these BUG_ON()'s; we can just return any error to the caller, and all callers and their upper call chain are already dealing with errors.

So just make add_new_free_space() return any errors, while removing the BUG_ON()'s, and return the total amount of added free space through an optional u64 pointer argument.

Reported-by: syzbot+3ba856e07b7127889d8c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000e9cb8305ff4e8327@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
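The resulting signature would plausibly be:

	/*
	 * Returns 0 or a negative errno; on success, *total_added_ret (if
	 * not NULL) receives the total amount of free space added.
	 */
	int add_new_free_space(struct btrfs_block_group *block_group,
			       u64 start, u64 end, u64 *total_added_ret);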
|
#
f1a07c2b |
|
02-Jul-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: zoned: fix memory leak after finding block group with super blocks

At exclude_super_stripes(), if we happen to find a block group that has super blocks mapped to it and we are on a zoned filesystem, we error out as this is not supposed to happen, indicating either a bug or maybe some memory corruption. However we are exiting the function without freeing the memory allocated for the logical address of the super blocks. Fix this by freeing the logical address.

Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0657b20c |
|
28-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix use-after-free of new block group that became unused

If a task creates a new block group and that block group becomes unused before we finish its creation, at btrfs_create_pending_block_groups(), then when btrfs_mark_bg_unused() is called against the block group, we assume that the block group is currently in the list of block groups to reclaim, and we move it out of the list of new block groups and into the list of unused block groups. This has two consequences:

1) We move it out of the list of new block groups associated to the current transaction. So the block group creation is not finished and if we attempt to delete the bg because it's unused, we will not find the block group item in the extent tree (or the new block group tree), its device extent items in the device tree etc, resulting in the deletion failing due to the missing items;

2) We don't increment the reference count on the block group when we move it to the list of unused block groups, because we assumed the block group was on the list of block groups to reclaim, and in that case it already has the correct reference count. However the block group was on the list of new block groups, in which case no extra reference was taken because it's local to the current task. This later results in doing an extra reference count decrement when removing the block group from the unused list, eventually leading the reference count to 0.

This second case was caught when running generic/297 from fstests, which produced the following assertion failure and stack trace:

  [589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
  [589.559] ------------[ cut here ]------------
  [589.559] kernel BUG at fs/btrfs/block-group.c:4299!
  [589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
  [589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
  [589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
  [589.561] Code: 68 62 da c0 (...)
  [589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
  [589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
  [589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
  [589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
  [589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
  [589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
  [589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
  [589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
  [589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [589.564] Call Trace:
  [589.564] <TASK>
  [589.565] ? __die_body+0x1b/0x60
  [589.565] ? die+0x39/0x60
  [589.565] ? do_trap+0xeb/0x110
  [589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
  [589.566] ? do_error_trap+0x6a/0x90
  [589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
  [589.566] ? exc_invalid_op+0x4e/0x70
  [589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
  [589.567] ? asm_exc_invalid_op+0x16/0x20
  [589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
  [589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
  [589.567] close_ctree+0x35d/0x560 [btrfs]
  [589.568] ? fsnotify_sb_delete+0x13e/0x1d0
  [589.568] ? dispose_list+0x3a/0x50
  [589.568] ? evict_inodes+0x151/0x1a0
  [589.568] generic_shutdown_super+0x73/0x1a0
  [589.569] kill_anon_super+0x14/0x30
  [589.569] btrfs_kill_super+0x12/0x20 [btrfs]
  [589.569] deactivate_locked_super+0x2e/0x70
  [589.569] cleanup_mnt+0x104/0x160
  [589.570] task_work_run+0x56/0x90
  [589.570] exit_to_user_mode_prepare+0x160/0x170
  [589.570] syscall_exit_to_user_mode+0x22/0x50
  [589.570] ? __x64_sys_umount+0x12/0x20
  [589.571] do_syscall_64+0x48/0x90
  [589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [589.571] RIP: 0033:0x7f497ff0a567
  [589.571] Code: af 98 0e (...)
  [589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  [589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
  [589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
  [589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
  [589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
  [589.573] </TASK>
  [589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
  [589.576] ---[ end trace 0000000000000000 ]---

Fix this by adding a runtime flag to the block group to tell that the block group is still in the list of new block groups, and therefore it should not be moved to the list of unused block groups, at btrfs_mark_bg_unused(), until the flag is cleared, when we finish the creation of the block group at btrfs_create_pending_block_groups().

Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
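A hedged sketch of the guard in btrfs_mark_bg_unused(); the flag name and surrounding details are assumptions based on the description:

	spin_lock(&fs_info->unused_bgs_lock);
	/*
	 * A block group still on its transaction's list of new block
	 * groups must not be moved to the unused list yet: its creation
	 * isn't finished and it holds no extra reference for the
	 * reclaim list.
	 */
	if (list_empty(&bg->bg_list) &&
	    !test_bit(BLOCK_GROUP_FLAG_NEW, &bg->runtime_flags)) {
		btrfs_get_block_group(bg);
		list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
	}
	spin_unlock(&fs_info->unused_bgs_lock);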
|
#
cb091225 |
|
22-Jun-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix remaining u32 overflows when left shifting stripe_nr

There was a regression caused by a97699d1d610 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr"). To avoid code churn the fix was open coding the type casts, but unfortunately it missed one which was still possible to hit [1].

The missing place was the assignment of bioc->full_stripe_logical inside btrfs_map_block().

Fix it by adding a helper that does the safe calculation of the offset and use it everywhere, even though it may not be strictly necessary due to already using u64 types. This replaces all remaining "<< BTRFS_STRIPE_LEN_SHIFT" calls.

[1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/

Fixes: a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
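The helper in question is small; its likely shape (name assumed), doing the 64-bit widening once in one place instead of open coded casts:

	static inline u64 btrfs_stripe_nr_to_offset(u32 stripe_nr)
	{
		/* Widen before shifting so the result cannot overflow u32. */
		return (u64)stripe_nr << BTRFS_STRIPE_LEN_SHIFT;
	}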
|
#
160fe8f6 |
|
05-Jun-2023 |
Matt Corallo <blnxfsl@bluematt.me> |
btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile

Callers of `btrfs_reduce_alloc_profile` expect it to return exactly one allocation profile flag, and failing to do so may ultimately result in a WARN_ON and remount-ro when allocating new blocks, like the below transaction abort on 6.1.

`btrfs_reduce_alloc_profile` has two ways of determining the profile: first, it checks if a conversion balance is currently running and uses the profile we're converting to. If no balance is currently running, it returns the max-redundancy profile which at least one block in the selected block group has. This works by simply checking each known allocation profile bit in redundancy order.

However, `btrfs_reduce_alloc_profile` has not been updated as new flags have been added - first with the `DUP` profile and later with the RAID1C34 profiles. Because of the way it checks, if we have blocks with different profiles and at least one is known, that profile will be selected. However, if none are known we may return a flag set with multiple allocation profiles set.

This is currently only possible when a balance from one of the three unhandled profiles to another of the unhandled profiles is canceled after allocating at least one block using the new profile. In that case, a transaction abort like the below will occur and the filesystem will need to be mounted with -o skip_balance to get it mounted rw again (but the balance cannot be resumed without a similar abort).

  [770.648] ------------[ cut here ]------------
  [770.648] BTRFS: Transaction aborted (error -22)
  [770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
  [770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test
  [770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
  [770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
  [770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
  [770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000
  [770.648] CFAR: c000000000135110 IRQMASK: 0
  GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
  GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
  GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
  GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
  GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
  GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
  GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
  [770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
  [770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
  [770.648] Call Trace:
  [770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
  [770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
  [770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
  [770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
  [770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
  [770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
  [770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
  [770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
  [770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
  [770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
  [770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
  [770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
  [770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
  [770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
  [770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
  [770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
  [770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
  [770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
  [770.648] --- interrupt: c00 at 0x7fff94126800
  [770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
  [770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
  [770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000
  [770.648] IRQMASK: 0
  GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
  GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
  GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
  GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
  GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
  GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
  GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
  GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
  [770.648] NIP [00007fff94126800] 0x7fff94126800
  [770.648] LR [0000000107e0b594] 0x107e0b594
  [770.648] --- interrupt: c00
  [770.648] Instruction dump:
  [770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
  [770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
  [770.648] ---[ end trace 0000000000000000 ]---
  [770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
  [770.648] BTRFS info (device dm-2: state EA): forced readonly
  [770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
  [770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
  [770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
  [770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown

Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
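A hedged sketch of the kind of change described: the if/else chain in btrfs_reduce_alloc_profile gains branches for the previously unhandled profiles, so exactly one bit survives. The precise ordering below is illustrative, not quoted from the patch:

	if (allowed & BTRFS_BLOCK_GROUP_RAID1C4)
		allowed = BTRFS_BLOCK_GROUP_RAID1C4;
	else if (allowed & BTRFS_BLOCK_GROUP_RAID6)
		allowed = BTRFS_BLOCK_GROUP_RAID6;
	else if (allowed & BTRFS_BLOCK_GROUP_RAID1C3)
		allowed = BTRFS_BLOCK_GROUP_RAID1C3;
	else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
		allowed = BTRFS_BLOCK_GROUP_RAID5;
	else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
		allowed = BTRFS_BLOCK_GROUP_RAID10;
	else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
		allowed = BTRFS_BLOCK_GROUP_RAID1;
	else if (allowed & BTRFS_BLOCK_GROUP_DUP)
		allowed = BTRFS_BLOCK_GROUP_DUP;
	else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
		allowed = BTRFS_BLOCK_GROUP_RAID0;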
|
#
7e271809 |
|
05-Jun-2023 |
Naohiro Aota <naota@elisp.net> |
btrfs: reinsert BGs failed to reclaim

The reclaim process can temporarily fail. For example, if the space is getting tight, it fails to make the block group read-only. If there are no further writes on that block group, the block group will never get back to the reclaim list, and the BG never gets reclaimed. In a certain workload, we can leave many such block groups never reclaimed.

So, let's get it back to the list and give it a chance to be reclaimed.

Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
93463ff7 |
|
05-Jun-2023 |
Naohiro Aota <naota@elisp.net> |
btrfs: bail out reclaim process if filesystem is read-only

When a filesystem is read-only, we cannot reclaim a block group as we cannot rewrite the data. Just bail out in that case.

Note that it can drop block groups in this case. As we did sb_start_write(), a read-only filesystem means we got a fatal error and were forced read-only. There is no chance to reclaim them again.

Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a9f18971 |
|
05-Jun-2023 |
Naohiro Aota <naota@elisp.net> |
btrfs: move out now unused BG from the reclaim list

An unused block group is easy to remove to free up space and should be reclaimed fast. Such a block group can often already be a target of the reclaim process. As we check list_empty(&bg->bg_list), we keep it in the reclaim list. That block group is never reclaimed until the filesystem is filled, e.g. up to 75%.

Instead, we can move an unused block group to the unused list and delete it fast.

Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3ed01616 |
|
05-Jun-2023 |
Naohiro Aota <naota@elisp.net> |
btrfs: delete unused BGs while reclaiming BGs

The reclaiming process only starts after the filesystem volumes are allocated to a certain level (75% by default). Thus, the list of reclaim target block groups can have grown huge by the time the reclaim process kicks in. On a test run, there were over 1000 BGs in the reclaim list.

As the reclaim involves rewriting the data, it takes a really long time to reclaim the BGs. While the reclaim is running, btrfs_delete_unused_bgs() won't proceed because the reclaim side is holding fs_info->reclaim_bgs_lock. As a result, we will have a large number of unused BGs kept in the unused list. On my test run, I got 1057 unused BGs.

Since deleting a block group is relatively easy and fast work, we can call btrfs_delete_unused_bgs() while it reclaims BGs, to avoid building up unused BGs.

Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1d126800 |
|
24-May-2023 |
David Sterba <dsterba@suse.com> |
btrfs: drop gfp from parameter extent state helpers

Now that all extent state bit helpers effectively take the GFP_NOFS mask (and GFP_NOWAIT is encoded in the bits) we can remove the parameter. This reduces stack consumption in many functions and simplifies a lot of code.

Net effect on module on a release build:

     text    data     bss     dec     hex filename
  1250432   20985   16088 1287505  13a551 pre/btrfs.ko
  1247074   20985   16088 1284147  139833 post/btrfs.ko

  DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7dde7a8a |
|
24-May-2023 |
David Sterba <dsterba@suse.com> |
btrfs: drop NOFAIL from set_extent_bit allocation masks

The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010 (commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata reservation")), without any explanation why it would be needed. Meanwhile we've updated the semantics of set_extent_bit to handle failed allocations and do unlock, sleep and retry if needed.

The use of the NOFAIL flag is also an outlier; we never want any of the set/clear extent bit helpers to fail, as they're used for many critical changes like extent locking, besides the extent state bit changes.

Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fe1a598c |
|
24-May-2023 |
David Sterba <dsterba@suse.com> |
btrfs: open code set_extent_dirty

The helper is used in only a few places, and after open coding it, the fact that the DIRTY extent bit is being set remains clear at each call site.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7561551e |
|
12-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: scrub: try harder to mark RAID56 block groups read-only

Currently we allow a block group not to be marked read-only for scrub. But for RAID56 block groups, if we require the block group to be read-only, then we're allowed to use cached content from the scrub stripe to reduce unnecessary RAID56 reads.

So this patch would:

- Make btrfs_inc_block_group_ro() try harder. During my tests, for cases like btrfs/061 and btrfs/064, we can hit ENOSPC from btrfs_inc_block_group_ro() calls during scrub. The reason is that if we only have one single data chunk, and try to scrub it, we won't have any space left for any newer data writes. But this check should be done by the caller, especially since for scrub cases we only temporarily mark the chunk read-only, and newer data writes would always try to allocate a new data chunk when needed.

- Return an error for scrub if we failed to mark a RAID56 chunk read-only.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9255d6c |
|
29-Mar-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: scrub: remove the old scrub recheck code The old scrub code has a different entry point to verify the content, and since we have removed the writeback path, we can now start removing the recheck part, including: - scrub_recover structure - scrub_sector::recover member - function scrub_setup_recheck_block() - function scrub_recheck_block() - function scrub_recheck_block_checksum() - function scrub_repair_block_group_good_copy() - function scrub_repair_sector_from_good_copy() - function scrub_is_page_on_raid56() - function full_stripe_lock() - function search_full_stripe_lock() - function get_full_stripe_logical() - function insert_full_stripe_lock() - function lock_full_stripe() - function unlock_full_stripe() - btrfs_block_group::full_stripe_locks_root member - btrfs_full_stripe_locks_tree structure This infrastructure was there to ensure RAID56 scrub handled recovery and P/Q scrub correctly. It is no longer needed: before P/Q scrub we now wait for all the involved data stripes to be scrubbed first, and the RAID56 code has an internal lock to ensure there is no race within the same full stripe. - function scrub_print_warning() - function scrub_get_recover() - function scrub_put_recover() - function scrub_handle_errored_block() - function scrub_setup_recheck_block() - function scrub_bio_wait_endio() - function scrub_submit_raid56_bio_wait() - function scrub_recheck_block_on_raid56() - function scrub_recheck_block() - function scrub_recheck_block_checksum() - function scrub_repair_block_from_good_copy() - function scrub_repair_sector_from_good_copy() And two more functions exported temporarily for later cleanup: - alloc_scrub_sector() - alloc_scrub_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5758d1bd |
|
21-Mar-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove bytes_used argument from btrfs_make_block_group() The only caller of btrfs_make_block_group() always passes 0 as the value for the bytes_used argument, so remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ded22c1 |
|
16-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 There are quite a few div64 calls inside btrfs_map_block() and its variants. Such calls are for @stripe_nr, where @stripe_nr is the number of stripes before our logical bytenr inside a chunk. However, we can eliminate such div64 calls by just reducing the width of @stripe_nr from 64 to 32 bits. This can be done because our chunk size limit is already 10G, with a fixed stripe length of 64K, thus a u32 is definitely enough to contain the number of stripes. With this width reduction, we can get rid of the slower div64 calls and of extra warnings on certain 32-bit arches. This patch does the following: - Add a new tree-checker validation on chunk length, making sure no chunk can reach 256G, which can also act as a bitflip checker. - Reduce the width from u64 to u32 for @stripe_nr variables. - Replace unnecessary div64 calls with regular modulo and division. 32-bit division and modulo are much faster than 64-bit operations, and we are finally free of the div64 fear, at least in the involved functions. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a97699d1 |
|
16-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN Currently btrfs doesn't support stripe lengths other than 64KiB. This is already enforced in the tree-checker. There is really no point in recording that fixed value in map_lookup for now; it can all be replaced with BTRFS_STRIPE_LEN. Furthermore, we can use the fixed stripe length for the following optimizations: - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64-bit divisions. Now we only need to do a right shift. And since the value of BTRFS_STRIPE_LEN itself is already too large to be used as a shift count, accidentally using BTRFS_STRIPE_LEN for a bit shift would trigger a compiler warning, so this bit shift optimization is safe. - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
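A sketch of the shift/mask arithmetic these two commits enable; the constants match a fixed 64KiB stripe, but the variable names are illustrative:

    /* 64KiB fixed stripe: BTRFS_STRIPE_LEN == 1 << BTRFS_STRIPE_LEN_SHIFT */
    #define BTRFS_STRIPE_LEN_SHIFT  16
    #define BTRFS_STRIPE_LEN        (1ULL << BTRFS_STRIPE_LEN_SHIFT)
    #define BTRFS_STRIPE_LEN_MASK   (BTRFS_STRIPE_LEN - 1)

    /* Before: 64-bit division, slow and warning-prone on 32-bit arches. */
    stripe_nr = div64_u64(offset, map->stripe_len);
    stripe_offset = offset - stripe_nr * map->stripe_len;

    /* After: a shift and a mask; stripe_nr fits in a u32 because a chunk
     * is capped below 256G (2^38 / 2^16 stripes < 2^32). */
    stripe_nr = (u32)(offset >> BTRFS_STRIPE_LEN_SHIFT);
    stripe_offset = offset & BTRFS_STRIPE_LEN_MASK;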
|
#
e15acc25 |
|
13-Mar-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: drop space_info->active_total_bytes The space_info->active_total_bytes is no longer necessary, as we now count the region of a newly allocated block group as zone_unusable. Drop its usage. Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
df384da5 |
|
01-Mar-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use temporary variable for space_info in btrfs_update_block_group We do cache->space_info->counter += num_bytes; everywhere in here. This makes the lines longer than they need to be, and it will be especially noticeable when we add the active tracking in, so add a temporary variable for the space_info so this is cleaner. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
675dfe12 |
|
06-Mar-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix block group item corruption after inserting new block group We can often end up inserting a block group item, for a new block group, with a wrong value for the used bytes field. This happens if, for the newly allocated block group, in the same transaction that created the block group, we have tasks allocating extents from it as well as tasks removing extents from it. For example: 1) Task A creates a metadata block group X; 2) Two extents are allocated from block group X, so its "used" field is updated to 32K, and its "commit_used" field remains as 0; 3) Transaction commit starts, by some task B, and it enters btrfs_start_dirty_block_groups(). There it tries to update the block group item for block group X, which currently has its "used" field with a value of 32K. But that fails since the block group item was not yet inserted, and so on failure update_block_group_item() sets the "commit_used" field of the block group back to 0; 4) The block group item is inserted by task A, when for example btrfs_create_pending_block_groups() is called when releasing its transaction handle. This results in insert_block_group_item() inserting the block group item in the extent tree (or block group tree), with a "used" field having a value of 32K, but without updating the "commit_used" field in the block group, which remains with a value of 0; 5) The two extents are freed from block group X, so its "used" field changes from 32K to 0; 6) The transaction commit by task B continues, it enters btrfs_write_dirty_block_groups() which calls update_block_group_item() for block group X, and there it decides to skip the block group item update, because "used" has a value of 0 and "commit_used" has a value of 0 too. As a result, we end up with a block group item having a 32K "used" field but no extents allocated from it. When this issue happens, a btrfs check reports an error like this:

[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)

Fix this by making insert_block_group_item() update the block group's "commit_used" field. Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same") CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
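A minimal sketch of the fix, with simplified locals; the real insert_block_group_item() also builds the key and performs the actual tree insertion:

    static int insert_block_group_item(struct btrfs_trans_handle *trans,
                                       struct btrfs_block_group *block_group)
    {
            struct btrfs_block_group_item bgi;
            u64 used;

            spin_lock(&block_group->lock);
            used = block_group->used;
            /*
             * The fix: remember the value we are about to persist, so a
             * later update_block_group_item() comparing "used" against
             * "commit_used" cannot wrongly skip the update.
             */
            block_group->commit_used = used;
            btrfs_set_stack_block_group_used(&bgi, used);
            spin_unlock(&block_group->lock);

            /* ... insert &bgi into the extent/block group tree ... */
            return 0;
    }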
|
#
95cd356c |
|
21-Feb-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: fix percent calculation for bg reclaim message We have a report that the info message for block group reclaim is crossing the 100% used mark. This happens because we were truncating the divisor for the division (the block_group->length) to a 32-bit value. Fix this by using div64_u64() to not truncate the divisor. In the worst case, it can lead to a division by zero error, and it should be possible to trigger this on a 4-disk RAID0 where each device is large enough:

$ mkfs.btrfs -f /dev/test/scratch[1234] -m raid1 -d raid0
btrfs-progs v6.1
[...]
Filesystem size:    40.00GiB
Block group profiles:
  Data:             RAID0             4.00GiB <<<
  Metadata:         RAID1           256.00MiB
  System:           RAID1             8.00MiB

Reported-by: Forza <forza@tnonline.net> Link: https://lore.kernel.org/linux-btrfs/e99483.c11a58d.1863591ca52@tnonline.net/ Fixes: 5f93e776c673 ("btrfs: zoned: print unusable percentage when reclaiming block groups") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add Qu's note ] Signed-off-by: David Sterba <dsterba@suse.com>
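The one-line nature of the bug, sketched with the fields named in this entry; div_u64() takes a 32-bit divisor, so a block group length that is a multiple of 4GiB truncates to 0:

    /* Before: div_u64() truncates the divisor to u32; a 4GiB-multiple
     * block group length becomes 0 and can divide by zero. */
    percent = div_u64(bg->zone_unusable * 100, bg->length);

    /* After: full 64-bit division keeps the divisor intact. */
    percent = div64_u64(bg->zone_unusable * 100, bg->length);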
|
#
12148367 |
|
15-Feb-2023 |
Boris Burkov <boris@bur.io> |
btrfs: fix potential dead lock in size class loading logic As reported by Filipe, there's a potential deadlock caused by using btrfs_search_forward on commit_root. The locking there is unconditional, even if ->skip_locking and ->search_commit_root is set. It's not meant to be used for commit roots, so it always needs to do locking. So if another task is COWing a child node of the same root node and then needs to wait for block group caching to complete when trying to allocate a metadata extent, it deadlocks. For example: [539604.239315] sysrq: Show Blocked State [539604.240133] task:kworker/u16:6 state:D stack:0 pid:2119594 ppid:2 flags:0x00004000 [539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs] [539604.242673] Call Trace: [539604.243129] <TASK> [539604.243925] __schedule+0x41d/0xee0 [539604.244797] ? rcu_read_lock_sched_held+0x12/0x70 [539604.245399] ? rwsem_down_read_slowpath+0x185/0x490 [539604.246111] schedule+0x5d/0xf0 [539604.246593] rwsem_down_read_slowpath+0x2da/0x490 [539604.247290] ? rcu_barrier_tasks_trace+0x10/0x20 [539604.248090] __down_read_common+0x3d/0x150 [539604.248702] down_read_nested+0xc3/0x140 [539604.249280] __btrfs_tree_read_lock+0x24/0x100 [btrfs] [539604.250097] btrfs_read_lock_root_node+0x48/0x60 [btrfs] [539604.250915] btrfs_search_forward+0x59/0x460 [btrfs] [539604.251781] ? btrfs_global_root+0x50/0x70 [btrfs] [539604.252476] caching_thread+0x1be/0x920 [btrfs] [539604.253167] btrfs_work_helper+0xf6/0x400 [btrfs] [539604.253848] process_one_work+0x24f/0x5a0 [539604.254476] worker_thread+0x52/0x3b0 [539604.255166] ? __pfx_worker_thread+0x10/0x10 [539604.256047] kthread+0xf0/0x120 [539604.256591] ? __pfx_kthread+0x10/0x10 [539604.257212] ret_from_fork+0x29/0x50 [539604.257822] </TASK> [539604.258233] task:btrfs-transacti state:D stack:0 pid:2236474 ppid:2 flags:0x00004000 [539604.259802] Call Trace: [539604.260243] <TASK> [539604.260615] __schedule+0x41d/0xee0 [539604.261205] ? rcu_read_lock_sched_held+0x12/0x70 [539604.262000] ? rwsem_down_read_slowpath+0x185/0x490 [539604.262822] schedule+0x5d/0xf0 [539604.263374] rwsem_down_read_slowpath+0x2da/0x490 [539604.266228] ? lock_acquire+0x160/0x310 [539604.266917] ? rcu_read_lock_sched_held+0x12/0x70 [539604.267996] ? lock_contended+0x19e/0x500 [539604.268720] __down_read_common+0x3d/0x150 [539604.269400] down_read_nested+0xc3/0x140 [539604.270057] __btrfs_tree_read_lock+0x24/0x100 [btrfs] [539604.271129] btrfs_read_lock_root_node+0x48/0x60 [btrfs] [539604.272372] btrfs_search_slot+0x143/0xf70 [btrfs] [539604.273295] update_block_group_item+0x9e/0x190 [btrfs] [539604.274282] btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs] [539604.275381] ? __mutex_unlock_slowpath+0x45/0x280 [539604.276390] btrfs_commit_transaction+0xee/0xed0 [btrfs] [539604.277391] ? lock_acquire+0x1a4/0x310 [539604.278080] ? start_transaction+0xcb/0x6c0 [btrfs] [539604.279099] transaction_kthread+0x142/0x1c0 [btrfs] [539604.279996] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [539604.280673] kthread+0xf0/0x120 [539604.281050] ? __pfx_kthread+0x10/0x10 [539604.281496] ret_from_fork+0x29/0x50 [539604.281966] </TASK> [539604.282255] task:fsstress state:D stack:0 pid:2236483 ppid:1 flags:0x00004006 [539604.283897] Call Trace: [539604.284700] <TASK> [539604.285088] __schedule+0x41d/0xee0 [539604.285660] schedule+0x5d/0xf0 [539604.286175] btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs] [539604.287342] ? __pfx_autoremove_wake_function+0x10/0x10 [539604.288450] find_free_extent+0xd93/0x1750 [btrfs] [539604.289256] ? 
_raw_spin_unlock+0x29/0x50 [539604.289911] ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs] [539604.290843] btrfs_reserve_extent+0x147/0x290 [btrfs] [539604.291943] btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs] [539604.292903] __btrfs_cow_block+0x138/0x580 [btrfs] [539604.293773] btrfs_cow_block+0x10e/0x240 [btrfs] [539604.294595] btrfs_search_slot+0x7f3/0xf70 [btrfs] [539604.295585] btrfs_update_device+0x71/0x1b0 [btrfs] [539604.296459] btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs] [539604.297489] btrfs_chunk_alloc+0x1bf/0x490 [btrfs] [539604.298335] find_free_extent+0x6fa/0x1750 [btrfs] [539604.299174] ? _raw_spin_unlock+0x29/0x50 [539604.299950] ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs] [539604.300918] btrfs_reserve_extent+0x147/0x290 [btrfs] [539604.301797] btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs] [539604.303017] ? lock_release+0x224/0x4a0 [539604.303855] __btrfs_cow_block+0x138/0x580 [btrfs] [539604.304789] btrfs_cow_block+0x10e/0x240 [btrfs] [539604.305611] btrfs_search_slot+0x7f3/0xf70 [btrfs] [539604.306682] ? btrfs_global_root+0x50/0x70 [btrfs] [539604.308198] lookup_inline_extent_backref+0x17b/0x7a0 [btrfs] [539604.309254] lookup_extent_backref+0x43/0xd0 [btrfs] [539604.310122] __btrfs_free_extent+0xf8/0x810 [btrfs] [539604.310874] ? lock_release+0x224/0x4a0 [539604.311724] ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs] [539604.313023] __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs] [539604.314271] btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs] [539604.315445] ? rcu_read_lock_sched_held+0x12/0x70 [539604.316706] btrfs_commit_transaction+0xa2/0xed0 [btrfs] [539604.317855] ? do_raw_spin_unlock+0x4b/0xa0 [539604.318544] ? _raw_spin_unlock+0x29/0x50 [539604.319240] create_subvol+0x53d/0x6e0 [btrfs] [539604.320283] btrfs_mksubvol+0x4f5/0x590 [btrfs] [539604.321220] __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs] [539604.322307] btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs] [539604.323295] btrfs_ioctl+0x9f7/0x33e0 [btrfs] [539604.324331] ? rcu_read_lock_sched_held+0x12/0x70 [539604.325137] ? lock_release+0x224/0x4a0 [539604.325808] ? __x64_sys_ioctl+0x87/0xc0 [539604.326467] __x64_sys_ioctl+0x87/0xc0 [539604.327109] do_syscall_64+0x38/0x90 [539604.327875] entry_SYSCALL_64_after_hwframe+0x72/0xdc [539604.328792] RIP: 0033:0x7f05a7babaeb This needs to use regular btrfs_search_slot() with some skip and stop logic. Since we only consider five samples (five search slots), don't bother with the complexity of looking for commit_root_sem contention. If necessary, it can be added to the load function in between samples. Reported-by: Filipe Manana <fdmanana@kernel.org> Link: https://lore.kernel.org/linux-btrfs/CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com/ Fixes: c7eec3d9aa95 ("btrfs: load block group size class when caching") Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
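A condensed sketch of the commit-root-safe sampling this fix moves to; sample_extent_item() is a hypothetical helper name and the body is simplified, but the key point is that btrfs_search_slot() honors skip_locking/search_commit_root, unlike btrfs_search_forward() which always takes tree locks:

    static int sample_extent_item(struct btrfs_root *extent_root,
                                  struct btrfs_block_group *bg,
                                  int index, int max_index,
                                  struct btrfs_key *found_key)
    {
            struct btrfs_path *path;
            struct btrfs_key key = {
                    .objectid = bg->start + div_u64(index * bg->length, max_index),
                    .type = BTRFS_EXTENT_ITEM_KEY,
                    .offset = 0,
            };
            int ret;

            path = btrfs_alloc_path();
            if (!path)
                    return -ENOMEM;
            path->skip_locking = 1;         /* safe on the commit root */
            path->search_commit_root = 1;
            ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
            /* ... step forward to the first EXTENT_ITEM inside the BG,
             * fill *found_key, stop at the end of the block group ... */
            btrfs_free_path(path);
            return ret;
    }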
|
#
1eb82ef8 |
|
12-Dec-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove the bdev argument to btrfs_rmap_block The only user in the zoned remap code is gone now, so remove the argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb0922f2 |
|
15-Dec-2022 |
Boris Burkov <boris@bur.io> |
btrfs: don't use size classes for zoned file systems When a file system has ZNS devices, which are constrained by a maximum number of active block groups, not being able to use all the block groups for every allocation is not ideal and could cause us to loop a lot with mixed size allocations. In general, since zoned file systems don't write into gaps behind where block groups are writing, they are not susceptible to the same sort of fragmentation that size classes are designed to solve, so we can skip size classes for zoned file systems in general, even though there would probably be no harm for SMR devices. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7eec3d9 |
|
15-Dec-2022 |
Boris Burkov <boris@bur.io> |
btrfs: load block group size class when caching Since the size class is an artifact of an arbitrary anti fragmentation strategy, it doesn't really make sense to persist it. Furthermore, most of the size class logic assumes fresh block groups. That is of course not a reasonable assumption -- we will be upgrading kernels with existing filesystems whose block groups are not classified. To work around those issues, implement logic to compute the size class of the block groups as we cache them in. To perfectly assess the state of a block group, we would have to read the entire extent tree (since the free space cache mashes together contiguous extent items) which would be prohibitively expensive for larger file systems with more extents. We can do it relatively cheaply by implementing a simple heuristic of sampling a handful of extents and picking the smallest one we see. In the happy case where the block group was classified, we will only see extents of the correct size. In the unhappy case, we will hopefully find one of the smaller extents, but there is no perfect answer anyway. Autorelocation will eventually churn up the block group if there is significant freeing anyway. There was no regression in mount performance at end state of the fsperf test suite, and the delay until the block group is marked cached is minimized by the constant number of extent samples. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
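A sketch of the sampling heuristic, reusing the hypothetical sample_extent_item() helper from the deadlock-fix sketch earlier in this log and the calc_size_class() sketched under the next entry; constant work per block group regardless of extent count:

    /* Sample a handful of extents spread across the block group and
     * classify by the smallest one seen. */
    static void load_size_class_sketch(struct btrfs_root *extent_root,
                                       struct btrfs_block_group *bg)
    {
            u64 min_size = U64_MAX;
            int i;

            for (i = 0; i < 5; i++) {
                    struct btrfs_key key;

                    if (sample_extent_item(extent_root, bg, i, 5, &key) != 0)
                            continue;
                    /* For EXTENT_ITEM keys the offset is the extent length. */
                    min_size = min_t(u64, min_size, key.offset);
            }
            if (min_size < U64_MAX)
                    bg->size_class = calc_size_class(min_size);
    }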
|
#
52bb7a21 |
|
15-Dec-2022 |
Boris Burkov <boris@bur.io> |
btrfs: introduce size class to block group allocator The aim of this patch is to reduce the fragmentation of block groups under certain unhappy workloads. It is particularly effective when the size of extents correlates with their lifetime, which is something we have observed causing fragmentation in the fleet at Meta. This patch categorizes extents into size classes: - x < 128KiB: "small" - 128KiB < x < 8MiB: "medium" - x > 8MiB: "large" and as much as possible reduces allocations of extents into block groups that don't match the size class. This takes advantage of any (possible) correlation between size and lifetime and also leaves behind predictable re-usable gaps when extents are freed; small writes don't gum up bigger holes. Size classes are implemented in the following way: - Mark each new block group with a size class of the first allocation that goes into it. - Add two new passes to ffe: "unset size class" and "wrong size class". First, try only matching block groups, then try unset ones, then allow allocation of new ones, and finally allow mismatched block groups. - Filtering is done just by skipping inappropriate ones, there is no special size class indexing. Other solutions I considered were: - A best fit allocator with an rb-tree. This worked well, as small writes didn't leak big holes from large freed extents, but led to regressions in ffe and write performance due to lock contention on the rb-tree with every allocation possibly updating it in parallel. Perhaps something clever could be done to do the updates in the background while being "right enough". - A fixed size "working set". This prevents freeing an extent drastically changing where writes currently land, and seems like a good option too. Doesn't take advantage of size in any way. - The same size class idea, but implemented with xarray marks. This turned out to be slower than looping the linked list and skipping wrong block groups, and is also less flexible since we must have only 3 size classes (max #marks). With the current approach we can have as many as we like. Performance testing was done via: https://github.com/josefbacik/fsperf Of particular relevance are the new fragmentation specific tests. A brief summary of the testing results: - Neutral results on existing tests. There are some minor regressions and improvements here and there, but nothing that truly stands out as notable. - Improvement on new tests where size class and extent lifetime are correlated. Fragmentation in these cases is completely eliminated and write performance is generally a little better. There is also significant improvement where extent sizes are just a bit larger than the size class boundaries. - Regression on one new test, where the allocations are sized intentionally a hair under the borders of the size classes. Results are neutral on the test that intentionally attacks this new scheme by mixing extent size and lifetime. The full dump of the performance results can be found here: https://bur.io/fsperf/size-class-2022-11-15.txt (there are ANSI escape codes, so best to curl and view in terminal) Here is a snippet from the full results for a new test which mixes buffered writes appending to a long lived set of files and large short lived fallocates:

bufferedappendvsfallocate results
 metric                    baseline      current        stdev      diff
======================================================================================
avg_commit_ms                 31.13        29.20         2.67    -6.22%
bg_count                         14        15.60            0    11.43%
commits                       11.10        12.20         0.32     9.91%
elapsed                       27.30        26.40         2.98    -3.30%
end_state_mount_ns      11122551.90  10635118.90    851143.04    -4.38%
end_state_umount_ns        1.36e+09     1.35e+09  12248056.65    -1.07%
find_free_extent_calls    116244.30    114354.30       964.56    -1.63%
find_free_extent_ns_max   599507.20   1047168.20    103337.08    74.67%
find_free_extent_ns_mean    3607.19      3672.11       101.20     1.80%
find_free_extent_ns_min         500          512         6.67     2.40%
find_free_extent_ns_p50        2848         2876        37.65     0.98%
find_free_extent_ns_p95        4916         5000        75.45     1.71%
find_free_extent_ns_p99    20734.49     20920.48      1670.93     0.90%
frag_pct_max                  61.67            0         8.05  -100.00%
frag_pct_mean                 43.59            0         6.10  -100.00%
frag_pct_min                  25.91            0        16.60  -100.00%
frag_pct_p50                  42.53            0         7.25  -100.00%
frag_pct_p95                  61.67            0         8.05  -100.00%
frag_pct_p99                  61.67            0         8.05  -100.00%
fragmented_bg_count            6.10            0         1.45  -100.00%
max_commit_ms                 49.80           46         5.37    -7.63%
sys_cpu                        2.59         2.62         0.29     1.39%
write_bw_bytes             1.62e+08     1.68e+08  17975843.50     3.23%
write_clat_ns_mean         57426.39     54475.95      2292.72    -5.14%
write_clat_ns_p50          46950.40     42905.60      2101.35    -8.62%
write_clat_ns_p99         148070.40    143769.60      2115.17    -2.90%
write_io_kbytes             4194304      4194304            0     0.00%
write_iops                  2476.15      2556.10       274.29     3.23%
write_lat_ns_max         2101667.60   2251129.50    370556.59     7.11%
write_lat_ns_mean          59374.91     55682.00      2523.09    -6.22%
write_lat_ns_min           17353.10        16250      1646.08    -6.36%

There are some mixed improvements/regressions in most metrics along with an elimination of fragmentation in this workload. On the balance, the drastic 1->0 improvement in the happy cases seems worth the mix of regressions and improvements we do observe. Some considerations for future work: - Experimenting with more size classes - More hinting/search ordering work to approximate a best-fit allocator Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
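The thresholds above translate directly into a classification helper; a minimal sketch, where the enum values mirror the ones this series adds and the helper body is illustrative:

    enum btrfs_block_group_size_class {
            BTRFS_BG_SZ_NONE,
            BTRFS_BG_SZ_SMALL,      /* x < 128KiB         */
            BTRFS_BG_SZ_MEDIUM,     /* 128KiB <= x < 8MiB */
            BTRFS_BG_SZ_LARGE,      /* x >= 8MiB          */
    };

    static enum btrfs_block_group_size_class calc_size_class(u64 size)
    {
            if (size < SZ_128K)
                    return BTRFS_BG_SZ_SMALL;
            if (size < SZ_8M)
                    return BTRFS_BG_SZ_MEDIUM;
            return BTRFS_BG_SZ_LARGE;
    }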
|
#
efbf35a1 |
|
16-Dec-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix uninitialized variable warning in btrfs_update_block_group reclaim isn't set in the alloc case, but we only care about reclaim in the !alloc case. This isn't an actual problem; however, -Wmaybe-uninitialized will complain, so initialize reclaim to quiet the compiler. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d7764ff |
|
31-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: convert btrfs_block_group::needs_free_space to runtime flag We already have flags in the block group to track various status bits; convert needs_free_space as well and reduce the size of btrfs_block_group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
428c8e03 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: simplify percent calculation helpers, rename div_factor The div_factor* helpers calculate a fraction or a percentage. The name is a bit confusing: we use them only for percentage calculations and there are two helpers. There's a helper mult_frac for general fractions that tries to be accurate, but we multiply and divide by small numbers, so we can use the div_u64 helper. Rename the div_factor* helpers and use a 1..100 percentage range, and also drop the case checking for percentage == 100, as it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is a direct replacement Signed-off-by: David Sterba <dsterba@suse.com>
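A sketch of the renamed helper and a converted call site; mult_perc matches the naming described here, the call site is illustrative:

    /* Percentages in the 1..100 range; the constants are small enough
     * that div_u64() (32-bit divisor) is sufficient. */
    static inline u64 mult_perc(u64 num, u32 percent)
    {
            return div_u64(num * percent, 100);
    }

    /* Tenths-based call sites adjust their constants, e.g.: */
    /*   before: div_factor(cache->length, 9)   */
    /*   after:  mult_perc(cache->length, 90)   */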
|
#
43dd529a |
|
27-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0231804 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move extent-tree helpers into their own header file Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07e81dc9 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7f13d42 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7248e0ce |
|
09-Sep-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: skip update of block group item if used bytes are the same [BACKGROUND] When committing a transaction, we will update block group items for all dirty block groups. But in fact, dirty block groups don't always need their block group items updated. It's pretty common to have a metadata block group which experienced several COW operations but still has the same amount of used bytes. In that case, we may unnecessarily COW a tree block doing nothing. [ENHANCEMENT] This patch introduces the btrfs_block_group::commit_used member to remember the last committed used bytes, and uses that new member to skip unnecessary block group item updates. This would be more common for large filesystems, where a metadata block group can be as large as 1GiB, containing at most 64K metadata items. In that case, if COW added and then deleted one metadata item near the end of the block group, it's completely possible we don't need to touch the block group item at all. [BENCHMARK] The change itself can have quite a high chance (20~80%) of skipping block group item updates in a lot of workloads. As a result, it leads to shorter time spent in btrfs_write_dirty_block_groups(), and overall reduces the execution time of the critical section of btrfs_commit_transaction(). Here is a fio command which does random writes in 4K block size, causing very heavy metadata updates:

fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
    --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
    --name=random_write --fallocate=none --time_based --fsync_on_close=1

The file size (512M) and number of threads (4) mean 2GiB of file size in total, but during the full 300s run time, my dedicated SATA SSD is able to write around 20~25GiB, which is over 10 times the file size. Thus after we fill the initial 2G, we should not cause many block group item updates. Please note, the fio numbers by themselves don't change much, but if we look deeper, there is some reduced execution time, especially in the critical section of btrfs_commit_transaction(). I added extra trace_printk() calls to measure the following per-transaction execution times: - Critical section of btrfs_commit_transaction(), by re-using the existing update_commit_stats() function, which has already calculated the interval correctly. - The while() loop of btrfs_write_dirty_block_groups(). Although this includes the execution time of btrfs_run_delayed_refs(), it should still be representative overall. Both results involve transid 7~30, the same number of committed transactions. The result looks like this:

                      |      Before       |     After      |  Diff
----------------------+-------------------+----------------+--------
Transaction interval  |    229247198.5    |  215016933.6   | -6.2%
Block group interval  |    23133.33333    |  18970.83333   | -18.0%

The change in block group item updates is more obvious, as skipped block group item updates also mean fewer delayed refs. And the overall execution time of that block group update loop is pretty small, thus we can assume the extent tree is already mostly cached. If we could skip an uncached tree block, it would cause a more obvious change. Unfortunately the overall reduction in the commit transaction critical section is much smaller, as the block group item update loop is not really the major part, at least not for the above fio script. But we still have an observable reduction in the critical section. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
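A minimal sketch of the skip logic in update_block_group_item(), with error handling simplified; note the corruption fix earlier in this log also makes the insert path keep commit_used in sync:

    spin_lock(&cache->lock);
    used = cache->used;
    if (used == cache->commit_used) {
            /*
             * Same used bytes as the last commit: nothing to write,
             * so skip the tree search and a potential COW of the
             * extent tree block.
             */
            spin_unlock(&cache->lock);
            return 0;
    }
    cache->commit_used = used;
    spin_unlock(&cache->lock);
    /* ... search for and update the on-disk block group item ... */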
|
#
81531225 |
|
13-Oct-2022 |
Boris Burkov <boris@bur.io> |
btrfs: re-check reclaim condition in reclaim worker I have observed the following case play out and lead to unnecessary relocations: 1. write a file across multiple block groups 2. delete the file 3. several block groups fall below the reclaim threshold 4. reclaim the first, moving extents into the others 5. reclaim the others which are now actually very full, leading to poor reclaim behavior with lots of writing, allocating new block groups, etc. I believe the risk of missing some reasonable reclaims is worth it when traded off against the savings of avoiding overfull reclaims. Going forward, it could be interesting to make the check more advanced (zoned aware, fragmentation aware, etc...) so that it can be a really strong signal both at extent delete and reclaim time. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc4804bf |
|
13-Oct-2022 |
Boris Burkov <boris@bur.io> |
btrfs: skip reclaim if block_group is empty As we delete extents from a block group, at some deletion we cross below the reclaim threshold. It is possible we are still in the middle of deleting more extents and might soon hit 0. If the block group is empty by the time the reclaim worker runs, we will still relocate it. This works just fine, as relocating an empty block group ultimately results in properly deleting it. However, we have more direct ways of removing empty block groups in the cleaner thread. Those are either async discard or the unused_bgs list. In fact, when we decide whether to relocate a block group during extent deletion, we do check for emptiness and prefer the discard/unused_bgs mechanisms when possible. Not using relocation for this case reduces some modest overhead from empty bg relocation: - extra transactions - extra metadata use/churn for creating relocation metadata - trying to read the extent tree to look for extents (and in this case finding none) Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
06d61cb1 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_should_fragment_free_space into block-group.c This function uses functions that are not defined in block-group.h, move it into block-group.c in order to keep the header clean. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
295a53cc |
|
11-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: delete stale comments after merge conflict resolution There are two comments in btrfs_cache_block_group that I left when resolving a conflict between commits ced8ecf026fd8 "btrfs: fix space cache corruption and potential double allocations" and 527c490f44f6f "btrfs: delete btrfs_wait_space_cache_v1_finished". The former reworked the caching logic to wait until the caching ends in btrfs_cache_block_group, while the latter only open coded the waiting. Both removed btrfs_wait_space_cache_v1_finished; the correct code is the one that waits and returns the error. Thus the conflict resolution was OK. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1daedb1d |
|
12-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add the ability to use NO_FLUSH for data reservations In order to accommodate NOWAIT IOCBs we need to be able to do NO_FLUSH data reservations, so plumb this through the delalloc reservation system. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c29abab4 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_full_stripe_locks_tree into block-group.h This is actually embedded in struct btrfs_block_group, so move this definition to block-group.h, and then open-code the init of the tree where we init the rest of the block group instead of using a helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
81d5d614 |
|
08-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: enhance unsupported compat RO flags handling Currently there are two corner cases where compat RO flags are not handled correctly: - Remount. We can still mount a fs with unsupported compat RO flags read-only, then remount it RW. We should not allow any writes into a fs with unsupported RO flags. - We still try to search block group items. In fact, a behavior/on-disk format change to the extent tree should not need a full incompat flag. And since we can ensure that a fs with unsupported RO flags never gets any writes (with the above case fixed), we can even skip the block group item search at mount time. This patch enhances the handling of unsupported RO compat flags by: - Rejecting a read-write remount if there are unsupported RO compat flags - Going directly to dummy block group items for unsupported RO compat flags In fact, only changes to the chunk/subvolume/root/csum trees should need incompat flags. The latter part should allow future changes to the extent tree to use compat RO flags. Thus this patch also needs to be backported to all stable trees. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc80f7ac |
|
08-Aug-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_remove_free_space_cache instead of variant We are calling __btrfs_remove_free_space_cache everywhere to clean up the block group free space, however we can just use btrfs_remove_free_space_cache and pass in the block group in all of these places. Then we can remove __btrfs_remove_free_space_cache and rename __btrfs_remove_free_space_cache_locked to __btrfs_remove_free_space_cache. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48ff7083 |
|
16-Aug-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of block group caching progress logic struct btrfs_caching_ctl::progress and struct btrfs_block_group::last_byte_to_unpin were previously needed to ensure that unpin_extent_range() didn't return a range to the free space cache before the caching thread had a chance to cache that range. However, the commit "btrfs: fix space cache corruption and potential double allocations" made it so that we always synchronously cache the block group at the time that we pin the extent, so this machinery is no longer necessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
527c490f |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: delete btrfs_wait_space_cache_v1_finished We used to use this in a few spots, but now we only use it directly inside of block-group.c, so remove the helper and just open code where we were using it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b9c293b |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove BLOCK_GROUP_FLAG_HAS_CACHING_CTL This is used mostly to determine if we need to look at the caching ctl list and clean up any references to this block group. However we never clear this flag, specifically because we still need to know if we have to remove a caching ctl for this block group. This is in the remove block group path, which isn't a fast path, so the optimization doesn't really matter; simplify this logic and remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50c31eaa |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: simplify block group traversal in btrfs_put_block_group_cache We're breaking out and re-searching for the next block group while evicting any of the block group cache inodes. This is not needed, the block groups aren't disappearing here, we can simply loop through the block groups like normal and iput any inode that we find. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3349b57f |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: convert block group bit field to use bit helpers We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and it makes the code clunky for a few of these flags. Convert these to a proper flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
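A sketch of the before/after shape of this conversion; the flag names follow this series but the exact set is illustrative:

    /* Before: bitfields, every change needs block_group->lock. */
    struct btrfs_block_group {
            /* ... */
            unsigned int removed:1;
            unsigned int to_copy:1;
    };

    /* After: one flags word plus atomic bit helpers, no lock needed
     * for individual flag flips. */
    enum btrfs_block_group_flags {
            BLOCK_GROUP_FLAG_REMOVED,
            BLOCK_GROUP_FLAG_TO_COPY,
            /* ... */
    };

    set_bit(BLOCK_GROUP_FLAG_REMOVED, &block_group->runtime_flags);
    if (test_bit(BLOCK_GROUP_FLAG_TO_COPY, &block_group->runtime_flags)) {
            /* ... */
    }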
|
#
723de71d |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info We previously had the pattern of

  btrfs_update_space_info(all, the, bg, fields, &space_info);
  link_block_group(bg);
  bg->space_info = space_info;

Now that we're passing the bg into btrfs_add_bg_to_space_info we can do the linking in that function, transforming this to simply

  btrfs_add_bg_to_space_info(fs_info, bg);

and put the link_block_group() and bg->space_info assignment directly in btrfs_add_bg_to_space_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9d4b0a12 |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: simplify arguments of btrfs_update_space_info and rename This function has grown a bunch of new arguments, and it just boils down to passing in all the block group fields as arguments. Simplify this by passing in the block group itself and updating the space_info fields based on the block group fields directly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f12741f |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_fs_closing for background bg work For both unused bg deletion and async balance work we'll happily run if the fs is closing. However I want to move these to their own worker thread, and they can be long running jobs, so add a check to see if we're closing and simply bail. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ced8ecf0 |
|
23-Aug-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: fix space cache corruption and potential double allocations When testing space_cache v2 on a large set of machines, we encountered a few symptoms: 1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree. All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time. Specifically, the race is as follows: 1. An extent is deleted from an uncached block group in transaction A. 2. btrfs_commit_transaction() is called for transaction A. 3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent. 4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree. 5. do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start. 6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet. 7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX. 8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED. 9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances. 10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX. 11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again! This explains all of our symptoms above: * If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST. 
* If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated *again*, the free space tree will be corrupted (namely, the wrong item will be deleted). * If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated. The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug. The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions. A few extra changes fall out of this change. Namely, this fix does the following, with step 2 being the crucial fix: 1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to. 2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option. 3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done(). 4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning. 5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead. 6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore. Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
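The crucial part of the fix (step 2 in the list above), sketched with the setup elided; btrfs_caching_ctl_wait_done() and the bool wait parameter are the ones this change describes, the surrounding body is simplified:

    int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
    {
            struct btrfs_caching_ctl *caching_ctl;
            int ret = 0;

            /* ... allocate caching_ctl and queue the caching work ... */

            /*
             * Wait for caching to finish regardless of the space_cache
             * option, so unpinning can never race with the caching
             * thread over the same range.
             */
            if (wait)
                    ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);

            btrfs_put_caching_control(caching_ctl);
            return ret;
    }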
|
#
74944c87 |
|
25-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: reset RO counter on block group if we fail to relocate With the automatic block group reclaim code we will preemptively try to mark the block group RO before we start the relocation. We do this to make sure we should actually try to relocate the block group. However if we hit an error during the actual relocation we won't clean up our RO counter and the block group will remain RO. This was observed internally with file systems reporting less space available from df when we had failed background relocations. Fix this by doing the dec_ro in the error case. Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
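A minimal sketch of the error-path fix in the reclaim worker; the error message text is illustrative:

    ret = btrfs_relocate_chunk(fs_info, bg->start);
    if (ret) {
            /*
             * The BG was marked read-only before relocating; undo that
             * on failure so its space isn't permanently unavailable.
             */
            btrfs_dec_block_group_ro(bg);
            btrfs_err(fs_info, "error relocating chunk %llu", bg->start);
    }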
|
#
b6a98021 |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: activate necessary block group There are two places where allocating a chunk is not enough: both of them try to ensure free space by allocating a chunk, but to meet the condition on active_total_bytes we also need to activate a block group there. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a921de5 |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: introduce space_info->active_total_bytes The active_total_bytes, like the total_bytes, accounts for the total bytes of active block groups in the space_info. With the introduction of active_total_bytes, we can check if the reserved bytes can be written to the block groups without activating a new block group. The check is necessary for metadata allocation on a zoned filesystem. We cannot finish a block group, which may require waiting for the current transaction, from the metadata allocation context. Instead, we need to ensure the ongoing allocation (reserved bytes) fits in active block groups. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac067734 |
|
23-Jun-2022 |
David Sterba <dsterba@suse.com> |
btrfs: merge calculations for simple striped profiles in btrfs_rmap_block Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1) and RAID10 (map->sub_stripes is 2), with equivalent results. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1314ca78 |
|
13-Jun-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: reset block group chunk force if we have to wait If you try to force a chunk allocation but race with another chunk allocation, you will end up waiting on the chunk allocation that just occurred and then allocate another chunk. If you have many threads all doing this at once you can way over-allocate chunks. Fix this by resetting force to NO_FORCE; that way, if we think we need to allocate we can, and otherwise we don't force another chunk allocation if one is already happening. Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
74e91b12 |
|
03-May-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: zone finish unused block group While the active zones within an active block group are reset, and their active resource is released, the block group itself is kept in the active block group list and marked as active. As a result, the list will contain more than max_active_zones block groups. That itself is not fatal for the device as the zones are properly reset. However, that inflated list is, of course, strange. Also, an upcoming patch series, which deactivates an active block group on demand, would get confused by the wrong list. So, fix the issue by finishing the unused block group once it gets read-only, so that we can release the active resource at an early stage. Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2306e83e |
|
13-Apr-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid double search for block group during NOCOW writes When doing a NOCOW write, either through direct IO or buffered IO, we do two lookups for the block group that contains the target extent: once when we call btrfs_inc_nocow_writers() and then later again when we call btrfs_dec_nocow_writers() after creating the ordered extent. The lookups require taking a lock and navigating the red black tree used to track all block groups, which can take a non-negligible amount of time for a large filesystem with thousands of block groups, as well as lock contention and cache line bouncing. Improve on this by having a single block group search: making btrfs_inc_nocow_writers() return the block group to its caller and then have the caller pass that block group to btrfs_dec_nocow_writers(). This is part of a patchset comprised of the following patches: btrfs: remove search start argument from first_logical_byte() btrfs: use rbtree with leftmost node cached for tracking lowest block group btrfs: use a read/write lock for protecting the block groups tree btrfs: return block group directly at btrfs_next_block_group() btrfs: avoid double search for block group during NOCOW writes The following test was used to test these changes from a performance perspective: $ cat test.sh #!/bin/bash modprobe null_blk nr_devices=0 NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0 mkdir $NULL_DEV_PATH if [ $? -ne 0 ]; then echo "Failed to create nullb0 directory." exit 1 fi echo 2 > $NULL_DEV_PATH/submit_queues echo 16384 > $NULL_DEV_PATH/size # 16G echo 1 > $NULL_DEV_PATH/memory_backed echo 1 > $NULL_DEV_PATH/power DEV=/dev/nullb0 MNT=/mnt/nullb0 LOOP_MNT="$MNT/loop" MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [io_uring_writes] rw=randwrite fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bs=64k filesize=1g runtime=300 time_based directory=$LOOP_MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT mkdir $LOOP_MNT truncate -s 4T $MNT/loopfile mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT # Trigger the allocation of about 3500 data block groups, without # actually consuming space on underlying filesystem, just to make # the tree of block group large. fallocate -l 3500G $LOOP_MNT/filler fio /tmp/fio-job.ini umount $LOOP_MNT umount $MNT echo 0 > $NULL_DEV_PATH/power rmdir $NULL_DEV_PATH The test was run on a non-debug kernel (Debian's default kernel config), the result were the following. Before patchset: WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec After patchset: WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec +3.3% write throughput and +3.3% IO done in the same time period. The test has somewhat limited coverage scope, as with only NOCOW writes we get less contention on the red black tree of block groups, since we don't have the extra contention caused by COW writes, namely when allocating data extents, pinning and unpinning data extents, but on the hand there's access to tree in the NOCOW path, when incrementing a block group's number of NOCOW writers. 
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
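A sketch of the API change in the NOCOW path; the signatures follow the description above, with the surrounding code simplified:

    /* Before: two lookups in the block group tree per NOCOW write. */
    btrfs_inc_nocow_writers(fs_info, disk_bytenr);   /* lookup #1 */
    /* ... create the ordered extent ... */
    btrfs_dec_nocow_writers(fs_info, disk_bytenr);   /* lookup #2 */

    /* After: a single lookup; the block group is handed to the caller. */
    struct btrfs_block_group *bg;

    bg = btrfs_inc_nocow_writers(fs_info, disk_bytenr);
    if (bg) {
            /* ... create the ordered extent ... */
            btrfs_dec_nocow_writers(bg);
    }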
|
#
8b01f931 |
|
13-Apr-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: return block group directly at btrfs_next_block_group() At btrfs_next_block_group(), we have this long line with two statements: cache = btrfs_lookup_first_block_group(...); return cache; This makes it a bit harder to read due to two statements on the same line, so change that to directly return the result of the call to btrfs_lookup_first_block_group(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16b0c258 |
|
13-Apr-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use a read/write lock for protecting the block groups tree Currently we use a spin lock to protect the red black tree that we use to track block groups. Most accesses to that tree are actually read only and for large filesystems, with thousands of block groups, it actually has a bad impact on performance, as concurrent read only searches on the tree are serialized. Read only searches on the tree are very frequent and done when: 1) Pinning and unpinning extents, as we need to lookup the respective block group from the tree; 2) Freeing the last reference of a tree block, regardless if we pin the underlying extent or add it back to free space cache/tree; 3) During NOCOW writes, both buffered IO and direct IO, we need to check if the block group that contains an extent is read only or not and to increment the number of NOCOW writers in the block group. For those operations we need to search for the block group in the tree. Similarly, after creating the ordered extent for the NOCOW write, we need to decrement the number of NOCOW writers from the same block group, which requires searching for it in the tree; 4) Decreasing the number of extent reservations in a block group; 5) When allocating extents and freeing reserved extents; 6) Adding and removing free space to the free space tree; 7) When releasing delalloc bytes during ordered extent completion; 8) When relocating a block group; 9) During fitrim, to iterate over the block groups; 10) etc; Write accesses to the tree, to add or remove block groups, are much less frequent as they happen only when allocating a new block group or when deleting a block group. We also use the same spin lock to protect the list of currently caching block groups. Additions to this list are made when we need to cache a block group, because we don't have a free space cache for it (or we have but it's invalid), and removals from this list are done when caching of the block group's free space finishes. These cases are also not very common, but when they happen, they happen only once when the filesystem is mounted. So switch the lock that protects the tree of block groups from a spinning lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
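A sketch of the lock conversion; the lock and tree field names follow this series, and the lookup body is elided:

    rwlock_t block_group_cache_lock;        /* was: spinlock_t */

    /* Read-mostly searches can now run concurrently: */
    read_lock(&fs_info->block_group_cache_lock);
    /* ... binary search of block_group_cache_tree for 'bytenr' ... */
    read_unlock(&fs_info->block_group_cache_lock);

    /* Rare block group add/remove takes the exclusive side: */
    write_lock(&fs_info->block_group_cache_lock);
    rb_erase_cached(&bg->cache_node, &fs_info->block_group_cache_tree);
    write_unlock(&fs_info->block_group_cache_lock);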
|
#
08dddb29 |
|
13-Apr-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use rbtree with leftmost node cached for tracking lowest block group We keep track of the start offset of the block group with the lowest start offset at fs_info->first_logical_byte. This requires explicitly updating that field every time we add, delete or lookup a block group to/from the red black tree at fs_info->block_group_cache_tree. Since the block group with the lowest start address happens to always be the one that is the leftmost node of the tree, we can use a red black tree that caches the left most node. Then when we need the start address of that block group, we can just quickly get the leftmost node in the tree and extract the start offset of that node's block group. This avoids the need to explicitly keep track of that address in the dedicated member fs_info->first_logical_byte, and it also allows the next patch in the series to switch the lock that protects the red black tree from a spin lock to a read/write lock - without this change it would be tricky because block group searches also update fs_info->first_logical_byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
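A sketch of the pattern with the kernel's cached rbtree API (locking omitted); the leftmost node is maintained by the insert/erase helpers and read back in O(1):

  #include <linux/rbtree.h>

  /* The tree type changes from rb_root to rb_root_cached. */
  struct rb_root_cached block_group_cache_tree = RB_ROOT_CACHED;

  static u64 first_logical_byte(struct btrfs_fs_info *fs_info)
  {
          struct rb_node *node;

          /* Cached leftmost node == block group with the lowest start offset. */
          node = rb_first_cached(&fs_info->block_group_cache_tree);
          if (!node)
                  return 0;
          return rb_entry(node, struct btrfs_block_group, cache_node)->start;
  }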
|
#
3687fcb0 |
|
29-Mar-2022 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: make auto-reclaim less aggressive The current auto-reclaim algorithm starts reclaiming all block groups with a zone_unusable value above a configured threshold. This causes a lot of reclaim IO even if there would be enough free zones on the device. Instead of only looking at a block group's zone_unusable value, also take into account the ratio of free to not-usable (written as well as zone_unusable) bytes the device has. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac2f1e63 |
|
29-Mar-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: allow block group background reclaim for non-zoned filesystems This will allow us to set a threshold for block groups to be automatically relocated even if we don't have zoned devices. We have found this feature invaluable at Facebook due to how our workload interacts with the allocator. We have been using this in production for months with only a single problem that has already been fixed. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
36dfbbe2 |
|
09-Mar-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: use btrfs_for_each_slot in find_first_block_group This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
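The iterator macro hides the btrfs_search_slot()/next-item boilerplate. A usage sketch under the assumption that the macro takes the root, a start key, an output key, a path and an int for the iteration result (key values illustrative):

  struct btrfs_path *path = btrfs_alloc_path();
  struct btrfs_key key = { 0 };   /* start iterating from the beginning */
  struct btrfs_key found_key;
  int ret;

  if (!path)
          return -ENOMEM;

  btrfs_for_each_slot(root, &key, &found_key, path, ret) {
          if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
                  /* process the item at path->nodes[0], path->slots[0] */
          }
  }
  /* A negative ret here means the underlying tree search failed. */
  btrfs_free_path(path);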
|
#
760e69c4 |
|
22-Mar-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: activate block group only for extent allocation In btrfs_make_block_group(), we activate the allocated block group, expecting that the block group is soon used for allocation. However, the chunk allocation from the flush_space() context broke that assumption. There can be a large time gap between the chunk allocation time and the time of extent allocation from the chunk. Activating the empty block groups pre-allocated from the flush_space() context can exhaust the active zone counter of a device. Once we use all the active zone counts for empty pre-allocated block groups, we cannot activate a new block group for anything else: metadata, tree-log, or data relocation block groups. That failure results in a fake -ENOSPC. This patch introduces CHUNK_ALLOC_FORCE_FOR_EXTENT to distinguish the chunk allocation from find_free_extent(). Now, a new block group is activated only in that context. Fixes: eb66a010d518 ("btrfs: zoned: activate new block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
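A sketch of the distinction: the extent allocator passes the new enum value, and only then is the fresh block group's zone activated. The enum member names follow the commit message; the helper wrapping the check is hypothetical:

  enum btrfs_chunk_alloc_enum {
          CHUNK_ALLOC_NO_FORCE,
          CHUNK_ALLOC_LIMITED,
          CHUNK_ALLOC_FORCE,
          /* Like FORCE, but an extent allocation needs the chunk right now. */
          CHUNK_ALLOC_FORCE_FOR_EXTENT,
  };

  static void maybe_activate_new_bg(struct btrfs_block_group *bg,
                                    enum btrfs_chunk_alloc_enum force)
  {
          /* Don't burn active zones on idle, pre-allocated chunks. */
          if (force == CHUNK_ALLOC_FORCE_FOR_EXTENT)
                  btrfs_zone_activate(bg);
  }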
|
#
820c363b |
|
22-Mar-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: return allocated block group from do_chunk_alloc() Return the allocated block group from do_chunk_alloc(). This is a preparation patch for the next patch. CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d4a6b51 |
|
24-Mar-2022 |
Nathan Chancellor <nathan@kernel.org> |
btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups() Clang's version of -Wunused-but-set-variable recently gained support for unary operations, which reveals two unused variables: fs/btrfs/block-group.c:2949:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable] int num_started = 0; ^ fs/btrfs/block-group.c:3116:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable] int num_started = 0; ^ 2 errors generated. These variables appear to be unused from their introduction, so just remove them to silence the warnings. Fixes: c9dc4c657850 ("Btrfs: two stage dirty block group writeout") Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") CC: stable@vger.kernel.org # 5.4+ Link: https://github.com/ClangBuiltLinux/linux/issues/1614 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ca5e4ea0 |
|
17-Feb-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: mark relocation as writing There is a hung_task issue when running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extents never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk(), which is called from btrfs_reclaim_bgs_work(). This commit adds sb_*_write() around the btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already do that via mnt_want_write_file() in btrfs_ioctl_balance(). Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13+ Link: https://github.com/naota/linux/issues/56 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
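A minimal sketch of the fix as described: bracket the relocation with the superblock write counters so that freezing the filesystem waits for it, and a frozen filesystem blocks new relocations (error handling omitted):

  sb_start_write(fs_info->sb);    /* blocks if the fs is being frozen */
  ret = btrfs_relocate_chunk(fs_info, bg->start);
  sb_end_write(fs_info->sb);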
|
#
f7238e50 |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add support for multiple global roots With extent tree v2 you will be able to create multiple csum, extent, and free space trees. They will be used based on the block group, which will now use the block_group_item->chunk_objectid to point to the set of global roots that it will use. When allocating new block groups we'll simply mod the gigabyte offset of the block group against the number of global roots we have, and that will be the block group's global id. From there we can take the bytenr that we're modifying in the respective tree, look up the block group and get that block group's corresponding global root id. From there we can get to the appropriate global root for that bytenr. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
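A hypothetical illustration of the mapping described above; the function name and the nr_global_roots counter are assumptions for the sketch, not necessarily the kernel's exact code:

  /* Gigabyte offset of the block group, modulo the global root count. */
  static u64 bg_global_root_id(struct btrfs_fs_info *fs_info, u64 bg_start)
  {
          u64 rem;

          /* div64_u64_rem keeps the u64 arithmetic safe on 32-bit too. */
          div64_u64_rem(div64_u64(bg_start, SZ_1G),
                        fs_info->nr_global_roots, &rem);
          return rem;
  }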
|
#
40cdc509 |
|
18-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: skip reserved bytes warning on unmount after log cleanup failure After the recent changes made by commit c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it") and its followup fix, commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an extent buffer"), we can now end up not cleaning up space reservations of log tree extent buffers after a transaction abort happens, as well as not cleaning up still dirty extent buffers. This happens because if writeback for a log tree extent buffer failed, then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on, when trying to free the log tree with free_log_tree(), which iterates over the tree, we can end up getting an -EIO error when trying to read a node or a leaf, since read_extent_buffer_pages() returns -EIO if an extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return immediately as we can not iterate over the entire tree. In that case we never update the reserved space for an extent buffer in the respective block group and space_info object. When this happens we get the following traces when unmounting the fs: [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure [174957.399379] ------------[ cut here ]------------ [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs] [174957.407523] Modules linked in: btrfs overlay dm_zero (...) [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1 [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs] [174957.429717] Code: 21 48 8b bd (...) [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206 [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8 [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000 [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000 [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148 [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100 [174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000 [174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0 [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [174957.443948] Call Trace: [174957.444264] <TASK> [174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs] [174957.445238] close_ctree+0x301/0x357 [btrfs] [174957.445803] ? 
call_rcu+0x16c/0x290 [174957.446250] generic_shutdown_super+0x74/0x120 [174957.446832] kill_anon_super+0x14/0x30 [174957.447305] btrfs_kill_super+0x12/0x20 [btrfs] [174957.447890] deactivate_locked_super+0x31/0xa0 [174957.448440] cleanup_mnt+0x147/0x1c0 [174957.448888] task_work_run+0x5c/0xa0 [174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0 [174957.449934] syscall_exit_to_user_mode+0x16/0x40 [174957.450512] do_syscall_64+0x48/0xc0 [174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae [174957.451605] RIP: 0033:0x7f328fdc4a97 [174957.452059] Code: 03 0c 00 f7 (...) [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97 [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0 [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40 [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000 [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000 [174957.460193] </TASK> [174957.460534] irq event stamp: 0 [174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0 [174957.466323] ---[ end trace bc7ee0c490bce3af ]--- [174957.467282] ------------[ cut here ]------------ [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs] [174957.470066] Modules linked in: btrfs overlay dm_zero (...) [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1 [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs] [174957.488050] Code: 00 00 00 ad de (...) [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206 [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000 [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846 [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000 [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000 [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100 [174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000 [174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0 [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [174957.506167] Call Trace: [174957.506654] <TASK> [174957.507047] close_ctree+0x301/0x357 [btrfs] [174957.507867] ? 
call_rcu+0x16c/0x290 [174957.508567] generic_shutdown_super+0x74/0x120 [174957.509447] kill_anon_super+0x14/0x30 [174957.510194] btrfs_kill_super+0x12/0x20 [btrfs] [174957.511123] deactivate_locked_super+0x31/0xa0 [174957.511976] cleanup_mnt+0x147/0x1c0 [174957.512610] task_work_run+0x5c/0xa0 [174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0 [174957.514231] syscall_exit_to_user_mode+0x16/0x40 [174957.515069] do_syscall_64+0x48/0xc0 [174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae [174957.516688] RIP: 0033:0x7f328fdc4a97 [174957.517413] Code: 03 0c 00 f7 d8 (...) [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97 [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0 [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40 [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000 [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000 [174957.529404] </TASK> [174957.529843] irq event stamp: 0 [174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0 [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]--- [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0 [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0 [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0 [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0 [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0 [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0 This also means that in case we have log tree extent buffers that are still dirty, we can end up not cleaning them up in case we find an extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case we have no way of iterating over the rest of the tree. This issue is very often triggered with test cases generic/475 and generic/648 from fstests. The issue could almost be fixed by iterating over the io tree attached to each log root, which keeps track of the ranges of allocated extent buffers, log_root->dirty_log_pages, however that does not work and has some inconveniences: 1) After we sync the log, we clear the range of the extent buffers from the io tree, so we can't find them after writeback. We could keep the ranges in the io tree, with a separate bit to signal they represent extent buffers already written, but that means we need to hold on to more memory until the transaction commits. How much more memory is used depends a lot on whether we are able to allocate contiguous extent buffers on disk (and how often) for a log tree - if we are able to, then a single extent state record can represent multiple extent buffers, otherwise we need multiple extent state record structures to track each extent buffer.
In fact, my earlier approach did that: https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/ However, that can cause a very significant negative impact on performance, not only due to the extra memory usage but also because we get a larger and deeper dirty_log_pages io tree. We got a report that, on beefy machines at least, we can get such a performance drop with fsmark for example: https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/ 2) We would be doing it only to deal with an unexpected and exceptional case, which is basically failure to read an extent buffer from disk due to IO failures. On a healthy system we don't expect transaction aborts to happen after all; 3) Instead of relying on iterating the log tree or tracking the ranges of extent buffers in the dirty_log_pages io tree, using the radix tree that tracks extent buffers (fs_info->buffer_radix) to find all log tree extent buffers is not reliable either, because after writeback of an extent buffer it can be evicted from memory by the release page callback of the btree inode (btree_releasepage()). Since there's no way to properly clean up a log tree without being able to read its extent buffers from disk and without using more memory to track the logical ranges of the allocated extent buffers, do the following: 1) When we fail to clean up a log tree, set a flag that indicates that failure; 2) Trigger writeback of all log tree extent buffers that are still dirty, and wait for the writeback to complete. This is just to clean up their state, page states, page leaks, etc; 3) When unmounting the fs, ignore if the number of bytes reserved in a block group and in a space_info is not 0 if, and only if, we failed to clean up a log tree. Also ignore only for metadata block groups and the metadata space_info object. This is far from a perfect solution, but it serves to silence test failures such as those from generic/475 and generic/648. However, having a non-zero value for the reserved bytes counters on unmount after a transaction abort is not such a terrible thing and it's completely harmless, it does not affect the filesystem integrity in any way. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2d192fc4 |
|
16-Dec-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: don't start transaction for scrub if the fs is mounted read-only [BUG] The following super simple script would crash btrfs at unmount time, if CONFIG_BTRFS_ASSERT is set:

  mkfs.btrfs -f $dev
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  umount $mnt
  mount -o ro $dev $mnt
  btrfs scrub start -Br $mnt
  umount $mnt

This will trigger the following ASSERT() introduced by commit 0a31daa4b602 ("btrfs: add assertion for empty list of transactions at late stage of umount"). That patch is definitely not the cause, it just makes enough noise for developers. [CAUSE] We will start a transaction for the following call chain during scrub:

  scrub_enumerate_chunks()
  |- btrfs_inc_block_group_ro()
     |- btrfs_join_transaction()

However for a RO mount there is no running transaction at all, thus btrfs_join_transaction() will start a new transaction. Furthermore, since it's a read-only mount, btrfs_sync_fs() will not call btrfs_commit_super() to commit the new but empty transaction, leading to the ASSERT(). The bug has been there for a long time. Only the new ASSERT() makes it noisy enough to be noticed. [FIX] For read-only scrub on a read-only mount, there is no need to start a transaction nor to allocate new chunks in btrfs_inc_block_group_ro(). Just do an extra read-only mount check in btrfs_inc_block_group_ro(), and if it's read-only, skip all chunk allocation and go to inc_block_group_ro() directly. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d96b3424 |
|
21-Nov-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make send work with concurrent block group relocation We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (especially metadata) or about to read a metadata extent and expecting that it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495e0e0 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations: relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystems we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it, or send can end up delaying the block group relocation for too long. In the future we might also get automatic block group relocation for non-zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way:

1) For all tree searches, send acquires a read lock on the commit root semaphore;

2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock;

4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers.
The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove or update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send;

5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits.

This leaves some room for optimizations for send to have fewer path releases and to re-search the trees less often when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
29cbcf40 |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: stop accessing ->extent_root directly When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
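A sketch of the helper pattern; in later extent-tree-v2 kernels the helper resolves the root through the global root set keyed by bytenr, while at the time of this commit it may simply have returned the renamed field. Treat the body as illustrative:

  struct btrfs_root *btrfs_extent_root(struct btrfs_fs_info *fs_info, u64 bytenr)
  {
          struct btrfs_key key = {
                  .objectid = BTRFS_EXTENT_TREE_OBJECTID,
                  .type = BTRFS_ROOT_ITEM_KEY,
                  .offset = btrfs_global_root_id(fs_info, bytenr),
          };

          /* Callers no longer dereference fs_info->_extent_root directly. */
          return btrfs_global_root(fs_info, &key);
  }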
|
#
dfe8aec4 |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a btrfs_block_group_root() helper With extent tree v2 we will have a separate root to hold the block group items. Add a btrfs_block_group_root() that will return the appropriate root given the flags of the fs, and convert all functions that need to modify block group items to use the helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9270501c |
|
09-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: change root to fs_info for btrfs_reserve_metadata_bytes We used to need the root for btrfs_reserve_metadata_bytes to check the orphan cleanup state, but we no longer need that, we simply need the fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the same as they simply call btrfs_reserve_metadata_bytes() and then manipulate the block_rsv that is being used. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
17130a65 |
|
14-Oct-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove spurious unlock/lock of unused_bgs_lock Since both the unused block groups and reclaim bgs lists are protected by unused_bgs_lock, free them in the same critical section without doing an extra unlock/lock pair. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ecd84d54 |
|
13-Oct-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update comments for chunk allocation -ENOSPC cases Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that describe which cases can lead to a failure to allocate metadata and system space despite having previously reserved space. This adds one more reason that I previously forgot to mention. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2bb2e00e |
|
13-Oct-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock between chunk allocation and chunk btree modifications When a task is doing some modification to the chunk btree and it is not in the context of a chunk allocation or a chunk removal, it can deadlock with another task that is currently allocating a new data or metadata chunk. These contexts are the following: * When relocating a system chunk, when we need to COW the extent buffers that belong to the chunk btree; * When adding a new device (ioctl), where we need to add a new device item to the chunk btree; * When removing a device (ioctl), where we need to remove a device item from the chunk btree; * When resizing a device (ioctl), where we need to update a device item in the chunk btree and may need to relocate a system chunk that lies beyond the new device size when shrinking a device. The problem happens due to a sequence of steps like the following: 1) Task A starts a data or metadata chunk allocation and it locks the chunk mutex; 2) Task B is relocating a system chunk, and when it needs to COW an extent buffer of the chunk btree, it has locked both that extent buffer as well as its parent extent buffer; 3) Since there is not enough available system space, either because none of the existing system block groups have enough free space or because the only one with enough free space is in RO mode due to the relocation, task B triggers a new system chunk allocation. It blocks when trying to acquire the chunk mutex, currently held by task A; 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert the new chunk item into the chunk btree and update the existing device items there. But in order to do that, it has to lock the extent buffer that task B locked at step 2, or its parent extent buffer, but task B is waiting on the chunk mutex, which is currently locked by task A, therefore resulting in a deadlock. One example report when the deadlock happens with system chunk relocation: INFO: task kworker/u9:5:546 blocked for more than 143 seconds. Not tainted 5.15.0-rc3+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993 __down_read_common kernel/locking/rwsem.c:1214 [inline] __down_read kernel/locking/rwsem.c:1223 [inline] down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590 __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47 btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline] btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191 btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline] btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728 btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794 btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504 do_chunk_alloc fs/btrfs/block-group.c:3408 [inline] btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653 flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670 btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953 process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297 worker_thread+0x90/0xed0 kernel/workqueue.c:2444 kthread+0x3e5/0x4d0 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 INFO: task syz-executor:9107 blocked for more than 143 seconds. 
Not tainted 5.15.0-rc3+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425 __mutex_lock_common kernel/locking/mutex.c:669 [inline] __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729 btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631 find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline] find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335 btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415 btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813 __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415 btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570 btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768 relocate_tree_block fs/btrfs/relocation.c:2694 [inline] relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757 relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673 btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070 btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181 __btrfs_balance fs/btrfs/volumes.c:3911 [inline] btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301 btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137 btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae So fix this by making sure that whenever we try to modify the chunk btree and we are neither in a chunk allocation context nor in a chunk remove context, we reserve system space before modifying the chunk btree. Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/ Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array") CC: stable@vger.kernel.org # 5.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2ca0ec77 |
|
14-Oct-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: use greedy gc for auto reclaim Currently, auto reclaim of unusable zones reclaims the block-groups in the order they have been added to the reclaim list. Change this to a greedy algorithm by sorting the list so that the block-groups with the least amount of valid bytes are reclaimed first. Note: we can't splice the block groups from reclaim_bgs to let the sort happen outside of the lock. The block groups can still be in use by other parts, e.g. via bg_list, and we must hold unused_bgs_lock while processing them. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ write note and comment why we can't splice the list ] Signed-off-by: David Sterba <dsterba@suse.com>
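A sketch of the greedy ordering using the kernel's list_sort(); the comparator shape follows the commit message, with field names from struct btrfs_block_group:

  #include <linux/list_sort.h>

  /* Ascending by used bytes: the least valid data is reclaimed first. */
  static int reclaim_bgs_cmp(void *unused, const struct list_head *a,
                             const struct list_head *b)
  {
          const struct btrfs_block_group *bg1, *bg2;

          bg1 = list_entry(a, struct btrfs_block_group, bg_list);
          bg2 = list_entry(b, struct btrfs_block_group, bg_list);

          return bg1->used > bg2->used;
  }

  /* Sorted in place, while holding unused_bgs_lock (see the note above). */
  list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);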
|
#
11b66fa6 |
|
13-Oct-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: reduce btrfs_update_block_group alloc argument to bool btrfs_update_block_group() accounts for the number of bytes allocated or freed. Argument @alloc specifies whether the call is for alloc or free. Convert the argument @alloc type from int to bool. Reviewed-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2707a25 |
|
08-Sep-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: add a dedicated data relocation block group Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocate the data to have the same size as the source extents, and it ensures this by preallocating the extents. But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages. This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location():

  1034   if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
  1035           ret = -EINVAL;
  1036           goto out;
  1037   }

Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes. This is needed so that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation. Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eb66a010 |
|
19-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: activate new block group Activate a new block group at btrfs_make_block_group(). We do not check the return value. If it fails, we can try again later at the actual extent allocation phase. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
afba2bc0 |
|
19-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: implement active zone tracking Add a zone_is_active flag to btrfs_block_group. This flag indicates that the underlying zones are all active. Such zone-active block groups are tracked by fs_info->active_bg_list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part: they set/clear the bitmap to indicate zone activeness and count how many zones we can still activate. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() waits for any writes on the block group and sends REQ_OP_ZONE_FINISH to the zone. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dafc340d |
|
19-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: introduce physical_map to btrfs_block_group We will use a block group's physical location to track active zones and to finish fully written zones in the following commits. Since zone activation is done in the extent allocation context, which is already holding the tree locks, we can't query the chunk tree for the physical locations. So, copy the location info into the block group and use it for activation. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
98173255 |
|
19-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: calculate free space from zone capacity Now that we have introduced capacity in a block group, we need to calculate free space using the capacity instead of the length. Thus, we account capacity - alloc_pointer bytes as free, and the bytes in the range [capacity, length] as zone unusable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
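In code terms, assuming the field names used by the btrfs zoned support (where the allocation pointer is alloc_offset), the accounting sketch is:

  /* |<--- used --->|<----- free ----->|<-- zone unusable -->|
   * 0          alloc_offset      zone_capacity           length */
  u64 free = bg->zone_capacity - bg->alloc_offset;
  u64 zone_unusable = bg->length - bg->zone_capacity;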
|
#
c46c4247 |
|
19-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusable btrfs_free_excluded_extents() is not necessary for btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable() difficult to reuse. Move it out and call btrfs_free_excluded_extents() in the proper context. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a09f23c3 |
|
23-Aug-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename and switch to bool btrfs_chunk_readonly btrfs_chunk_readonly() checks if the given chunk is writeable. It returns 1 for readonly and 0 for writeable. So the return type bool suffices instead of the current type int. Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable(), as we check if the bg is writeable, which helps keep the logic in the parent function simpler to understand. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f6f39f7a |
|
18-Aug-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk The user facing function used to allocate new chunks is btrfs_chunk_alloc; unfortunately there is yet another similar sounding function - btrfs_alloc_chunk. This creates confusion, especially since the latter function can be considered "private", in the sense that it implements the first stage of chunk creation and as such is called by btrfs_chunk_alloc. To avoid the awkwardness that comes with having similarly named but functionally distinct functions, rename btrfs_alloc_chunk to btrfs_create_chunk, given that the main purpose of this function is to orchestrate the whole process of allocating a chunk - reserving space on devices, deciding on characteristics of the stripe size and creating the in-memory structures. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba86dd9f |
|
08-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: suppress reclaim error message on EAGAIN btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are running. The message can make btrfs/187 fail and it's unnecessary, because we add the block group back to the reclaim list anyway:

  btrfs_reclaim_bgs_work()
  `-> btrfs_relocate_chunk()
      `-> btrfs_relocate_block_group()
          `-> reloc_chunk_start()
              `-> if (fs_info->send_in_progress)
                  `-> return -EAGAIN

CC: stable@vger.kernel.org # 5.13+ Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b29726c |
|
18-Jul-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: rescue: allow ibadroots to skip bad extent tree when reading block group items When the extent tree gets corrupted, normally it's not the extent tree root, but one toasted tree leaf/node. In that case, the rescue=ibadroots mount option won't help, as it can only handle corruption of the extent tree root. This patch enhances the behavior by:

- Allowing fill_dummy_bgs() to ignore the -EEXIST error This means we may have some block group items read from disk, but then hit some error halfway.

- Falling back to fill_dummy_bgs() if any error gets hit in btrfs_read_block_groups()

Of course, this still needs the rescue=ibadroots mount option. With that, rescue=ibadroots can handle extent tree corruption more gracefully and allow a better chance of recovery. Reported-by: Zhenyu Wu <wuzy001@gmail.com> Link: https://www.spinics.net/lists/linux-btrfs/msg114424.html Reviewed-by: Su Yue <l@damenly.su> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2eadb9e7 |
|
04-Jul-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_finish_chunk_alloc private to block-group.c One of the final things that must be done to add a new chunk is inserting its device extent items in the device tree. They describe the portions of device physical space that were allocated during phase 1 of chunk allocation. This is currently done in btrfs_finish_chunk_alloc, whose name isn't very informative. What's more, this function is only used in block-group.c but is defined as public. There isn't anything special about it that would warrant it being defined in volumes.c. Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to block-group.c, make the former static and rename both functions to insert_dev_extents and insert_dev_extent respectively. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9cc0b837 |
|
05-Jul-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: don't block if we can't acquire the reclaim lock If we can't acquire the reclaim_bgs_lock on block group reclaim, we block until it is free. This can potentially stall for a long time. While reclaim of block groups is necessary for a good user experience on a zoned file system, there still is no need to block as it is best effort only, just like when we're deleting unused block groups. CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
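A minimal sketch of the change, assuming the work function simply gives up when the lock is contended:

  /* Best effort: don't stall behind a long-running reclaim or delete. */
  if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
          return; /* the block groups stay queued for a later run */

  /* ... reclaim the queued block groups ... */

  mutex_unlock(&fs_info->reclaim_bgs_lock);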
|
#
79bd3712 |
|
29-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K. However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common. This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way we do not have chunk space reserved until the second phase completes. The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree. For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree. Changes to the chunk btree are by and large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do. The diff of this change is huge, but more than half of it is just addition of comments describing both how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change. CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cb3db1c |
|
29-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock with concurrent chunk allocations involving system chunks When a task attempting to allocate a new chunk verifies that there is not currently enough free space in the system space_info and there is another task that allocated a new system chunk but has not yet finished creating the respective block group, it waits for that other task to finish creating the block group. This is to avoid exhaustion of the system chunk array in the superblock, which is limited, when we have a thundering herd of tasks allocating new chunks. This problem was described and fixed by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations"). However there are two very similar scenarios where this can lead to a deadlock:

1) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. However before task B ends its transaction handle and finishes the creation of the system block group, it attempts to allocate another chunk (like a data chunk for an fallocate operation for a very large range). Task B will be unable to progress and allocate the new chunk, because task A set space_info->chunk_alloc to 1 and therefore it loops at btrfs_chunk_alloc() waiting for task A to finish its chunk allocation and set space_info->chunk_alloc to 0, but task A is waiting on task B to finish creation of the new system block group, therefore resulting in a deadlock;

2) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. By the time task B enters the final phase of block group allocation, which happens at btrfs_create_pending_block_groups(), when it modifies the extent tree, the device tree or the chunk tree to insert the items for some new block group, it needs to allocate a new chunk, so it ends up at btrfs_chunk_alloc() and keeps looping there because task A has set space_info->chunk_alloc to 1, but task A is waiting for task B to finish creation of the new system block group and release the reserved system space, therefore resulting in a deadlock.

In short, the problem arises if a task B needs to allocate a new chunk after it previously allocated a new system chunk and if another task A is currently waiting for task B to complete the allocation of the new system chunk. Unfortunately this deadlock scenario introduced by the previous fix for the system chunk array exhaustion problem does not have a simple and short fix, and requires a big change to rework the chunk allocation code so that chunk btree updates are all made in the first phase of chunk allocation. And since this deadlock regression is being frequently hit on zoned filesystems and the system chunk array exhaustion problem is triggered in more extreme cases (originally observed on PowerPC with a node size of 64K when running the fallocate tests from stress-ng), revert the changes from that commit. The next patch in the series, with a subject of "btrfs: rework chunk allocation to avoid exhaustion of the system chunk array", does the necessary changes to fix the system chunk array exhaustion problem.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/ Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f93e776 |
|
28-Jun-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: print unusable percentage when reclaiming block groups When we're automatically reclaiming a zone, because its zone_unusable value is above the reclaim threshold, we only log what percentage of the zone's capacity is used, but not how much of the capacity is unusable. Also print the percentage of unusable space in the block group before reclaiming it. Example: BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
54afaae3 |
|
23-Jun-2021 |
David Sterba <dsterba@suse.com> |
btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work The types in the calculation of the used percentage in the reclaiming messages are both u64, though bg->length is either 1GiB (non-zoned) or the zone size in zoned mode. The upper limit on the zone size is 8GiB, so this could theoretically overflow in the future; right now the values fit. Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
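For reference, a 64-bit safe way to compute such a percentage is the div64 helper family rather than a plain "/", which does not link for u64 operands on 32-bit. A sketch:

  #include <linux/math64.h>

  /* bg->length is at most 8GiB, so multiplying by 100 still fits in a u64. */
  u64 used_pct = div64_u64(bg->used * 100, bg->length);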
|
#
138a12d8 |
|
22-Jun-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rip out btrfs_space_info::total_bytes_pinned We used this in may_commit_transaction() in order to determine if we needed to commit the transaction. However, we no longer have that logic and thus have no use for this counter anymore, so delete it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cea5cf0 |
|
21-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: ensure relocation never runs while we have send operations running Relocation and send do not play well together, because while send is running a block group can be relocated, a transaction committed and the respective disk extents get re-allocated and written to or discarded while send is about to do something with the extents. This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"), which prevented balance and send from running in parallel but did not address one remaining case where chunk relocation can happen: shrinking a device (and device deletion, which shrinks a device's size to 0 before deleting the device). We now also have one more case where relocation is triggered: on zoned filesystems partially used block groups get relocated by a background thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically reclaim zones"). So make sure that instead of preventing balance from running when there are ongoing send operations, we prevent relocation from happening. This uses the infrastructure recently added by a patch that has the subject: "btrfs: add cancellable chunk relocation support". Also it adds a spinlock used exclusively for the exclusivity between send and relocation, as before fs_info->balance_mutex was used, which would make an attempt to run send block waiting for balance to finish, which can take a lot of time on large filesystems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0044ae11 |
|
13-Apr-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: make free space cache size consistent across different PAGE_SIZE Currently the free space cache inode size is determined by two factors:

- block group size
- PAGE_SIZE

This means that for the same sized block groups, different PAGE_SIZE values will result in different inode sizes. This will not be a good thing for subpage support, so change the requirement from PAGE_SIZE to sectorsize. Now, for the same 4K sectorsize btrfs, it should result in the same inode size no matter what the PAGE_SIZE is. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
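As a rough illustration, assuming the long-standing sizing scheme in cache_save_setup() (about 16 units per 256MiB of block group; variable names illustrative):

    /*
     * Sketch: size the v1 free space cache inode from the block group
     * length, scaling by sectorsize instead of the host's PAGE_SIZE so
     * the result is identical on all 4K-sectorsize filesystems.
     */
    u64 cache_size = div_u64(block_group->length, SZ_256M);

    if (!cache_size)
            cache_size = 1;
    cache_size *= 16;
    cache_size *= fs_info->sectorsize;    /* previously: PAGE_SIZE */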
#
f9f28e5b |
|
16-Jun-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: fix negative space_info->bytes_readonly Consider a block group in use on zoned btrfs. |<- ZU ->|<- used ->|<---free--->| `- Alloc offset ZU: Zone unusable Marking the block group read-only will migrate the zone unusable bytes to the read-only bytes. So, we will have this. |<- RO ->|<- used ->|<--- RO --->| RO: Read only When marking it back to read-write, btrfs_dec_block_group_ro() subtracts the above "RO" bytes from the space_info->bytes_readonly. And, it moves the zone unusable bytes back and again subtracts those bytes from the space_info->bytes_readonly, leading to negative bytes_readonly. This can be observed in the output as e.g.: Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688 This commit fixes the issue by reordering the operations. Link: https://github.com/naota/linux/issues/37 Reported-by: David Sterba <dsterba@suse.com> Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
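A sketch of the reordered accounting in btrfs_dec_block_group_ro(), assuming the field names mentioned above (illustrative, not the verbatim fix): migrate the zone-unusable share out of bytes_readonly first, then subtract only the genuinely free remainder:

    spin_lock(&sinfo->lock);
    spin_lock(&cache->lock);
    if (!--cache->ro) {
            if (btrfs_is_zoned(cache->fs_info)) {
                    /* Migrate the RO bytes that are really zone-unusable first. */
                    cache->zone_unusable = cache->alloc_offset - cache->used;
                    sinfo->bytes_zone_unusable += cache->zone_unusable;
                    sinfo->bytes_readonly -= cache->zone_unusable;
            }
            /* Only the remaining, actually free bytes leave bytes_readonly. */
            num_bytes = cache->length - cache->reserved - cache->pinned -
                        cache->bytes_super - cache->zone_unusable - cache->used;
            sinfo->bytes_readonly -= num_bytes;
    }
    spin_unlock(&cache->lock);
    spin_unlock(&sinfo->lock);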
#
18bb8bbf |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: automatically reclaim zones When a file gets deleted on a zoned file system, the space freed is not returned to the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer, it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold, which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and the reclaim process is then triggered on a transaction commit, but only after unused block groups have been deleted, which frees space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
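A sketch of the threshold check, assuming a per-filesystem percentage field bg_reclaim_threshold as described (default 75; names illustrative):

    /* Sketch: queue a zoned block group for reclaim past the threshold. */
    static void maybe_queue_reclaim(struct btrfs_block_group *bg)
    {
            struct btrfs_fs_info *fs_info = bg->fs_info;
            int thresh = READ_ONCE(fs_info->bg_reclaim_threshold);

            if (!btrfs_is_zoned(fs_info) || !thresh)
                    return;
            if (bg->zone_unusable >= div_u64(bg->length * thresh, 100))
                    btrfs_mark_bg_to_reclaim(bg);
    }

The tunable itself appears under /sys/fs/btrfs/<UUID>/bg_reclaim_threshold (path as of this series).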
#
f3372065 |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock As a preparation for extending the block group deletion use case, rename the unused_bgs_mutex to reclaim_bgs_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eafa4fd0 |
|
31-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix exhaustion of the system chunk array due to concurrent allocations When we are running out of space for updating the chunk tree, that is, when we are low on available space in the system space info, if we have many tasks concurrently allocating block groups, via fallocate for example, many of them can end up all allocating new system chunks when only one is needed. In extreme cases this can lead to exhaustion of the system chunk array, which has a size limit of 2048 bytes, and results in a transaction abort with errno EFBIG, producing a trace in dmesg like the following, which was triggered on a PowerPC machine with a node/leaf size of 64K: [1359.518899] ------------[ cut here ]------------ [1359.518980] BTRFS: Transaction aborted (error -27) [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519152] Modules linked in: (...) [1359.519239] Supported: Yes, External [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3 [1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8 [1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default) [1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007 [1359.519356] CFAR: c00000000013e170 IRQMASK: 0 [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026 [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007 [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000 [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0 [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000 [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0 [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08 [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] [1359.519626] Call Trace: [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable) [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs] [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs] [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs] [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350 [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0 [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40 [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170 [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278 [1359.520037] Instruction dump: [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000 [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378 [1359.520122] ---[ end trace d6c186e151022e20 ]--- The following steps explain how we can end up in this situation: 1) Task A is at check_system_chunk(), either because it is allocating a new data or metadata block group, at btrfs_chunk_alloc(), or because
it is removing a block group or turning a block group RO. It does not matter why; 2) Task A sees that there is not enough free space in the system space_info object, that is 'left' is < 'thresh'. And at this point the system space_info has a value of 0 for its 'bytes_may_use' counter; 3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate a new system block group (chunk) and then reserves 'thresh' bytes in the chunk block reserve with the call to btrfs_block_rsv_add(). This changes the chunk block reserve's 'reserved' and 'size' counters by an amount of 'thresh', and changes the 'bytes_may_use' counter of the system space_info object from 0 to 'thresh'. Also during its call to btrfs_alloc_chunk(), we end up increasing the value of the 'total_bytes' counter of the system space_info object by 8MiB (the size of a system chunk stripe). This happens through the call chain: btrfs_alloc_chunk() create_chunk() btrfs_make_block_group() btrfs_update_space_info() 4) After it finishes the first phase of the block group allocation, at btrfs_chunk_alloc(), task A unlocks the chunk mutex; 5) At this point the new system block group was added to the transaction handle's list of new block groups, but its block group item, device items and chunk item were not yet inserted in the extent, device and chunk trees, respectively. That only happens later when we call btrfs_finish_chunk_alloc() through a call to btrfs_create_pending_block_groups(); Note that only when we update the chunk tree, through the call to btrfs_finish_chunk_alloc(), do we decrement the 'reserved' counter of the chunk block reserve as we COW/allocate extent buffers, through: btrfs_alloc_tree_block() btrfs_use_block_rsv() btrfs_block_rsv_use_bytes() And the system space_info's 'bytes_may_use' is decremented every time we allocate an extent buffer for COW operations on the chunk tree, through: btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() btrfs_add_reserved_bytes() If we end up COWing fewer chunk btree nodes/leaves than expected, which is the typical case since the amount of space we reserve is always pessimistic to account for the worst possible case, we release the unused space through: btrfs_create_pending_block_groups() btrfs_trans_release_chunk_metadata() btrfs_block_rsv_release() block_rsv_release_bytes() btrfs_space_info_free_bytes_may_use() But before task A gets into btrfs_create_pending_block_groups()... 6) Many other tasks start allocating new block groups through fallocate, each one does the first phase of block group allocation in a serialized way, since btrfs_chunk_alloc() takes the chunk mutex before calling check_system_chunk() and btrfs_alloc_chunk(). However before everyone enters the final phase of the block group allocation, that is, before calling btrfs_create_pending_block_groups(), new tasks keep coming to allocate new block groups and while at check_system_chunk(), the system space_info's 'bytes_may_use' keeps increasing each time a task reserves space in the chunk block reserve. This means that eventually some other task can end up not seeing enough free space in the system space_info and decide to allocate yet another system chunk. This may repeat several times if yet more new tasks keep allocating new block groups before task A, and all the other tasks, finish the creation of the pending block groups, which is when reserved space in excess is released.
Eventually this can result in exhaustion of the system chunk array in the superblock, with btrfs_add_system_chunk() returning EFBIG, resulting later in a transaction abort. Even when we don't reach the extreme case of exhausting the system array, most, if not all, unnecessarily created system block groups end up being unused since when finishing creation of the first pending system block group, the creation of the following ones ends up not needing to COW nodes/leaves of the chunk tree, so we never allocate and deallocate from them, resulting in them never being added to the list of unused block groups - as a consequence they don't get deleted by the cleaner kthread - the only exceptions are if we unmount and mount the filesystem again, which adds any unused block groups to the list of unused block groups, if a scrub is run, which also adds unused block groups to the unused list, and under some circumstances when using a zoned filesystem or async discard, which may also add unused block groups to the unused list. So fix this by: *) Tracking the number of reserved bytes for the chunk tree per transaction, which is the sum of reserved chunk bytes by each transaction handle currently being used; *) When there is not enough free space in the system space_info, if there are other transaction handles which reserved chunk space, wait for some of them to complete in order to have enough excess reserved space released, and then try again. Otherwise proceed with the creation of a new system chunk. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b6e9f16c |
|
17-Feb-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: replace open coded while loop with proper construct btrfs_inc_block_group_ro wants to ensure that the current transaction is not running dirty block groups; if it is, it waits and loops again. That logic is currently implemented using a goto label. Using a proper do {} while() construct doesn't hurt readability or introduce excessive nesting, and it makes the relevant code stand out by being encompassed in the loop construct. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
195a49ea |
|
04-Feb-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between writes to swap files and scrub When we activate a swap file, at btrfs_swap_activate(), we acquire the exclusive operation lock to prevent the physical location of the swap file extents from being changed by operations such as balance and device replace/resize/remove. There we also call can_nocow_extent() which, among other things, checks if the block group of a swap file extent is currently RO, and if it is we can not use the extent, since a write into it would result in COWing the extent. However we have no protection against a scrub operation running after we activate the swap file, which can result in the swap file extents being COWed while the scrub is running and operating on the respective block group, because scrub turns a block group into RO before it processes it and then back again to RW mode after processing it. That means an attempt to write into a swap file extent while scrub is processing the respective block group will result in COWing the extent, changing its physical location on disk. Fix this by making sure that block groups that have extents that are used by active swap files can not be turned into RO mode, therefore making it not possible for a scrub to turn them into RO mode. When a scrub finds a block group that can not be turned to RO due to the existence of extents used by swap files, it proceeds to the next block group and logs a warning message that mentions the block group was skipped due to active swap files - this is the same approach we currently use for balance. Fixes: ed46ff3d42378 ("Btrfs: support swap files") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
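A sketch of the two halves of this fix, assuming a per-block-group swap_extents counter as the message implies (illustrative, not the verbatim patch): swap activation bumps the counter unless the group is already RO, and RO conversion refuses while the counter is non-zero:

    /* Swap activation side: refuse if the block group is already RO. */
    bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
    {
            bool ret = true;

            spin_lock(&bg->lock);
            if (bg->ro)
                    ret = false;
            else
                    bg->swap_extents++;
            spin_unlock(&bg->lock);

            return ret;
    }

    /* RO conversion side (e.g. scrub); helper name illustrative. */
    static int inc_block_group_ro_checked(struct btrfs_block_group *bg)
    {
            if (bg->swap_extents)
                    return -ETXTBSY;    /* caller logs a warning and moves on */
            /* ... proceed with the normal RO conversion ... */
            return 0;
    }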
#
40ab3be1 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group This is the 1/3 patch to enable tree log on zoned filesystems. The tree-log feature does not work on a zoned filesystem as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks and btrfs writes and syncs the tree-log blocks to devices at the time of fsync(), which has a different timing than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes that zoned filesystems must avoid. Introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks can be separate write streams. As a result, each write stream can now be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group and assigns "treelog_bg" on-demand on tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
138082f3 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: extend btrfs_rmap_block for specifying a device btrfs_rmap_block currently reverse-maps the physical addresses on all devices to the corresponding logical addresses. Extend the function to match a specified device. The old functionality of querying all devices is left intact by specifying NULL as the target device. A block_device instead of a btrfs_device is passed into btrfs_rmap_block, as this function is intended to reverse-map the result of a bio, which only has a block_device. Also export the function for later use. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
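The resulting shape of the interface, sketched from the description (prototype illustrative):

    /*
     * Reverse-map a physical address to the logical addresses that refer
     * to it. A NULL bdev keeps the old behaviour of matching stripes on
     * all devices; otherwise only stripes on that device contribute.
     */
    int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
                         struct block_device *bdev, u64 physical,
                         u64 **logical, int *naddrs, int *stripe_len);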
#
dcba6e48 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: reset zones of unused block groups We must reset the zones of a deleted unused block group to rewind the zones' write pointers to the zones' start. To do this, we can use the DISCARD_SYNC code to do the reset when the filesystem is running on zoned devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2eda5708 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: implement sequential extent allocation Implement a sequential extent allocator for zoned filesystems. This allocator only needs to check if there is enough space in the block group after the allocation pointer to satisfy the extent allocation request. Therefore the allocator never manages bitmaps or clusters. Also, add assertions to the corresponding functions. As zone append writing is used, it would be unnecessary to track the allocation offset, as the allocator only needs to check available space. But by tracking and returning the offset as an allocated region, we can skip modification of ordered extents and checksum information when there is no IO reordering. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
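A sketch of the allocator's core check, assuming the alloc_offset field from the earlier patches (illustrative; the real function also interacts with the free-space accounting):

    /*
     * Sketch: sequential allocation in a zoned block group. No bitmaps or
     * clusters; just check whether enough bytes remain past the allocation
     * pointer and advance it.
     */
    static int do_allocation_zoned(struct btrfs_block_group *bg, u64 num_bytes,
                                   u64 *ret_offset)
    {
            u64 avail;

            spin_lock(&bg->lock);
            avail = bg->length - bg->alloc_offset;
            if (avail < num_bytes) {
                    spin_unlock(&bg->lock);
                    return -ENOSPC;
            }
            *ret_offset = bg->start + bg->alloc_offset;
            bg->alloc_offset += num_bytes;
            spin_unlock(&bg->lock);
            return 0;
    }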
#
169e0da9 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: track unusable bytes for zones In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But when an allocated region is returned before being used, e.g. because the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For that situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on a regular filesystem. On zoned filesystems, it rewinds the allocation offset. Because the read-only bytes track free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
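How the unusable count can be derived when a zoned block group is loaded, per the description above (a sketch under those assumptions; the once-written-then-freed space is whatever sits behind the allocation pointer without being referenced):

    /* Sketch: space behind alloc_offset that is not used is unusable. */
    static u64 calc_zone_unusable(const struct btrfs_block_group *bg)
    {
            if (!btrfs_is_zoned(bg->fs_info))
                    return 0;
            /* Free space lies past the pointer: bg->length - bg->alloc_offset. */
            return bg->alloc_offset - bg->used;
    }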
#
a94794d5 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: calculate allocation offset for conventional zones Conventional zones do not have a write pointer, so we cannot use it to determine the allocation offset for sequential allocation if a block group contains a conventional zone. Instead, we can consider the end of the highest addressed extent in the block group as the allocation offset. For a new block group, we cannot calculate the allocation offset by consulting the extent tree, because doing so can cause a deadlock by taking an extent buffer lock after the chunk mutex, which is already taken in btrfs_make_block_group(). Since it is a new block group anyway, we can simply set the allocation offset to 0. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
08e11a3d |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: load zone's allocation offset A zoned filesystem must allocate blocks at the zones' write pointer. The device's write pointer position can be mapped to a logical address within a block group. To facilitate this, add an "alloc_offset" to the block-group to track the logical address of the write pointer. This logical address is populated in btrfs_load_block_group_zone_info() from the write pointers of corresponding zones. For now, zoned filesystems support only the single profile. Supporting non-single profiles with zone append writing is not trivial. For example, in the DUP profile, we send a zone append writing IO to two zones on a device. The device replies with the written LBAs for the IOs. If the offsets of the returned addresses from the beginning of the zone are different, then it results in different logical addresses. We would need a fine-grained logical-to-physical mapping to support such diverging physical addresses. Since that would require an additional metadata type, disable non-single profiles for now. This commit supports the case where all the zones in a block group are sequential. The next patch will handle the case of having a conventional zone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4afd2fe8 |
|
04-Feb-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: release path before calling to btrfs_load_block_group_zone_info Since we have no write pointer in conventional zones, we cannot determine the allocation offset from it. Instead, we set the allocation offset after the highest addressed extent. This is done by reading the extent tree in btrfs_load_block_group_zone_info(). However, this function is called from btrfs_read_block_groups(), so the read lock for the tree node could be recursively taken. To avoid this unsafe locking scenario, release the path before reading the extent tree to get the allocation offset. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ddfd08cb |
|
18-Dec-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not block on deleted bgs mutex in the cleaner While running some stress tests I started getting hung task messages. This is because the delete unused block groups code has to take the delete_unused_bgs_mutex to do its work, which is taken by balance to make sure we don't delete block groups while we're balancing. The problem is that balance can take a while, and so we were getting hung task warnings. We don't need to block and run these things, and the cleaner is needed to do other work, so trylock on this mutex and just bail if we can't acquire it right away. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
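The change itself is essentially a one-liner; a sketch of the pattern (assuming the pre-rename mutex name, and simplified to the top of the function):

    /* Sketch: the cleaner bails instead of sleeping behind balance. */
    void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
    {
            if (!mutex_trylock(&fs_info->delete_unused_bgs_mutex))
                    return;    /* balance holds it; retry on the next cleaner pass */
            /* ... delete unused block groups ... */
            mutex_unlock(&fs_info->delete_unused_bgs_mutex);
    }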
#
938fcbfb |
|
14-Jan-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: splice remaining dirty_bg's onto the transaction dirty bg list While doing error injection testing with my relocation patches I hit the following assert: assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3357! invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 PID: 24351 Comm: umount Tainted: G W 5.10.0-rc3+ #193 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282 RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000 RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050 RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000 R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100 R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100 FS: 00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0 Call Trace: btrfs_free_block_groups.cold+0x55/0x55 close_ctree+0x2c5/0x306 ? fsnotify_destroy_marks+0x14/0x100 generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1b1/0x1d0 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happened because I injected an error in btrfs_cow_block() while running the dirty block groups. When we run the dirty block groups, we splice the list onto a local list to process. However if an error occurs, we only clean up the transaction's dirty block group list, not any pending block groups we have on our locally spliced list. In fact if we fail to allocate a path in this function we'll also fail to clean up the splice list. Fix this by splicing the list back onto the transaction dirty block group list so that the block groups are cleaned up. Then add an 'out' label and have the error conditions jump to out so that the errors are handled properly. This also has the side-effect of fixing a problem where we would clear 'ret' on error because we unconditionally ran btrfs_run_delayed_refs(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2187374f |
|
15-Jan-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself Currently we pass things around to figure out if we are maybe freeing data based on the state of the delayed refs head. This makes the accounting sort of confusing and hard to follow, as it's distinctly separate from the delayed ref heads stuff, but also depends on it entirely. Fix this by explicitly adjusting the space_info->total_bytes_pinned in the delayed refs code. We now have two places where we modify this counter, once where we create and destroy the delayed refs, and once when we pin and unpin the extents. This means there is a slight overlap between delayed refs and the pin/unpin mechanisms, but this is simply used by the ENOSPC infrastructure to determine if we need to commit the transaction, so there's no adverse effect from this; we might simply commit thinking it will give us enough space when it might not. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ee9b979 |
|
22-Jan-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: document fs_info in btrfs_rmap_block Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f96e402 |
|
15-Jan-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix possible free space tree corruption with online conversion While running btrfs/011 in a loop I would often ASSERT() while trying to add a new free space entry that already existed, or get an EEXIST while adding a new block to the extent tree, which is another indication of double allocation. This occurs because when we do the free space tree population, we create the new root and then populate the tree and commit the transaction. The problem is when you create a new root, the root node and commit root node are the same. During this initial transaction commit we will run all of the delayed refs that were paused during the free space tree generation, and thus begin to cache block groups. While caching block groups the caching thread will be reading from the main root for the free space tree, so as we make allocations we'll be changing the free space tree, which can cause us to add the same range twice which results in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a variety of different errors when running delayed refs because of a double allocation. Fix this by marking the fs_info as unsafe to load the free space tree, and fall back on the old slow method. We could be smarter than this, for example caching the block group while we're populating the free space tree, but since this is a serious problem I've opted for the simplest solution. CC: stable@vger.kernel.org # 4.9+ Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34d1eb0e |
|
16-Dec-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't clear ret in btrfs_start_dirty_block_groups If we fail to update a block group item in the loop we'll break; however, we'll still do btrfs_run_delayed_refs and lose our error value in ret, and thus not clean up properly. Fix this by only running the delayed refs if there was no failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
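The shape of the fix, sketched (the count argument is illustrative):

    /* Sketch: preserve the loop's errno instead of overwriting it. */
    if (!ret)
            ret = btrfs_run_delayed_refs(trans, (unsigned long)-1);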
#
af456a2c |
|
18-Nov-2020 |
Boris Burkov <boris@bur.io> |
btrfs: skip space_cache v1 setup when not using it If we are not using space cache v1, we should not create the free space object or free space inodes. This comes up when we delete the existing free space objects/inodes when migrating to v2, only to see them get recreated for every dirtied block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
36b216c8 |
|
18-Nov-2020 |
Boris Burkov <boris@bur.io> |
btrfs: remove free space items when disabling space cache v1 When the filesystem transitions from space cache v1 to v2 or to nospace_cache, it removes the old cached data, but does not remove the FREE_SPACE items or the free space inodes they point to. This doesn't cause any issues besides being a bit inefficient, since these items no longer do anything useful. To fix it, when we are mounting, and plan to disable the space cache, destroy each block group's free space item and free space inode. The code to remove the items is lifted from the existing use case of removing the block group, with a light adaptation to handle whether or not we have already looked up the free space inode. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
997e3e2e |
|
18-Nov-2020 |
Boris Burkov <boris@bur.io> |
btrfs: only mark bg->needs_free_space if free space tree is on If we attempt to create a free space tree while any block groups have needs_free_space set, we will double add the new free space item and hit EEXIST. Previously, we only created the free space tree on a new mount, so we never hit the case, but if we try to create it on a remount, such block groups could exist and trip us up. We don't do anything with this field unless the free space tree is enabled, so there is no harm in not setting it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
12659251 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: implement log-structured superblock for ZONED mode Superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite in a sequential write required zone, we cannot place the superblock in such a zone. One easy solution is limiting the superblock and its copies to be placed only in conventional zones. However, this method has two downsides: one is a reduced number of superblock copies. The location of the second copy of the superblock is 256GB, which is in a sequential write required zone on typical devices in the market today. So, the number of superblock copies is limited to two. The second downside is that we cannot support devices which have no conventional zones at all. To solve these two problems, we employ superblock log writing. It uses two adjacent zones as a circular buffer to write updated superblocks. Once the first zone is filled up, start writing into the second one. Then, when both zones are filled up and before starting to write to the first zone again, it resets the first zone. We can determine the position of the latest superblock by reading the write pointer information from a device. One corner case is when both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs. - The primary superblock: zones 0 and 1 - The first copy: zones 16 and 17 - The second copy: zone 1024 or the zone at 256GB, whichever comes first, and the zone next to it If these reserved zones are conventional, the superblock is written at a fixed location at the start of the zone, without logging. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
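A sketch of how the latest superblock position can be located from the two log zones, following the scheme above; the helpers zone_is_full() and compare_last_super_generations() are illustrative stand-ins, not real kernel APIs:

    /* Sketch: find where the newest superblock lives in the two-zone log. */
    static int sb_log_location(struct blk_zone *zones, u64 *wp_ret)
    {
            if (!zone_is_full(&zones[0])) {
                    *wp_ret = zones[0].wp;    /* still filling zone 0 */
                    return 0;
            }
            if (!zone_is_full(&zones[1])) {
                    *wp_ret = zones[1].wp;    /* zone 0 full, filling zone 1 */
                    return 0;
            }
            /*
             * Both zones full: read the last superblock of each and let the
             * higher generation decide which zone holds the newest copy.
             */
            return compare_last_super_generations(zones, wp_ret);
    }

In the non-full cases the newest superblock is the one written just before the returned write pointer.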
#
9a56fcd1 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_update_inode take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bbb86a37 |
|
23-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: protect fs_info->caching_block_groups by block_group_cache_lock I got the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.9.0+ #101 Not tainted ------------------------------------------------------ btrfs-cleaner/3445 is trying to acquire lock: ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170 but task is already holding lock: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->commit_root_sem){++++}-{3:3}: down_write+0x3d/0x70 btrfs_cache_block_group+0x2d5/0x510 find_free_extent+0xb6e/0x12f0 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb1/0x330 alloc_tree_block_no_bg_flush+0x4f/0x60 __btrfs_cow_block+0x11d/0x580 btrfs_cow_block+0x10c/0x220 commit_cowonly_roots+0x47/0x2e0 btrfs_commit_transaction+0x595/0xbd0 sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1df/0x200 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&space_info->groups_sem){++++}-{3:3}: down_read+0x40/0x130 find_free_extent+0x2ed/0x12f0 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb1/0x330 alloc_tree_block_no_bg_flush+0x4f/0x60 __btrfs_cow_block+0x11d/0x580 btrfs_cow_block+0x10c/0x220 commit_cowonly_roots+0x47/0x2e0 btrfs_commit_transaction+0x595/0xbd0 sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1df/0x200 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (btrfs-root-00){++++}-{3:3}: __lock_acquire+0x1167/0x2150 lock_acquire+0xb9/0x3d0 down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x32/0x170 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x614/0x9d0 btrfs_find_root+0x35/0x1b0 btrfs_read_tree_root+0x61/0x120 btrfs_get_root_ref+0x14b/0x600 find_parent_nodes+0x3e6/0x1b30 btrfs_find_all_roots_safe+0xb4/0x130 btrfs_find_all_roots+0x60/0x80 btrfs_qgroup_trace_extent_post+0x27/0x40 btrfs_add_delayed_data_ref+0x3fd/0x460 btrfs_free_extent+0x42/0x100 __btrfs_mod_ref+0x1d7/0x2f0 walk_up_proc+0x11c/0x400 walk_up_tree+0xf0/0x180 btrfs_drop_snapshot+0x1c7/0x780 btrfs_clean_one_deleted_snapshot+0xfb/0x110 cleaner_kthread+0xd4/0x140 kthread+0x13a/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->commit_root_sem); lock(&space_info->groups_sem); lock(&fs_info->commit_root_sem); lock(btrfs-root-00); *** DEADLOCK *** 3 locks held by btrfs-cleaner/3445: #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140 #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0 #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80 stack backtrace: CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 
__lock_acquire+0x1167/0x2150 ? __bfs+0x42/0x210 lock_acquire+0xb9/0x3d0 ? __btrfs_tree_read_lock+0x32/0x170 down_read_nested+0x43/0x130 ? __btrfs_tree_read_lock+0x32/0x170 __btrfs_tree_read_lock+0x32/0x170 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x614/0x9d0 ? find_held_lock+0x2b/0x80 btrfs_find_root+0x35/0x1b0 ? do_raw_spin_unlock+0x4b/0xa0 btrfs_read_tree_root+0x61/0x120 btrfs_get_root_ref+0x14b/0x600 find_parent_nodes+0x3e6/0x1b30 btrfs_find_all_roots_safe+0xb4/0x130 btrfs_find_all_roots+0x60/0x80 btrfs_qgroup_trace_extent_post+0x27/0x40 btrfs_add_delayed_data_ref+0x3fd/0x460 btrfs_free_extent+0x42/0x100 __btrfs_mod_ref+0x1d7/0x2f0 walk_up_proc+0x11c/0x400 walk_up_tree+0xf0/0x180 btrfs_drop_snapshot+0x1c7/0x780 ? btrfs_clean_one_deleted_snapshot+0x73/0x110 btrfs_clean_one_deleted_snapshot+0xfb/0x110 cleaner_kthread+0xd4/0x140 ? btrfs_alloc_root+0x50/0x50 kthread+0x13a/0x150 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 while testing another lockdep fix. This happens because we're using the commit_root_sem to protect fs_info->caching_block_groups, which creates a dependency on the groups_sem -> commit_root_sem, which is problematic because we will allocate blocks while holding tree roots. Fix this by making the list itself protected by the fs_info->block_group_cache_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e747853c |
|
23-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: load free space cache asynchronously While documenting the usage of the commit_root_sem, I noticed that we do not actually take the commit_root_sem in the case of the free space cache. This is problematic because we're supposed to hold that sem while we're reading the commit roots, which is what we do for the free space cache. The reason I did it inline when I originally wrote the code was because there's the case of unpinning where we need to make sure that the free space cache is loaded if we're going to use the free space cache. But we can accomplish the same thing by simply waiting for the cache to be loaded. Rework this code to load the free space cache asynchronously. This allows us to greatly cleanup the caching code because now it's all shared by the various caching methods. We also are now in a position to have the commit_root semaphore held while we're loading the free space cache. And finally our modification of ->last_byte_to_unpin is removed because it can be handled in the proper way on commit. Some care must be taken when replaying the log, when we expect that the free space cache will be read entirely before we start excluding space to replay. This could lead to overwriting space during replay. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cd79909b |
|
23-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: load free space cache into a temporary ctl The free space cache has been special in that we would load it right away instead of farming the work off to a worker thread. This resulted in some weirdness that had to be taken into account for this fact, namely that if we ever found a block group being cached the fast way we had to wait for it to finish, because we could get the cache before it had been validated and we may throw the cache away. To handle this particular case instead create a temporary btrfs_free_space_ctl to load the free space cache into. Then once we've validated that it makes sense, copy its contents into the actual block_group->free_space_ctl. This allows us to avoid the problems of needing to wait for the caching to complete, we can clean up the discard extent handling stuff in __load_free_space_cache, and we no longer need to do the merge_space_tree() because the space is added one by one into the real free_space_ctl. This will allow further reworks of how we handle loading the free space cache. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42437a63 |
|
16-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce mount option rescue=ignorebadroots In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
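For recovery the option is typically combined with a read-only mount; the device path and mount point below are illustrative:

    mount -o ro,rescue=ignorebadroots /dev/sdX /mnt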
#
7837fa88 |
|
14-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop the path before adding block group sysfs files Dave reported a problem with my rwsem conversion patch where we got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-default+ #1297 Not tainted ------------------------------------------------------ kswapd0/76 is trying to acquire lock: ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] but task is already holding lock: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (fs_reclaim){+.+.}-{0:0}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 fs_reclaim_acquire.part.0+0x25/0x30 kmem_cache_alloc+0x30/0x9c0 alloc_inode+0x81/0x90 iget_locked+0xcd/0x1a0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x210 sysfs_get_tree+0x1a/0x50 vfs_get_tree+0x1d/0xb0 path_mount+0x70f/0xa80 do_mount+0x75/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (kernfs_mutex){+.+.}-{3:3}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 __mutex_lock+0xa0/0xaf0 kernfs_add_one+0x23/0x150 kernfs_create_dir_ns+0x58/0x80 sysfs_create_dir_ns+0x70/0xd0 kobject_add_internal+0xbb/0x2d0 kobject_add+0x7a/0xd0 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs] btrfs_read_block_groups+0x1f1/0x8c0 [btrfs] open_ctree+0x981/0x1108 [btrfs] btrfs_mount_root.cold+0xe/0xb0 [btrfs] legacy_get_tree+0x2d/0x60 vfs_get_tree+0x1d/0xb0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] legacy_get_tree+0x2d/0x60 vfs_get_tree+0x1d/0xb0 path_mount+0x70f/0xa80 do_mount+0x75/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (btrfs-extent-00){++++}-{3:3}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 down_read_nested+0x45/0x220 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot+0x6d4/0xfd0 [btrfs] check_committed_ref+0x69/0x200 [btrfs] btrfs_cross_ref_exist+0x65/0xb0 [btrfs] run_delalloc_nocow+0x446/0x9b0 [btrfs] btrfs_run_delalloc_range+0x61/0x6a0 [btrfs] writepage_delalloc+0xae/0x160 [btrfs] __extent_writepage+0x262/0x420 [btrfs] extent_write_cache_pages+0x2b6/0x510 [btrfs] extent_writepages+0x43/0x90 [btrfs] do_writepages+0x40/0xe0 __writeback_single_inode+0x62/0x610 writeback_sb_inodes+0x20f/0x500 wb_writeback+0xef/0x4a0 wb_do_writeback+0x49/0x2e0 wb_workfn+0x81/0x340 process_one_work+0x233/0x5d0 worker_thread+0x50/0x3b0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 -> #1 (btrfs-fs-00){++++}-{3:3}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 down_read_nested+0x45/0x220 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot+0x6d4/0xfd0 [btrfs] btrfs_lookup_inode+0x3a/0xc0 [btrfs] __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs] __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs] __btrfs_run_delayed_items+0x8e/0x140 [btrfs] btrfs_commit_transaction+0x367/0xbc0 [btrfs] btrfs_mksubvol+0x2db/0x470 [btrfs] btrfs_mksnapshot+0x7b/0xb0 [btrfs] __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs] btrfs_ioctl+0xd0b/0x2690 [btrfs] __x64_sys_ioctl+0x6f/0xa0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add+0x91/0xc60 validate_chain+0xa6e/0x2a20 
__lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 __mutex_lock+0xa0/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x3cc/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x16d/0x3b0 shrink_slab+0xb1/0x2e0 shrink_node+0x230/0x6a0 balance_pgdat+0x325/0x750 kswapd+0x206/0x4d0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/76: #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0 #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x77/0x97 check_noncircular+0xff/0x110 ? save_trace+0x50/0x470 check_prev_add+0x91/0xc60 validate_chain+0xa6e/0x2a20 ? save_trace+0x50/0x470 __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] __mutex_lock+0xa0/0xaf0 ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] ? __lock_acquire+0x582/0xac0 ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] ? btrfs_evict_inode+0x30b/0x560 [btrfs] ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x3cc/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x16d/0x3b0 shrink_slab+0xb1/0x2e0 shrink_node+0x230/0x6a0 balance_pgdat+0x325/0x750 kswapd+0x206/0x4d0 ? finish_wait+0x90/0x90 ? balance_pgdat+0x750/0x750 kthread+0x137/0x150 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x1f/0x30 This happens because we are still holding the path open when we start adding the sysfs files for the block groups, which creates a dependency on fs_reclaim via the tree lock. Fix this by dropping the path before we start doing anything with sysfs. Reported-by: David Sterba <dsterba@suse.com> CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49ea112d |
|
01-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not create raid sysfs entries under any locks While running xfstests btrfs/177 I got the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc3+ #5 Not tainted ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc+0x37/0x270 alloc_inode+0x82/0xb0 iget_locked+0x10d/0x2c0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x240 sysfs_get_tree+0x16/0x40 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (kernfs_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 kernfs_add_one+0x23/0x150 kernfs_create_dir_ns+0x7a/0xb0 sysfs_create_dir_ns+0x60/0xb0 kobject_add_internal+0xc0/0x2c0 kobject_add+0x6e/0x90 btrfs_sysfs_add_block_group_type+0x102/0x160 btrfs_make_block_group+0x167/0x230 btrfs_alloc_chunk+0x54f/0xb80 btrfs_chunk_alloc+0x18e/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_insert_empty_items+0x64/0xb0 btrfs_new_inode+0x225/0x730 btrfs_create+0xab/0x1f0 lookup_open.isra.0+0x52d/0x690 path_openat+0x2a7/0x9e0 do_filp_open+0x75/0x100 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_lookup_inode+0x2a/0x8f __btrfs_update_delayed_inode+0x80/0x240 btrfs_commit_inode_delayed_inode+0x119/0x120 btrfs_evict_inode+0x357/0x500 evict+0xcf/0x1f0 do_unlinkat+0x1a9/0x2b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb8 check_noncircular+0x12d/0x150 __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa7/0x3d0 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x41/0x50 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because when we link in a block group with a new raid index type we'll create the corresponding sysfs entries for it. This is problematic because while restriping we're holding the chunk_mutex, and while mounting we're holding the tree locks. Fixing this isn't pretty: we move the call to the sysfs stuff into the btrfs_create_pending_block_groups() work, where we're not holding any locks. This creates a slight race where other threads could see that there's no sysfs kobj for that raid type, and race to create the sysfs dir. Fix this by wrapping the creation in space_info->lock, so we only get one thread calling kobject_add() for the new directory. We don't worry about the lock on cleanup as it only gets deleted on unmount. On mount it's more straightforward: we loop through the space_infos already, just check every raid index in each space_info and add the sysfs entries for the corresponding block groups. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
72804905 |
|
01-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: kill the RCU protection for fs_info->space_info We have this thing wrapped in an RCU lock, but it's really not needed. We create all the space_info's on mount, and we destroy them on unmount. The list never changes and we're protected from messing with it by the normal mount/umount path, so kill the RCU stuff around it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c448ce8 |
|
17-Aug-2020 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: make read_block_group_item return void Since its inclusion in 9afc66498a0b ("btrfs: block-group: refactor how we read one block group item") this function has always returned 0, so there is no need to check the returned value. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
99ffb43e |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: call btrfs_try_granting_tickets when reserving space If we have compression on we could free up more space than we reserved, and thus be able to make a space reservation. Add the call for this scenario. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3308234a |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: call btrfs_try_granting_tickets when freeing reserved bytes We were missing a call to btrfs_try_granting_tickets in btrfs_free_reserved_bytes, so add it to handle the case where we're able to satisfy an allocation because we've freed a pending reservation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
260db43c |
|
04-Aug-2020 |
Randy Dunlap <rdunlap@infradead.org> |
btrfs: delete duplicated words + other fixes in comments Delete repeated words in fs/btrfs/. {to, the, a, and old} and change "into 2 part" to "into 2 parts". Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3e39c72 |
|
21-Aug-2020 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: block-group: fix free-space bitmap threshold [BUG] After commit 9afc66498a0b ("btrfs: block-group: refactor how we read one block group item"), cache->length is being assigned after calling btrfs_create_block_group_cache. This causes a problem since set_free_space_tree_thresholds calculates the free-space threshold to decide if the free-space tree should convert from extents to bitmaps. The current code calls set_free_space_tree_thresholds with cache->length being 0, which then makes cache->bitmap_high_thresh zero. This implies the system will always use bitmaps instead of extents, which is not desired if the block group is not fragmented. This behavior can be seen by a test that expects to repair systems with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only created FREE_SPACE_BITMAP. [FIX] Call set_free_space_tree_thresholds after setting cache->length. There is now a WARN_ON in set_free_space_tree_thresholds to help prevent the same mistake from happening again in the future. Link: https://github.com/kdave/btrfs-progs/issues/251 Fixes: 9afc66498a0b ("btrfs: block-group: refactor how we read one block group item") CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
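A sketch of the corrected ordering at the call site, using the field and function names from the commit message (bgi here is an assumed local pointer to the on-disk block group item); the threshold math must see the real length rather than 0:

    cache = btrfs_create_block_group_cache(info, key->objectid);
    if (!cache)
            return -ENOMEM;

    cache->length = key->offset;    /* set the length first ... */
    cache->used = btrfs_stack_block_group_used(bgi);
    cache->flags = btrfs_stack_block_group_flags(bgi);

    /* ... and only then derive the extent/bitmap conversion thresholds;
     * the function now WARN_ON()s if cache->length is still 0. */
    set_free_space_tree_thresholds(cache);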
|
#
162e0a16 |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: if we're restriping, use the target restripe profile Previously we depended on some weird behavior in our chunk allocator to force the allocation of new stripes, so by the time we got to doing the reduce we would usually already have a chunk with the proper target. However that behavior causes other problems and needs to be removed. But first we need to remove this check that only restripes if we already have those profiles available, because if we're allocating our first chunk it obviously will not be available. Simply use the target as specified, and if that fails it'll be because we're out of space. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
349e120e |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't adjust bg flags and use default allocation profiles btrfs/061 has been failing consistently for me recently with a transaction abort. We run out of space in the system chunk array, which means we've allocated far more system chunks than we need. Chris added this a long time ago for balance as a poor man's restriping. If you had a single disk and then added another disk and then did a balance, update_block_group_flags would then figure out which RAID level you needed. Fast forward to today and we have restriping behavior, so we can explicitly tell the fs that we're trying to change the raid level. This is accomplished through the normal get_alloc_profile path. Furthermore this code actually causes btrfs/061 to fail, because we do things like mkfs -m dup -d single with multiple devices. This trips this check alloc_flags = update_block_group_flags(fs_info, cache->flags); if (alloc_flags != cache->flags) { ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE); in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but not actually restriping, we keep forcing chunk allocation of RAID1 chunks. This eventually causes us to run out of system space and the file system aborts and flips read only. We don't need this poor man's restriping anymore; simply use the normal get_alloc_profile helper, which will get the correct alloc_flags and thus make the right decision for chunk allocation. This keeps us from allocating a billion system chunks and falling over. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48aaeebe |
|
06-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: convert block group refcount to refcount_t We have refcount_t now with the associated library to handle refcounts, which gives us extra debugging around reference count mistakes that may be made. For example it'll warn on any transition from 0->1 or 0->-1, which is handy for noticing cases where we've messed up reference counting. Convert the block group ref counting from an atomic_t to refcount_t and use the appropriate helpers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
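The conversion in sketch form (simplified; the real put path tears down more state than a single kfree()):

    #include <linux/refcount.h>

    struct btrfs_block_group {
            /* ... */
            refcount_t refs;        /* was: atomic_t refs */
    };

    /* refcount_set(&cache->refs, 1) at allocation, then: */
    static void btrfs_get_block_group(struct btrfs_block_group *cache)
    {
            refcount_inc(&cache->refs);     /* warns on a 0 -> 1 transition */
    }

    static void btrfs_put_block_group(struct btrfs_block_group *cache)
    {
            if (refcount_dec_and_test(&cache->refs))    /* warns on underflow */
                    kfree(cache);
    }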
|
#
36ea6f3e |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_check_data_free_space take btrfs_inode Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f22f457a |
|
01-Jun-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove no longer necessary chunk mutex locking cases Initially when the 'removed' flag was added to a block group to avoid races between block group removal and fitrim, by commit 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation"), we had to lock the chunks mutex because we could be moving the block group from its current list, the pending chunks list, into the pinned chunks list, or we could just be adding it to the pinned chunks if it was not in the pending chunks list. Both lists were protected by the chunk mutex. However we no longer have those lists since commit 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree"), and locking the chunk mutex is no longer necessary because of that. The same happens at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block group's extent map could be part of the pinned chunks list and the call to remove_extent_mapping() could be deleting it from that list, which used to be protected by that mutex. So just remove those lock and unlock calls as they are not needed anymore. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3ba67a1 |
|
02-Jun-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out reading of bg from find_first_block_group When find_first_block_group() finds a block group item in the extent-tree, it does a lookup of the object in the extent mapping tree and does further checks on the item. Factor out this step from find_first_block_group() so we can further simplify the code. While we're at it, we can also just return early in find_first_block_group(), if the tree slot isn't found. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
89d7da9b |
|
02-Jun-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: get mapping tree directly from fsinfo in find_first_block_group We already have an fs_info in our function parameters, there's no need to do the maths again and get fs_info from the extent_root just to get the mapping_tree. Instead directly grab the mapping_tree from fs_info. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
96f9b0f2 |
|
03-Apr-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: simplify checks when adding excluded ranges Addresses held in the 'logical' array are always guaranteed to fall within the boundaries of the block group. That is, 'start' can never be smaller than cache->start. This invariant follows from the way the addresses are calculated in btrfs_rmap_block: stripe_nr = physical - map->stripes[i].physical; stripe_nr = div64_u64(stripe_nr, map->stripe_len); bytenr = chunk_start + stripe_nr * io_stripe_size; I.e. it's always some IO stripe within the given chunk. Exploit this invariant to simplify the body of the loop by removing the unnecessary 'if' since its 'else' branch is the one always executed. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9e22b925 |
|
03-Apr-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: read stripe len directly in btrfs_rmap_block extent_map::orig_block_len contains the size of a physical stripe when it's used to describe block groups (calculated in read_one_chunk via calc_stripe_length or calculated in decide_stripe_size and then assigned to extent_map::orig_block_len in create_chunk). Exploit this fact to get the size directly rather than opencoding the calculations. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ffcb9d44 |
|
01-Jun-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between block group removal and block group creation There is a race between block group removal and block group creation when the removal is completed by a task running fitrim or scrub. When this happens we end up failing the block group creation with an error -EEXIST since we attempt to insert a duplicate block group item key in the extent tree. That results in a transaction abort. The race happens like this: 1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes block group X with btrfs_freeze_block_group() (until very recently that was named btrfs_get_block_group_trimming()); 2) Task B starts removing block group X, either because it's now unused or due to relocation for example. So at btrfs_remove_block_group(), while holding the chunk mutex and the block group's lock, it sets the 'removed' flag of the block group and it sets the local variable 'remove_em' to false, because the block group is currently frozen (its 'frozen' counter is > 0, until very recently this counter was named 'trimming'); 3) Task B unlocks the block group and the chunk mutex; 4) Task A is done trimming the block group and unfreezes the block group by calling btrfs_unfreeze_block_group() (until very recently this was named btrfs_put_block_group_trimming()). In this function we lock the block group and set the local variable 'cleanup' to true because we were able to decrement the block group's 'frozen' counter down to 0 and the flag 'removed' is set in the block group. Since 'cleanup' is set to true, it locks the chunk mutex and removes the extent mapping representing the block group from the mapping tree; 5) Task C allocates a new block group Y and it picks up the logical address that block group X had as the logical address for Y, because X was the block group with the highest logical address and now the second block group with the highest logical address, the last in the fs mapping tree, ends at an offset corresponding to block group X's logical address (this logical address selection is done at volumes.c:find_next_chunk()). At this point the new block group Y does not yet have its item added to the extent tree (nor the corresponding device extent items and chunk item in the device and chunk trees). The new group Y is added to the list of pending block groups in the transaction handle; 6) Before task B proceeds to removing the block group item for block group X from the extent tree, which has a key matching: (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length) task C while ending its transaction handle calls btrfs_create_pending_block_groups(), which finds block group Y and tries to insert the block group item for Y into the extent tree, which fails with -EEXIST since the logical offset is the same that X had and task B hasn't yet deleted the key from the extent tree. This failure results in a transaction abort, producing a stack like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Code: ff ff ff 48 8b 55 50 f0 48 (...) 
RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001 RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001 R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10 R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000 FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs] btrfs_commit_transaction+0xd0/0xc50 [btrfs] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 iterate_supers+0xdb/0x180 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f2ce1d4d5b7 Code: 83 c4 08 48 3d 01 (...) RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7 RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00 R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032 R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a9c ]--- Fix this simply by making btrfs_remove_block_group() remove the block group's item from the extent tree before it flags the block group as removed. Also make the free space deletion from the free space tree happen before flagging the block group as removed, to avoid a similar race with adding and removing free space entries for the free space tree. Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
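In outline, the reordered removal path looks like this (a simplification of the description above, with error handling elided): the on-disk keys are deleted while the freeze still pins the logical address, and only then is the 'removed' flag published, which is what later allows an unfreeze to drop the extent mapping:

    /* 1. free space tree entries go first */
    ret = remove_block_group_free_space(trans, block_group);
    if (ret)
            goto out;

    /* 2. then the block group item itself */
    ret = remove_block_group_item(trans, path, block_group);
    if (ret)
            goto out;

    /* 3. only now may an unfreeze remove the extent mapping and free the
     *    logical address for reuse: no duplicate key can be created */
    spin_lock(&block_group->lock);
    block_group->removed = 1;
    spin_unlock(&block_group->lock);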
|
#
9fecd132 |
|
01-Jun-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix a block group ref counter leak after failure to remove block group When removing a block group, if we fail to delete the block group's item from the extent tree, we jump to the 'out' label and end up decrementing the block group's reference count once only (by 1), resulting in a counter leak because the block group at that point was already removed from the block group cache rbtree - so we have to decrement the reference count twice, once for the rbtree and once for our lookup at the start of the function. There is a second bug where if removing the free space tree entries (the call to remove_block_group_free_space()) fails we end up jumping to the 'out_put_group' label but end up decrementing the reference count only once, when we should have done it twice, since we have already removed the block group from the block group cache rbtree. This happens because the reference count decrement for the rbtree reference happens after attempting to remove the free space tree entries, which is far away from the place where we remove the block group from the rbtree. To make things less error prone, decrement the reference count for the rbtree immediately after removing the block group from it. This also eliminates the need for two different exit labels on error, renaming 'out_put_group' to just 'out' and removing the old 'out'. Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2998ebd |
|
11-May-2020 |
Tiezhu Yang <yangtiezhu@loongson.cn> |
btrfs: remove duplicated include in block-group.c disk-io.h is included more than once in block-group.c, remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3be4d8ef |
|
04-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: rename write_one_cache_group() The name of this function contains the word "cache", which is left from the times where btrfs_block_group was called btrfs_block_group_cache. Now this "cache" doesn't match anything, and we have better namings for functions like read/insert/remove_block_group_item(). Rename it to update_block_group_item(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97f4728a |
|
04-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: refactor how we insert a block group item Currently the block group item insert is pretty straightforward: fill the block group item structure and insert it into the extent tree. However the incoming skinny block group feature is going to change this, so this patch will refactor insertion into a new function, insert_block_group_item(), to make the incoming feature easier to add. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7357623a |
|
04-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: refactor how we delete one block group item When deleting a block group item, it's pretty straightforward: just delete the item pointed to by the key. However it will not be that straightforward for the incoming skinny block group item. So refactor the block group item deletion into a new function, remove_block_group_item(), also to make the already lengthy btrfs_remove_block_group() a little shorter. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9afc6649 |
|
04-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: refactor how we read one block group item Structure btrfs_block_group has the following members which are currently read from the on-disk block group item and key: - length - from item key - used - flags - from block group item However for the incoming skinny block group tree, we are going to read those members from different sources. This patch refactors the read as follows: - Don't initialize btrfs_block_group::length at allocation Callers should initialize it manually. Also to avoid possible (well, only two callers) missing initialization, add extra ASSERT() in btrfs_add_block_group_cache(). - Refactor length/used/flags initialization into one function The new function, fill_one_block_group(), will handle the initialization of such members. - Use btrfs_block_group::length to replace key::offset, since the skinny block group item will have a different meaning for its key offset. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
83fe9e12 |
|
04-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: don't set the wrong READA flag for btrfs_read_block_groups() Regular block group items in the extent tree are scattered inside the huge tree, thus forward readahead makes no sense. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
684b752b |
|
08-May-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move the block group freeze/unfreeze helpers into block-group.c The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group() used to be named btrfs_get_block_group_trimming() and btrfs_put_block_group_trimming() respectively. At the time they were added to free-space-cache.c, by commit e33e17ee1098 ("btrfs: add missing discards when unpinning extents with -o discard") because all the trimming related functions were in free-space-cache.c. Now that the helpers were renamed and are used in scrub context as well, move them to block-group.c, a much more logical location for them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b7304af |
|
08-May-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename member 'trimming' of block group to a more generic name Back in 2014, in commit 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation"), I added the 'trimming' member to the block group structure. Its purpose was to prevent races between trimming and block group deletion/allocation by pinning the block group in a way that prevents its logical address and device extents from being reused while trimming is in progress, to avoid problems if another task deletes the block group and yet another task then allocates a new block group that gets the same logical address and device extents while the trimming task is still in progress. After the previous fix for scrub (patch "btrfs: fix a race between scrub and block group removal/allocation"), scrub now also has the same needs that trimming has, so the member name 'trimming' no longer makes sense. Since there is already a 'pinned' member in the block group that refers to space reservations (pinned bytes), rename the member to 'frozen', add a comment on top of it to describe its general purpose and rename the helpers to increment and decrement the counter as well, to match the new member name. The next patch in the series will move the helpers into a more suitable file (from free-space-cache.c to block-group.c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
534cf531 |
|
17-Apr-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: simplify error handling of clean_pinned_extents() At clean_pinned_extents(), whether we end up returning success or failure, we pretty much have to do the same things: 1) unlock unused_bg_unpin_mutex 2) decrement reference count on the previous transaction We also call btrfs_dec_block_group_ro() in case of failure, but that is better done in its caller, btrfs_delete_unused_bgs(), since it's the caller that calls inc_block_group_ro(), so it should be responsible for the decrement operation, as it is in case any of the other functions it calls fail. So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents() into btrfs_delete_unused_bgs() and unify the error and success return paths for clean_pinned_extents(), reducing duplicated code and making it simpler. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f9fe614 |
|
13-Mar-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: improve global reserve stealing logic For unlink transactions and block group removal btrfs_start_transaction_fallback_global_rsv will first try to start an ordinary transaction and if it fails it will fall back to reserving the required amount by stealing from the global reserve. This is problematic because of all the same reasons we had with previous iterations of the ENOSPC handling: thundering herd. We get a bunch of failures all at once, everybody tries to allocate from the global reserve, some win and some lose, and we get an ENOSPC. Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's used to mark unlink reservations. To fix this we need to integrate this logic into the normal ENOSPC infrastructure. We still go through all of the normal flushing work, and at the moment we would begin to fail all the tickets, we try to satisfy any tickets that are allowed to steal by stealing from the global reserve. If this works we start the flushing system over again just like we would with a normal ticket satisfaction. This serializes our global reserve stealing, so we don't have the thundering herd problem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f6033c5e |
|
20-Apr-2020 |
Xiyu Yang <xiyuyang19@fudan.edu.cn> |
btrfs: fix block group leak when removing fails btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which returns a local reference of the block group that contains the given bytenr to "block_group" with increased refcount. When btrfs_remove_block_group() returns, "block_group" becomes invalid, so the refcount should be decreased to keep the refcount balanced. The reference counting issue happens in several exception handling paths of btrfs_remove_block_group(). When those error scenarios occur, such as btrfs_alloc_path() returning NULL, the function forgets to decrease the refcount increased by btrfs_lookup_block_group(), causing a refcount leak. Fix this issue by jumping to the "out_put_group" label and calling btrfs_put_block_group() when those error scenarios occur. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5150bf19 |
|
17-Apr-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix memory leak of transaction when deleting unused block group When cleaning pinned extents right before deleting an unused block group, we check if there's still a previous transaction running and if so we increment its reference count before using it for cleaning pinned ranges in its pinned extents iotree. However we ended up never decrementing the reference count after using the transaction, resulting in a memory leak. Fix it by decrementing the reference count. Fixes: fe119a6eeb6705 ("btrfs: switch to per-transaction pinned extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d611add4 |
|
07-Apr-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix reclaim counter leak of space_info objects Whenever we add a ticket to a space_info object we increment the object's reclaim_size counter with the ticket's bytes, and we decrement it with the corresponding amount only when we are able to grant the requested space to the ticket. When we are not able to grant the space to a ticket, or when the ticket is removed due to a signal (e.g. an application has received sigterm from the terminal) we never decrement the counter with the corresponding bytes from the ticket. This leak can result in the space reclaim code later doing much more work than necessary. So fix it by decrementing the counter when those two cases happen as well. Fixes: db161806dc5615 ("btrfs: account ticket size at add/delete time") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
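A sketch of the balanced accounting, close in shape to the helper the fix describes (simplified here): the counter shrinks on every path a ticket leaves the list, not just on successful grants.

    static void remove_ticket(struct btrfs_space_info *space_info,
                              struct reserve_ticket *ticket)
    {
            if (!list_empty(&ticket->list)) {
                    list_del_init(&ticket->list);
                    ASSERT(space_info->reclaim_size >= ticket->bytes);
                    space_info->reclaim_size -= ticket->bytes;
            }
    }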
|
#
29566c9c |
|
05-Mar-2020 |
Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> |
btrfs: add RCU locks around block group initialization The space_info list is normally RCU protected and should be traversed with rcu_read_lock held. There's a warning [29.104756] WARNING: suspicious RCU usage [29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted [29.105231] ----------------------------- [29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!! pointing out that the locking is missing in btrfs_read_block_groups. However this is not necessary as the list traversal happens at mount time when there's no other thread potentially accessing the list. To fix the warning and for consistency let's add the RCU lock/unlock, the code won't be affected much as it's doing some lightweight operations. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fe119a6e |
|
20-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: switch to per-transaction pinned extents This commit flips the switch to start tracking/processing pinned extents on a per-transaction basis. It mostly replaces all references from btrfs_fs_info::(pinned_extents|freed_extents[]) to btrfs_transaction::pinned_extents. Two notable modifications that warrant explicit mention are changing clean_pinned_extents to get a reference to the previously running transaction. The other one is removal of call to btrfs_destroy_pinned_extent since transactions are going to be cleaned in btrfs_cleanup_one_transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
45bb5d6a |
|
20-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Factor out pinned extent clean up in btrfs_delete_unused_bgs The next patch is going to refactor how pinned extents are tracked, which will necessitate changing this code. To ease that work and contain the changes, factor the code out now in preparation; this will also help review. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf31f87f |
|
05-Feb-2020 |
David Sterba <dsterba@suse.com> |
btrfs: add wrapper for transaction abort predicate The status of aborted transaction can change between calls and it needs to be accessed by READ_ONCE. Add a helper that also wraps the unlikely hint. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
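The helper is tiny; a sketch of its shape:

    /* Sketch of the predicate described above: one READ_ONCE so the status
     * cannot be re-fetched between checks, with the unlikely() hint folded in. */
    #define TRANS_ABORTED(trans)    (unlikely(READ_ONCE((trans)->aborted)))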
|
#
d8e6fd5c |
|
20-Mar-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix removal of raid[56|1c34] incompat flags after removing block group We are incorrectly dropping the raid56 and raid1c34 incompat flags when there are still raid56 and raid1c34 block groups, not when we no longer have any of those. The logic just got unintentionally broken after adding the support for the raid1c34 modes. Fix this by clearing the flags only if we do not have block groups with the respective profiles. Fixes: 9c907446dce3 ("btrfs: drop incompat bit for raid1c34 after last block group is gone") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a30a3d20 |
|
17-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: take overcommit into account in inc_block_group_ro inc_block_group_ro does a calculation to see if we would have enough room left over if we marked this block group as read only, in order to decide whether it's ok to do so. The problem is this calculation _only_ works for data, where our used is always less than our total. For metadata we will overcommit, so this will almost always fail for metadata. Fix this by exporting btrfs_can_overcommit, and then seeing if we have enough space to remove the remaining free space in the block group we are trying to mark read only. If we do then we can mark this block group as read only. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a7a63acc |
|
17-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix force usage in inc_block_group_ro For some reason we've translated the do_chunk_alloc that goes into btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are two different things. force for inc_block_group_ro is used when we are forcing the block group read only no matter what, for example when the underlying chunk is marked read only. We need to not do the space check here as this block group needs to be read only. btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that we need to pre-allocate a chunk before marking the block group read only. This has nothing to do with forcing, and in fact we _always_ want to do the space check in this case, so unconditionally pass false for force in this case. Then fixup inc_block_group_ro to honor force as it's expected and documented to do. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1776ad17 |
|
19-Nov-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Refactor btrfs_rmap_block to improve readability Move variables to appropriate scope. Remove last BUG_ON in the function and rework error handling accordingly. Make the duplicate detection code more straightforward. Use in_range macro. And give variables more descriptive names by explicitly distinguishing between IO stripe size (size recorded in the chunk item) and data stripe size (the size of an actual stripe, constituting a logical chunk/block group). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
96a14336 |
|
10-Dec-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Move and unexport btrfs_rmap_block It's used only during initial block group reading to map physical address of super block to a list of logical ones. Make it private to block-group.c, add proper kernel doc and ensure it's exported only for tests. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ef0a82da |
|
02-Jan-2020 |
Johannes Thumshirn <jth@kernel.org> |
btrfs: remove unnecessary wrapper get_alloc_profile btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile(). The only difference is btrfs_get_alloc_profile() is visible to other functions in btrfs while get_alloc_profile() is static and thus only visible to functions in block-group.c. Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to get rid of the unnecessary second function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6e80d4f8 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: handle empty block_group removal for async discard block_group removal is a little tricky. It can race with the extent allocator, the cleaner thread, and balancing. The current path is for a block_group to be added to the unused_bgs list. Then, when the cleaner thread comes around, it starts a transaction and then proceeds with removing the block_group. Extents that are pinned are subsequently removed from the pinned trees and then eventually a discard is issued for the entire block_group. Async discard introduces another player into the game, the discard workqueue. While it has none of the racing issues, the new problem is ensuring we don't leave free space untrimmed prior to forgetting the block_group. This is handled by placing fully free block_groups on a separate discard queue. This is necessary to maintain discarding order as in the future we will slowly trim even fully free block_groups. The ordering helps us make progress on the same block_group rather than say the last fully freed block_group or needing to search through the fully freed block groups at the beginning of a list and insert after. The new order of events is that a fully freed block group gets placed on the unused discard queue first. Once it's processed, it will be placed on the unused_bgs list and then the original sequence of events will happen, just without the final whole block_group discard. The mount flags can change when processing unused_bgs, so when flipping from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt free block groups on the discard_list to the unused_bg queue which will do the final discard for us. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b0643e59 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: add the beginning of async discard, discard workqueue When discard is enabled, every time a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations. This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to an LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series. For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one and the same. On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache. discard=sync:

              p99 total per minute   p99 total per minute
    Drive   | extent_commit() (ms) | commit_trans() (ms)
    ---------------------------------------------------------------
    Drive A |          434         |        1170
    Drive B |          880         |        2330
    Drive C |         2943         |        3920
    Drive D |         4763         |        5701

discard=async:

              p99 total per minute   p99 total per minute
    Drive   | extent_commit() (ms) | commit_trans() (ms)
    --------------------------------------------------------------
    Drive A |          134         |         956
    Drive B |           64         |        1972
    Drive C |           59         |        1032
    Drive D |           62         |        1200

While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and the delta between the two is substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
46b27f50 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: rename DISCARD mount option to DISCARD_SYNC This series introduces async discard which will use the flag DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is synchronously done in transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f8935566 |
|
26-Nov-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: kill min_allocable_bytes in inc_block_group_ro This is a relic from a time before we had a proper reservation mechanism and you could end up with really full chunks at chunk allocation time. This doesn't make sense anymore, so just kill it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b12de528 |
|
14-Nov-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: scrub: Don't check free space before marking a block group RO [BUG] When running btrfs/072 with only one online CPU, it has a pretty high chance to fail: btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg) - output mismatch (see xfstests-dev/results//btrfs/072.out.bad) --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800 +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800 @@ -1,2 +1,3 @@ QA output created by 072 Silence is golden +Scrub find errors in "-m dup -d single" test ... And with the following call trace: BTRFS info (device dm-5): scrub: started on devid 1 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -27) WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] Call Trace: __btrfs_end_transaction+0xdb/0x310 [btrfs] btrfs_end_transaction+0x10/0x20 [btrfs] btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs] scrub_enumerate_chunks+0x264/0x940 [btrfs] btrfs_scrub_dev+0x45c/0x8f0 [btrfs] btrfs_ioctl+0x31a1/0x3fb0 [btrfs] do_vfs_ioctl+0x636/0xaa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x79/0xe0 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace 166c865cec7688e7 ]--- [CAUSE] The error number -27 is -EFBIG, returned from the following call chain: btrfs_end_transaction() |- __btrfs_end_transaction() |- btrfs_create_pending_block_groups() |- btrfs_finish_chunk_alloc() |- btrfs_add_system_chunk() This happens because we have used up all space of btrfs_super_block::sys_chunk_array. The root cause is that we have the following bad loop of creating tons of system chunks: 1. The only SYSTEM chunk is being scrubbed It's very common to have only one SYSTEM chunk. 2. New SYSTEM bg will be allocated As btrfs_inc_block_group_ro() will check if we have enough space after marking the current bg RO. If not, then allocate a new chunk. 3. New SYSTEM bg is still empty, will be reclaimed During the reclaim, we will mark it RO again. 4. That newly allocated empty SYSTEM bg gets scrubbed We go back to step 2, as the bg is already marked RO but still not cleaned up yet. If the cleaner kthread doesn't get executed fast enough (e.g. only one CPU), then we will get more and more empty SYSTEM chunks, using up all the space of btrfs_super_block::sys_chunk_array. [FIX] Since scrub/dev-replace doesn't always need to allocate new extents, especially chunk tree extents, we don't really need to do chunk pre-allocation. To break the above spiral, introduce a new parameter to btrfs_inc_block_group_ro(), @do_chunk_alloc, which indicates whether we need extra chunk pre-allocation. For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass @do_chunk_alloc=false. This should keep unnecessary empty chunks from popping up for scrub. Also, since there are two parameters for btrfs_inc_block_group_ro(), add more comments for it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
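The resulting calling convention, sketched from the commit message:

    /* relocation/balance: pre-allocate a replacement chunk if needed */
    ret = btrfs_inc_block_group_ro(cache, true);

    /* scrub/dev-replace: never force a chunk allocation, avoiding the
     * empty-SYSTEM-chunk spiral described above */
    ret = btrfs_inc_block_group_ro(cache, false);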
|
#
32da5386 |
|
29-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d49a2ddb |
|
04-Nov-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: Reuse the item key from caller of read_one_block_group() For read_one_block_group(), its only caller has already got the item key to search the next block group item. So we can use that key directly without doing our own conversion on the stack. Also, since that key used in btrfs_read_block_groups() is vital for block group item search, add the 'const' keyword to that parameter to prevent read_one_block_group() from modifying it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ffb9e0f0 |
|
09-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: Refactor btrfs_read_block_groups() Refactor the work inside the loop of btrfs_read_block_groups() into one separate function, read_one_block_group(). This allows read_one_block_group to be reused for the later BG_TREE feature. The refactor does the following extra fix: - Use btrfs_fs_incompat() to replace open-coded feature check Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c907446 |
|
31-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: drop incompat bit for raid1c34 after last block group is gone When there are no raid1c3 or raid1c4 block groups left after balance (either convert or with other filters applied), remove the incompat bit. This is already done for RAID56, do the same for RAID1C34. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b3470b5d |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add dedicated members for start and length of a block group The on-disk format of the block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes things harder to understand. The key is also packed so the offset/length is not properly aligned as u64. Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
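Sketched below, with a hypothetical helper name (block_group_key) for the local key construction the commit describes:

    struct btrfs_block_group_cache {
            u64 start;      /* was key.objectid */
            u64 length;     /* was key.offset */
            /* ... no embedded struct btrfs_key anymore ... */
    };

    /* built on the stack only when the item is searched or written */
    static void block_group_key(const struct btrfs_block_group_cache *bg,
                                struct btrfs_key *key)
    {
            key->objectid = bg->start;
            key->type = BTRFS_BLOCK_GROUP_ITEM_KEY;
            key->offset = bg->length;
    }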
|
#
de0dc456 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename block_group_item on-stack accessors to follow naming All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the name, the block group ones were not following that scheme, so let's switch them. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d976388 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove embedded block_group_cache::item The members ::used and ::flags are now in the block group cache structure, the last one is chunk_objectid, but that's set to a fixed value and otherwise unused. The item is constructed from a local variable before write, so we can remove the embedded one from block group. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f93c63e5 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move block_group_item::flags to block group The flags are read from the item that's embedded to block group struct, but the item will be removed. Use the ::flags after read and before write. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf38be65 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move block_group_item::used to block group For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor helper. Let's unify it and make it a regular member and only update the item before writing it to the tree. The item is still being used for flags and chunk_objectid, there's some duplication until the item is removed in following patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a60adce8 |
|
24-Sep-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_block_group_cache_done in update_block_group When freeing extents in a block group we check to see if the block group is not cached, and then cache it if we need to. However we'll just carry on as long as we're loading the cache. This is problematic because we are dirtying the block group here. If we are fast enough we could do a transaction commit and clear the free space cache while we're still loading the space cache in another thread. This truncates the free space inode, which will keep it from loading the space cache. Fix this by using the btrfs_block_group_cache_done helper so that we try to load the space cache unconditionally here, which will result in the caller waiting for the fast caching to complete and keep us from truncating the free space inode. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a9143bd3 |
|
07-Oct-2019 |
Marcos Paulo de Souza <marcos.souza.org@gmail.com> |
btrfs: block-group: Rework documentation of check_system_chunk function Commit 4617ea3a52cf ("Btrfs: fix necessary chunk tree space calculation when allocating a chunk") removed the is_allocation argument from check_system_chunk, since the formula for reserving the necessary space for allocation or removing a chunk would be the same. So, rework the comment by removing the mention of the is_allocation argument. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0cac0ec |
|
16-Sep-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of unique workqueue helper functions Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4b654acd |
|
09-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group() In btrfs_read_block_groups(), if we have an invalid block group which has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS feature, we error out without freeing the block group cache. This patch will add the missing btrfs_put_block_group() to prevent memory leak. Note for stable backports: the file to patch in versions <= 5.3 is fs/btrfs/extent-tree.c Fixes: 49303381f19a ("Btrfs: bail out if block group has different mixed flag") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a43c3835 |
|
22-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add space reservation tracepoint for reserved bytes I noticed when folding the trace_btrfs_space_reservation() tracepoint into the btrfs_space_info_update_* helpers that we didn't emit a tracepoint when doing btrfs_add_reserved_bytes(). I know this is because we were swapping bytes_may_use for bytes_reserved, so in my mind there was no reason to have the tracepoint there. But now there is because we always emit the unreserve for the bytes_may_use side, and this would have broken if compression was on anyway. Add a tracepoint to cover the bytes_reserved counter so the math still comes out right. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f3e75e38 |
|
22-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: roll tracepoint into btrfs_space_info_update helper We duplicate this tracepoint everywhere we call these helpers, so update the helper to have the tracepoint as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
784352fe |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move math functions to misc.h Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2bd36e7b |
|
22-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename the btrfs_calc_*_metadata_size helpers btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However truncate _and_ changing will never split nodes, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely for inserting items, so rename this to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
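The renamed helpers, roughly as they read in the tree (shape only; treat the exact expressions as illustrative): an insert can split a node at every level, hence the factor of two.

    static inline u64 btrfs_calc_insert_metadata_size(struct btrfs_fs_info *fs_info,
                                                      unsigned num_items)
    {
            /* worst case: a split at every level, for every item */
            return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
    }

    static inline u64 btrfs_calc_metadata_size(struct btrfs_fs_info *fs_info,
                                               unsigned num_items)
    {
            /* no splits: plain per-item footprint across the levels */
            return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
    }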
|
#
e11c0406 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: unexport the temporary exported functions These were renamed and exported to facilitate logical migration of different code chunks into block-group.c. Now that all the users are in one file go ahead and rename them back, move the code around, and make them static. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3e43c279 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group cleanup code This can now be easily migrated as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh on top of sysfs cleanups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
878d7b67 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the alloc_profile helpers These feel more at home in block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, adjust btrfs_get_alloc_profile exports ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07730d87 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the chunk allocation code This feels more at home in block-group.c than in extent-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
606d1bf1 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group space accounting helpers We can now easily migrate this code as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
77745c05 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the dirty bg writeout code This can be easily migrated over now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
26ce2095 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate inc/dec_block_group_ro code This can easily be moved now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4358d963 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group read/creation code All of the prep work has been done so we can now cleanly move this chunk over. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, add btrfs_get_alloc_profile export, comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3e0520b |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group removal code This is the removal code and the unused bgs code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, move clear_incompat_bg_bits ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f21246d |
|
06-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group caching code We can now just copy it over to block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3eeb3226 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate nocow and reservation helpers These are relatively straightforward as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cad1284 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group ref counting stuff Another easy set to move over to block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2e405ad8 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group lookup code Move these bits first as they are the easiest to move. Export two of the helpers so they can be moved all at once. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|