#
ef5a05c5 |
|
24-Feb-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
btrfs: remove SLAB_MEM_SPREAD flag use The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
66ce5447 |
|
20-Feb-2024 |
Kunwu Chan <chentao@kylinos.cn> |
btrfs: use KMEM_CACHE() to create btrfs_path cache Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify the creation of SLAB caches when the default values are used. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5378ea6e |
|
24-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: unify handling of return values of btrfs_insert_empty_items() The error values returned by btrfs_insert_empty_items() are following the common patter of 0/-errno, but some callers check for a value > 0, which can't happen. Document that and update calls to not expect positive values. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
84cda1a6 |
|
04-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: cache folio size and shift in extent_buffer After the conversion to folio interfaces (but without the patch to enable larger folio allocation), there is an LTP report about observable performance drop on metadata heavy operations. https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/ This drop is caused by the extra code of calculating the folio_size()/folio_shift(), instead of the old hard coded PAGE_SIZE/PAGE_SHIFT. To slightly reduce the overhead, just cache both folio_size and folio_shift in extent_buffer. The two new members (u32 folio_size and u8 folio_shift) are stored inside the holes of extent_buffer. folio_size is shared with len, which is reduced to u32. The size of eb does not change. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8d993618 |
|
11-Dec-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios These two functions are still using the old page based code, which is not going to handle larger folios at all. The migration itself is going to involve the following changes: - PAGE_SIZE -> folio_size() - PAGE_SHIFT -> folio_shift() - get_eb_page_index() -> get_eb_folio_index() - get_eb_offset_in_page() -> get_eb_offset_in_folio() And since we're going to support larger folios, although above straight conversion is good enough, this patch would add extra comments in the involved functions to explain why the same single line code can now cover 3 cases: - folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE The common, non-subpage case with per-page folio. - folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE The incoming larger folio, non-subpage case. - folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE The existing subpage case, we won't larger folio anyway. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
082d5bb9 |
|
06-Dec-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: migrate extent_buffer::pages[] to folio For now extent_buffer::pages[] are still only accepting single page pointer, thus we can migrate to folios pretty easily. As for single page, page and folio are 1:1 mapped, including their page flags. This patch would just do the conversion from struct page to struct folio, providing the first step to higher order folio in the future. This conversion is pretty simple: - extent_buffer::pages[] -> extent_buffer::folios[] - page_address(eb->pages[i]) -> folio_address(eb->pages[i]) - eb->pages[i] -> folio_page(eb->folios[i], 0) There would be more specific cleanups preparing for the incoming higher order folio support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
80d197fe |
|
19-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make the logic from btrfs_block_can_be_shared() easier to read The logic in btrfs_block_can_be_shared() is hard to follow as we have a lot of conditions in a single if statement including a subexpression with a logical or and two nested if statements inside the main if statement. Make this easier to read by using separate if statements that return immediately when we find a condition that determines if a block can be or can not be shared. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6e5de50f |
|
19-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use bool for return type of btrfs_block_can_be_shared() Currently btrfs_block_can_be_shared() returns an int that is used as a boolean. Since it all it needs is to return true or false, and it can't return errors for example, change the return type from int to bool to make it a bit more readable and obvious. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8ba2a91 |
|
13-Oct-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: get correct owning_root when dropping snapshot Dave reported a bug where we were aborting the transaction while trying to cleanup the squota reservation for an extent. This turned out to be because we're doing btrfs_header_owner(next) in do_walk_down when we decide to free the block. However in this code block we haven't explicitly read next, so it could be stale. We would then get whatever garbage happened to be in the pages at this point. The commit that introduced that is "btrfs: track owning root in btrfs_ref". Fix this by saving the owner_root when we do the btrfs_lookup_extent_info(). We always do this in do_walk_down, it is how we make the decision of whether or not to delete the block. This is cheap because we've already done the extent item lookup at this point, so it's straightforward to just grab the owner root as well. Then we can use this when deleting the metadata block without needing to force a read of the extent buffer to find the owner. This fixes the problem that Dave reported. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6422b4cd |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move btrfs_realloc_node() from ctree.c into defrag.c btrfs_realloc_node() is only used by the defrag code. Nowadays we have a defrag.c file, so move it, and its helper close_blocks(), into defrag.c. During the move also do a few minor cosmetic changes: 1) Change the return value of close_blocks() from int to bool; 2) Use SZ_32K instead of 32768 at close_blocks(); 3) Make some variables const in btrfs_realloc_node(), 'blocksize' and 'end_slot'; 4) Get rid of 'parent_nritems' variable, in both places where it was used it could be replaced by calling btrfs_header_nritems(parent); 5) Change the type of a couple variables from int to bool; 6) Rename variable 'err' to 'ret', as that's the most common name we use to track the return value of a function; 7) Move some variables from the top scope to the scope of the for loop where they are used. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
79d25df0 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: export comp_keys() from ctree.c as btrfs_comp_keys() Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a later patch we can move out defrag specific code from ctree.c into defrag.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95f93bc4 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename and export __btrfs_cow_block() Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is to allow to move defrag specific code out of ctree.c and into defrag.c in one of the next patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b8bf4e4d |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use round_down() to align block offset at btrfs_cow_block() At btrfs_cow_block() we can use round_down() to align the extent buffer's logical offset to the start offset of a metadata block group, instead of the less easy to read set of bitwise operations (two plus one subtraction). So replace the bitwise operations with a round_down() call. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7bff16e3 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove noinline attribute from btrfs_cow_block() It's pointless to have the noiline attribute for btrfs_cow_block(), as the function is exported and widely used. So remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
60ea105a |
|
21-Jun-2023 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: track metadata relocation COW with simple quota Relocation COWs metadata blocks in two cases for the reloc root: - copying the subvolume root item when creating the reloc root - copying a btree node when there is a COW during relocation In both cases, the resulting btree node hits an abnormal code path with respect to the owner field in its btrfs_header. It first creates the root item for the new objectid, which populates the reloc root id, and it at this point that delayed refs are created. Later, it fully copies the old node into the new node (including the original owner field) which overwrites it. This results in a simple quotas mismatch where we run the delayed ref for the reloc root which has no simple quota effect (reloc root is not an fstree) but when we ultimately delete the node, the owner is the real original fstree and we do free the space. To work around this without tampering with the behavior of relocation, add a parameter to btrfs_add_tree_block that lets the relocation code path specify a different owning root than the "operating" root (in this case, owning root is the real root and the operating root is the reloc root). These can naturally be plumbed into delayed refs that have the same concept. Note that this is a double count in some sense, but a relatively natural one, as there are really two extents, and the old one will be deleted soon. This is consistent with how data relocation extents are accounted by simple quotas. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50564b65 |
|
12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed164802 |
|
08-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: rename errno identifiers to error We sync the kernel files to userspace and the 'errno' symbol is defined by standard library, which does not matter in kernel but the parameters or local variables could clash. Rename them all. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02cd00fa |
|
07-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: reduce arguments of helpers space accounting root item There are two helpers to increase used bytes of root items that add or subtract one node size, we don't need to pass the argument for that. Rename the function so it matches the root item member that gets changed. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9580503b |
|
07-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: reformat remaining kdoc style comments Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eb96e221 |
|
19-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix unwritten extent buffer after snapshotting a new subvolume When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent buffer that is referenced by the snapshot, resulting in IO errors due to checksum failures when trying to read the extent buffer later from disk. A sequence of steps that leads to this is the following: 1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical address 36007936, for the leaf/root of a new subvolume that has an ID of 291. We mark the extent buffer as dirty, and at this point the subvolume tree has a single node/leaf which is also its root (level 0); 2) We no longer commit the transaction used to create the subvolume at create_subvol(). We used to, but that was recently removed in commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create"); 3) The transaction used to create the subvolume has an ID of 33, so the extent buffer 36007936 has a generation of 33; 4) Several updates happen to subvolume 291 during transaction 33, several files created and its tree height changes from 0 to 1, so we end up with a new root at level 1 and the extent buffer 36007936 is now a leaf of that new root node, which is extent buffer 36048896. The commit root remains as 36007936, since we are still at transaction 33; 5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and we end up at transaction.c:create_pending_snapshot(), in the critical section of a transaction commit. There we COW the root of subvolume 291, which is extent buffer 36048896. The COW operation returns extent buffer 36048896, since there's no need to COW because the extent buffer was created in this transaction and it was not written yet. The we call btrfs_copy_root() against the root node 36048896. During this operation we allocate a new extent buffer to turn into the root node of the snapshot, copy the contents of the root node 36048896 into this snapshot root extent buffer, set the owner to 292 (the ID of the snapshot), etc, and then we call btrfs_inc_ref(). This will create a delayed reference for each leaf pointed by the root node with a reference root of 292 - this includes a reference for the leaf 36007936. After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state. Then we call btrfs_insert_dir_item(), to create the directory entry in in the tree of subvolume 291 that points to the snapshot. This ends up needing to modify leaf 36007936 to insert the respective directory items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state, we need to COW the leaf. We end up at btrfs_force_cow_block() and then at update_ref_for_cow(). At update_ref_for_cow() we call btrfs_block_can_be_shared() which returns false, despite the fact the leaf 36007936 is shared - the subvolume's root and the snapshot's root point to that leaf. The reason that it incorrectly returns false is because the commit root of the subvolume is extent buffer 36007936 - it was the initial root of the subvolume when we created it. So btrfs_block_can_be_shared() which has the following logic: int btrfs_block_can_be_shared(struct btrfs_root *root, struct extent_buffer *buf) { if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && buf != root->node && buf != root->commit_root && (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1; return 0; } Returns false (0) since 'buf' (extent buffer 36007936) matches the root's commit root. As a result, at update_ref_for_cow(), we don't check for the number of references for extent buffer 36007936, we just assume it's not shared and therefore that it has only 1 reference, so we set the local variable 'refs' to 1. Later on, in the final if-else statement at update_ref_for_cow(): static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, struct extent_buffer *cow, int *last_ref) { (...) if (refs > 1) { (...) } else { (...) btrfs_clear_buffer_dirty(trans, buf); *last_ref = 1; } } So we mark the extent buffer 36007936 as not dirty, and as a result we don't write it to disk later in the transaction commit, despite the fact that the snapshot's root points to it. Attempting to access the leaf or dumping the tree for example shows that the extent buffer was not written: $ btrfs inspect-internal dump-tree -t 292 /dev/sdb btrfs-progs v6.2.2 file tree key (292 ROOT_ITEM 33) node 36110336 level 1 items 2 free space 119 generation 33 owner 292 node 36110336 flags 0x1(WRITTEN) backref revision 1 checksum stored a8103e3e checksum calced a8103e3e fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07 key (256 INODE_ITEM 0) block 36007936 gen 33 key (257 EXTENT_DATA 0) block 36052992 gen 33 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 total bytes 107374182400 bytes used 38572032 uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 The respective on disk region is full of zeroes as the device was trimmed at mkfs time. Obviously 'btrfs check' also detects and complains about this: $ btrfs check /dev/sdb Opening filesystem to check... Checking filesystem on /dev/sdb UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 generation: 33 (33) [1/7] checking root items [2/7] checking extents checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 owner ref check failed [36007936 4096] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 The following tree block(s) is corrupted in tree 292: tree block bytenr: 36110336, level: 1, node key: (256, 1, 0) root 292 root dir 256 not found ERROR: errors found in fs roots found 38572032 bytes used, error(s) found total csum bytes: 16048 total tree bytes: 1265664 total fs tree bytes: 1118208 total extent tree bytes: 65536 btree space waste bytes: 562598 file data blocks allocated: 65978368 referenced 36569088 Fix this by updating btrfs_block_can_be_shared() to consider that an extent buffer may be shared if it matches the commit root and if its generation matches the current transaction's generation. This can be reproduced with the following script: $ cat test.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi # Use a filesystem with a 64K node size so that we have the same node # size on every machine regardless of its page size (on x86_64 default # node size is 16K due to the 4K page size, while on PPC it's 64K by # default). This way we can make sure we are able to create a btree for # the subvolume with a height of 2. mkfs.btrfs -f -n 64K $DEV mount $DEV $MNT btrfs subvolume create $MNT/subvol # Create a few empty files on the subvolume, this bumps its btree # height to 2 (root node at level 1 and 2 leaves). for ((i = 1; i <= 300; i++)); do echo -n > $MNT/subvol/file_$i done btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap umount $DEV btrfs check $DEV Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment): $ ./test.sh Create subvolume '/mnt/sdi/subvol' Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap' Opening filesystem to check... Checking filesystem on /dev/sdi UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1 [1/7] checking root items [2/7] checking extents parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure owner ref check failed [30539776 65536] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0) Wrong generation of child node/leaf, wanted: 5, have: 7 root 257 root dir 256 not found ERROR: errors found in fs roots found 917504 bytes used, error(s) found total csum bytes: 0 total tree bytes: 851968 total fs tree bytes: 393216 total extent tree bytes: 65536 btree space waste bytes: 736550 file data blocks allocated: 0 referenced 0 A test case for fstests will follow soon. Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create") CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e36f9491 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: error out when reallocating block for defrag using a stale transaction At btrfs_realloc_node() we have these checks to verify we are not using a stale transaction (a past transaction with an unblocked state or higher), and the only thing we do is to trigger two WARN_ON(). This however is a critical problem, highly unexpected and if it happens it's most likely due to a bug, so we should error out and turn the fs into error state so that such issue is much more easily noticed if it's triggered. The problem is critical because in btrfs_realloc_node() we COW tree blocks, and using such stale transaction will lead to not persisting the extent buffers used for the COW operations, as allocating tree block adds the range of the respective extent buffers to the ->dirty_pages iotree of the transaction, and a stale transaction, in the unlocked state or higher, will not flush dirty extent buffers anymore, therefore resulting in not persisting the tree block and resource leaks (not cleaning the dirty_pages iotree for example). So do the following changes: 1) Return -EUCLEAN if we find a stale transaction; 2) Turn the fs into error state, with error -EUCLEAN, so that no transaction can be committed, and generate a stack trace; 3) Combine both conditions into a single if statement, as both are related and have the same error message; 4) Mark the check as unlikely, since this is not expected to ever happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a2caab29 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: error when COWing block from a root that is being deleted At btrfs_cow_block() we check if the block being COWed belongs to a root that is being deleted and if so we log an error message. However this is an unexpected case and it indicates a bug somewhere, so we should return an error and abort the transaction. So change this in the following ways: 1) Abort the transaction with -EUCLEAN, so that if the issue ever happens it can easily be noticed; 2) Change the logged message level from error to critical, and change the message itself to print the block's logical address and the ID of the root; 3) Return -EUCLEAN to the caller; 4) As this is an unexpected scenario, that should never happen, mark the check as unlikely, allowing the compiler to potentially generate better code. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48774f3b |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: error out when COWing block using a stale transaction At btrfs_cow_block() we have these checks to verify we are not using a stale transaction (a past transaction with an unblocked state or higher), and the only thing we do is to trigger a WARN with a message and a stack trace. This however is a critical problem, highly unexpected and if it happens it's most likely due to a bug, so we should error out and turn the fs into error state so that such issue is much more easily noticed if it's triggered. The problem is critical because using such stale transaction will lead to not persisting the extent buffer used for the COW operation, as allocating a tree block adds the range of the respective extent buffer to the ->dirty_pages iotree of the transaction, and a stale transaction, in the unlocked state or higher, will not flush dirty extent buffers anymore, therefore resulting in not persisting the tree block and resource leaks (not cleaning the dirty_pages iotree for example). So do the following changes: 1) Return -EUCLEAN if we find a stale transaction; 2) Turn the fs into error state, with error -EUCLEAN, so that no transaction can be committed, and generate a stack trace; 3) Combine both conditions into a single if statement, as both are related and have the same error message; 4) Mark the check as unlikely, since this is not expected to ever happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7569141e |
|
09-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: replace BUG_ON() at split_item() with proper error handling There's no need to BUG_ON() at split_item() if the leaf does not have enough free space for the new item, we can just return -ENOSPC since the caller can deal with errors from split_item(). Also, as this is a very unlikely condition to happen, because the caller has invoked setup_leaf_for_split() before calling split_item(), surround the condition with a WARN_ON() which makes it easier to notice this unexpected condition and tags the if branch with 'unlikely' as well. I've actually once hit this BUG_ON() with some incorrect code changes I had, which was very inconvenient as it required rebooting the VM. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
751a2761 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr() At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. This implies btrfs_del_ptr() return an int instead of void, and making all callers check for returned errors. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50b5d1fc |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at insert_ptr() At insert_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. This implies making insert_ptr() return an int instead of void, and making all callers check for returned errors. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f61aa7ba |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failure at insert_new_root() At insert_new_root(), instead of doing a BUG_ON() in case we fail to record the tree mod log operation, just return the error to the callers after releasing the allocated tree block. At this point we haven't made any changes to the b+tree, so we haven't left it in an inconsistent state and therefore have no need to abort the transaction. All we need to do is to unlock and free the extent buffer we just allocated with the purpose of making it the new root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
11d6ae03 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert() At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the caller. There's really no need for the BUG_ON() as we can release all resources in this context, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eced687e |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction at update_ref_for_cow() when ref count is zero At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find that the extent buffer has an unexpected ref count of zero, however we can simply use btrfs_abort_transaction(), which achieves the same purposes: to turn the fs to error state, abort the current transaction and turn the fs to RO mode as well. Besides that, btrfs_abort_transaction() also prints a stack trace which makes it more useful. Also, as this is a very unexpected situation, indicating a serious corruption/inconsistency, tag the if branch as 'unlikely', set the error code to -EUCLEAN instead of -EROFS, and log an explicit message. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
725026ed |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction at balance_level() when left child is missing At balance_level() we are calling btrfs_handle_fs_error() when the middle child only has 1 item and the left child is missing, however we can simply use btrfs_abort_transaction(), which achieves the same purposes: to turn the fs to error state, abort the current transaction and turn the fs to RO mode. Besides that, btrfs_abort_transaction() also prints a stack trace which makes it more useful. Also, as this is a highly unexpected case and it's about a b+tree inconsistency, change the error code from -EROFS to -EUCLEAN, tag the if branch as 'unlikely' and log an explicit error message. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
87b8e9d0 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid unnecessarily setting the fs to RO and error state at balance_level() At balance_level(), when trying to promote a child node to a root node, if we fail to read the child we call btrfs_handle_fs_error(), which turns the fs to RO mode and sets it to error state as well, causing any ongoing transaction to abort. This however is not necessary because at that point we have not made any change yet at balance_level(), so any error reading the child node does not leaves us in any inconsistent state. Therefore we can just return the error to the caller. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
daefe4d4 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename enospc label to out at balance_level() At balance_level() we have this 'enospc' label where we jump to in case we get an error at several places. However that error is certainly not -ENOSPC in call cases, it can be -EIO or -ENOMEM when reading a child extent buffer for example, or -ENOMEM when trying to record tree mod log operations. So to make this less confusing, rename the label to 'out'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39020d8a |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failure at balance_level() At balance_level(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in this context, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
40b0a749 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block() At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to record a tree mod log root insertion operation, do a transaction abort instead. There's really no need for the BUG_ON(), we can properly release all resources in this context and turn the filesystem to RO mode and in an error state instead. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ede600e4 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix extent buffer leak after tree mod log failure at split_node() At split_node(), if we fail to log the tree mod log copy operation, we return without unlocking the split extent buffer we just allocated and without decrementing the reference we own on it. Fix this by unlocking it and decrementing the ref count before returning. Fixes: 5de865eebb83 ("Btrfs: fix tree mod logging") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d09c5152 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add missing error handling when logging operation while COWing extent buffer When COWing an extent buffer that is not the root node, we need to log in the tree mod log that we replaced a pointer in the parent node, otherwise a tree mod log user doing a search on the b+tree can return incorrect results (that miss something). We are doing the call to btrfs_tree_mod_log_insert_key() but we totally ignore its return value. So fix this by adding the missing error handling, resulting in a transaction abort and freeing the COWed extent buffer. Fixes: f230475e62f7 ("Btrfs: put all block modifications into the tree mod log") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5cead542 |
|
01-Jun-2023 |
Boris Burkov <boris@bur.io> |
btrfs: insert tree mod log move in push_node_left There is a fairly unlikely race condition in tree mod log rewind that can result in a kernel panic which has the following trace: [530.569] BTRFS critical (device sda3): unable to find logical 0 length 4096 [530.585] BTRFS critical (device sda3): unable to find logical 0 length 4096 [530.602] BUG: kernel NULL pointer dereference, address: 0000000000000002 [530.618] #PF: supervisor read access in kernel mode [530.629] #PF: error_code(0x0000) - not-present page [530.641] PGD 0 P4D 0 [530.647] Oops: 0000 [#1] SMP [530.654] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted: G S O K 5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1 [530.680] Hardware name: Quanta Mono Lake-M.2 SATA 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017 [530.703] RIP: 0010:__btrfs_map_block+0xaa/0xd00 [530.755] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246 [530.767] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX: f2702d8b8be15100 [530.784] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI: ffff88885fd973c8 [530.800] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09: 00000000fffeffff [530.816] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12: 0000000000000000 [530.832] R13: 0000000000001000 R14: 0000000000001000 R15: ffffc9002c2f76f0 [530.848] FS: 00007f38d64af000(0000) GS:ffff88885fd80000(0000) knlGS:0000000000000000 [530.866] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [530.880] CR2: 0000000000000002 CR3: 00000002b6770004 CR4: 00000000003706e0 [530.896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [530.912] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [530.928] Call Trace: [530.934] ? btrfs_printk+0x13b/0x18c [530.943] ? btrfs_bio_counter_inc_blocked+0x3d/0x130 [530.955] btrfs_map_bio+0x75/0x330 [530.963] ? kmem_cache_alloc+0x12a/0x2d0 [530.973] ? btrfs_submit_metadata_bio+0x63/0x100 [530.984] btrfs_submit_metadata_bio+0xa4/0x100 [530.995] submit_extent_page+0x30f/0x360 [531.004] read_extent_buffer_pages+0x49e/0x6d0 [531.015] ? submit_extent_page+0x360/0x360 [531.025] btree_read_extent_buffer_pages+0x5f/0x150 [531.037] read_tree_block+0x37/0x60 [531.046] read_block_for_search+0x18b/0x410 [531.056] btrfs_search_old_slot+0x198/0x2f0 [531.066] resolve_indirect_ref+0xfe/0x6f0 [531.076] ? ulist_alloc+0x31/0x60 [531.084] ? kmem_cache_alloc_trace+0x12e/0x2b0 [531.095] find_parent_nodes+0x720/0x1830 [531.105] ? ulist_alloc+0x10/0x60 [531.113] iterate_extent_inodes+0xea/0x370 [531.123] ? btrfs_previous_extent_item+0x8f/0x110 [531.134] ? btrfs_search_path_in_tree+0x240/0x240 [531.146] iterate_inodes_from_logical+0x98/0xd0 [531.157] ? btrfs_search_path_in_tree+0x240/0x240 [531.168] btrfs_ioctl_logical_to_ino+0xd9/0x180 [531.179] btrfs_ioctl+0xe2/0x2eb0 This occurs when logical inode resolution takes a tree mod log sequence number, and then while backref walking hits a rewind on a busy node which has the following sequence of tree mod log operations (numbers filled in from a specific example, but they are somewhat arbitrary) REMOVE_WHILE_FREEING slot 532 REMOVE_WHILE_FREEING slot 531 REMOVE_WHILE_FREEING slot 530 ... REMOVE_WHILE_FREEING slot 0 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0 ADD slot 455 ADD slot 454 ADD slot 453 ... ADD slot 0 MOVE src slot 0 -> dst slot 456 nritems 533 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0 When this sequence gets applied via btrfs_tree_mod_log_rewind, it allocates a fresh rewind eb, and first inserts the correct key info for the 533 elements, then overwrites the first 456 of them, then decrements the count by 456 via the add ops, then rewinds the move by doing a memmove from 456:988->0:532. We have never written anything past 532, so that memmove writes garbage into the 0:532 range. In practice, this results in a lot of fully 0 keys. The rewind then puts valid keys into slots 0:455 with the last removes, but 456:532 are still invalid. When search_old_slot uses this eb, if it uses one of those invalid slots, it can then read the extent buffer and issue a bio for offset 0 which ultimately panics looking up extent mappings. This bad tree mod log sequence gets generated when the node balancing code happens to do a balance_node_right followed by a push_node_left while logging in the tree mod log. Illustrated for ebs L and R (left and right): L R start: [XXX|YYY|...] [ZZZ|...|...] balance_node_right: [XXX|YYY|...] [...|ZZZ|...] move Z to make room for Y [XXX|...|...] [YYY|ZZZ|...] copy Y from L to R push_node_left: [XXX|YYY|...] [...|ZZZ|...] copy Y from R to L [XXX|YYY|...] [ZZZ|...|...] move Z into emptied space (NOT LOGGED!) This is because balance_node_right logs a move, but push_node_left explicitly doesn't. That is because logging the move would remove the overwritten src < dst range in the right eb, which was already logged when we called btrfs_tree_mod_log_eb_copy. The correct sequence would include a move from 456:988 to 0:532 after remove 0:455 and before removing 0:532. Reversing that sequence would entail creating keys for 0:532, then moving those keys out to 456:988, then creating more keys for 0:455. i.e., REMOVE_WHILE_FREEING slot 532 REMOVE_WHILE_FREEING slot 531 REMOVE_WHILE_FREEING slot 530 ... REMOVE_WHILE_FREEING slot 0 MOVE src slot 456 -> dst slot 0 nritems 533 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0 ADD slot 455 ADD slot 454 ADD slot 453 ... ADD slot 0 MOVE src slot 0 -> dst slot 456 nritems 533 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0 Fix this to log the move but avoid the double remove by putting all the logging logic in btrfs_tree_mod_log_eb_copy which has enough information to detect these cases and properly log moves, removes, and adds. Leave btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's tree mod logging. (Un)fortunately, this is quite difficult to reproduce, and I was only able to reproduce it by adding sleeps in btrfs_search_old_slot that would encourage more log rewinding during ino_to_logical ioctls. I was able to hit the warning in the previous patch in the series without the fix quite quickly, but not after this patch. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
016f9d0b |
|
29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename del_ptr to btrfs_del_ptr and export it This exists internal to ctree.c, however btrfs check needs to use it for some of its operations. I'd rather not duplicate that code inside of btrfs check as this is low level and I want to keep this code in one place, so rename the function to btrfs_del_ptr and export it so that it can be used inside of btrfs-progs safely. Add a comment to make sure this doesn't get removed by a future cleanup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b3cbfb0d |
|
29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a btrfs_csum_type_size helper This is needed in btrfs-progs for the tools that convert the checksum types for file systems and a few other things. We don't have it in the kernel as we just want to get the size for the super blocks type. However I don't want to have to manually add this every time we sync ctree.c into btrfs-progs, so add the helper in the kernel with a note so it doesn't get removed by a later cleanup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4aec05fa |
|
29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove level argument from btrfs_set_block_flags We just pass in btrfs_header_level(eb) for the level, and we're passing in the eb already, so simply get the level from the eb inside of btrfs_set_block_flags. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eee3b811 |
|
27-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: improve leaf dump and error handling Improve the leaf dump behavior by: - Always dump the leaf first, then the error message - Output the slot number if possible Especially in __btrfs_free_extent() the leaf dump of extent tree can be pretty large. With an extra slot number it's much easier to locate the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6c75a589 |
|
27-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: print-tree: pass const extent buffer pointer Since print-tree infrastructure only prints the content of a tree block, we can make them to accept const extent buffer pointer. This removes a forced type convert in extent-tree, where we convert a const extent buffer pointer to regular one, just to avoid compiler warning. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
88ad95b0 |
|
26-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: tag as unlikely the key comparison when checking sibling keys When checking siblings keys, before moving keys from one node/leaf to a sibling node/leaf, it's very unexpected to have the last key of the left sibling greater than or equals to the first key of the right sibling, as that means we have a (serious) corruption that breaks the key ordering properties of a b+tree. Since this is unexpected, surround the comparison with the unlikely macro, which helps the compiler generate better code for the most expected case (no existing b+tree corruption). This is also what we do for other unexpected cases of invalid key ordering (like at btrfs_set_item_key_safe()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f469c8bd |
|
12-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: unexport btrfs_prev_leaf() btrfs_prev_leaf() is not used outside ctree.c, so there's no need to export it at ctree.h - just make it static at ctree.c and move its definition above btrfs_search_slot_for_read(), since that function calls it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a2cea677 |
|
26-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: print extent buffers when sibling keys check fails When trying to move keys from one node/leaf to another sibling node/leaf, if the sibling keys check fails we just print an error message with the last key of the left sibling and the first key of the right sibling. However it's also useful to print all the keys of each sibling, as it may provide some clues to what went wrong, which code path may be inserting keys in an incorrect order. So just do that, print the siblings with btrfs_print_tree(), as it works for both leaves and nodes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ae5afd0 |
|
26-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction when sibling keys check fails for leaves If the sibling keys check fails before we move keys from one sibling leaf to another, we are not aborting the transaction - we leave that to some higher level caller of btrfs_search_slot() (or anything else that uses it to insert items into a b+tree). This means that the transaction abort will provide a stack trace that omits the b+tree modification call chain. So change this to immediately abort the transaction and therefore get a more useful stack trace that shows us the call chain in the bt+tree modification code. It's also important to immediately abort the transaction just in case some higher level caller is not doing it, as this indicates a very serious corruption and we should stop the possibility of doing further damage. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f932d4e |
|
12-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix btrfs_prev_leaf() to not return the same key twice A call to btrfs_prev_leaf() may end up returning a path that points to the same item (key) again. This happens if while btrfs_prev_leaf(), after we release the path, a concurrent insertion happens, which moves items off from a sibling into the front of the previous leaf, and an item with the computed previous key does not exists. For example, suppose we have the two following leaves: Leaf A ------------------------------------------------------------- | ... key (300 96 10) key (300 96 15) key (300 96 16) | ------------------------------------------------------------- slot 20 slot 21 slot 22 Leaf B ------------------------------------------------------------- | key (300 96 20) key (300 96 21) key (300 96 22) ... | ------------------------------------------------------------- slot 0 slot 1 slot 2 If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with a path pointing to leaf B and slot 0 and the following happens: 1) At btrfs_prev_leaf() we compute the previous key to search as: (300 96 19), which is a key that does not exists in the tree; 2) Then we call btrfs_release_path() at btrfs_prev_leaf(); 3) Some other task inserts a key at leaf A, that sorts before the key at slot 20, for example it has an objectid of 299. In order to make room for the new key, the key at slot 22 is moved to the front of leaf B. This happens at push_leaf_right(), called from split_leaf(). After this leaf B now looks like: -------------------------------------------------------------------------------- | key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... | -------------------------------------------------------------------------------- slot 0 slot 1 slot 2 slot 3 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed previous key: (300 96 19). Since the key does not exists, btrfs_search_slot() returns 1 and with a path pointing to leaf B and slot 1, the item with key (300 96 20); 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of leaf B, the same key as before it was called, since the key at slot 0 of leaf B (300 96 16) is less than the computed previous key, which is (300 96 19); 6) As a consequence btrfs_previous_item() returns a path that points again to the item with key (300 96 20). For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not be functional a problem, despite not making sense to return a new path pointing again to the same item/key. However for a caller such as tree-log.c:log_dir_items(), this has a bad consequence, as it can result in not logging some dir index deletions in case the directory is being logged without holding the inode's VFS lock (logging triggered while logging a child inode for example) - for the example scenario above, in case the dir index keys 17, 18 and 19 were deleted in the current transaction. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
524f14bb |
|
05-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove pointless loop at btrfs_get_next_valid_item() It's pointless to have a while loop at btrfs_get_next_valid_item(), as if the slot on the current leaf is beyond the last item, we call btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or a valid slot in the current leaf if after releasing the path an item gets pushed from the next leaf to the current leaf). So just call btrfs_next_leaf() if the current slot on the current leaf is beyond the last item. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fdf8d595 |
|
23-Feb-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: open code btrfs_bin_search() btrfs_bin_search() is a simple wrapper that searches for the whole slots by calling btrfs_generic_bin_search() with the starting slot/first_slot preset to 0. This simple wrapper can be open coded as btrfs_bin_search(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9cf14029 |
|
07-Feb-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle errors from btrfs_read_node_slot in split While investigating a problem with error injection I tripped over curious behavior in the node/leaf splitting code. If we get an EIO when trying to read either the left or right leaf/node for splitting we'll simply treat the node as if it were full and continue on. The end result of this isn't too bad, we simply end up allocating a block when we may have pushed items into the adjacent blocks. However this does essentially allow us to continue to modify a file system that we've gotten errors on, either from a bad disk or csum mismatch or other corruption. This isn't particularly safe, so instead handle these btrfs_read_node_slot() usages differently. We allow you to pass in any slot, the idea being that we save some code if the slot number is outside of the range of the parent. This means we treat all errors the same, when in reality we only want to ignore -ENOENT. Fix this by changing how we call btrfs_read_node_slot(), which is to only call it for slots we know are valid. This way if we get an error back from reading the block we can properly pass the error up the chain. This was validated with the error injection testing I was doing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d4694728 |
|
07-Feb-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: replace BUG_ON with ASSERT in btrfs_read_node_slot In btrfs_read_node_slot() we have a BUG_ON() that can be converted to an ASSERT(), it's from an extent buffer and the level is validated at the time it's read from disk. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a724f313 |
|
08-Feb-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do unsigned integer division in the extent buffer binary search loop In the search loop of the binary search function, we are doing a division by 2 of the sum of the high and low slots. Because the slots are integers, the generated assembly code for it is the following on x86_64: 0x00000000000141f1 <+145>: mov %eax,%ebx 0x00000000000141f3 <+147>: shr $0x1f,%ebx 0x00000000000141f6 <+150>: add %eax,%ebx 0x00000000000141f8 <+152>: sar %ebx It's a few more instructions than a simple right shift, because signed integer division needs to round towards zero. However we know that slots can never be negative (btrfs_header_nritems() returns an u32), so we can instead use unsigned types for the low and high slots and therefore use unsigned integer division, which results in a single instruction on x86_64: 0x00000000000141f0 <+144>: shr %ebx So use unsigned types for the slots and therefore unsigned division. This is part of a small patchset comprised of the following two patches: btrfs: eliminate extra call when doing binary search on extent buffer btrfs: do unsigned integer division in the extent buffer binary search loop The following fs_mark test was run on a non-debug kernel (Debian's default kernel config) before and after applying the patchset: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 6 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Results before applying patchset: FSUse% Count Size Files/sec App Overhead 2 1200000 0 174472.0 11549868 4 2400000 0 253503.0 11694618 4 3600000 0 257833.1 11611508 6 4800000 0 247089.5 11665983 6 6000000 0 211296.1 12121244 10 7200000 0 187330.6 12548565 Results after applying patchset: FSUse% Count Size Files/sec App Overhead 2 1200000 0 207556.0 11393252 4 2400000 0 266751.1 11347909 4 3600000 0 274397.5 11270058 6 4800000 0 259608.4 11442250 6 6000000 0 238895.8 11635921 8 7200000 0 211942.2 11873825 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b00dfff |
|
08-Feb-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: eliminate extra call when doing binary search on extent buffer The function btrfs_bin_search() is just a wrapper around the function generic_bin_search(), which passes the same arguments plus a default low slot with a value of 0. This adds an unnecessary extra function call, since btrfs_bin_search() is not static. So improve on this by making btrfs_bin_search() an inline function that calls generic_bin_search(), renaming the later to btrfs_generic_bin_search() and exporting it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
190a8339 |
|
26-Jan-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirty btrfs_clean_tree_block is a misnomer, it's just clear_extent_buffer_dirty with some extra accounting around it. Rename this to btrfs_clear_buffer_dirty to make it more clear it belongs with it's setter, btrfs_mark_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed25dab3 |
|
26-Jan-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add trans argument to btrfs_clean_tree_block We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4c853af |
|
16-Nov-2022 |
ChenXiaoSong <chenxiaosong2@huawei.com> |
btrfs: add might_sleep() annotations Add annotations to functions that might sleep due to allocations or IO and could be called from various contexts. In case of btrfs_search_slot it's not obvious why it would sleep: btrfs_search_slot setup_nodes_for_search reada_for_balance btrfs_readahead_node_child btrfs_readahead_tree_block btrfs_find_create_tree_block alloc_extent_buffer kmem_cache_zalloc /* allocate memory non-atomically, might sleep */ kmem_cache_alloc(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO) read_extent_buffer_pages submit_extent_page /* disk IO, might sleep */ submit_one_bio Other examples where the sleeping could happen is in 3 places might sleep in update_qgroup_limit_item(), as shown below: update_qgroup_limit_item btrfs_alloc_path /* allocate memory non-atomically, might sleep */ kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS) Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8009adf3 |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove BTRFS_LEAF_DATA_OFFSET This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so remove this helper and replace it's usage with the above statement. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
637e3b48 |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add helpers for manipulating leaf items and data We have some gnarly memmove and copy_extent_buffer calls for leaf manipulation. This is because our item offsets aren't absolute, they're based on 0 being where the items start in the leaf, which is after the btrfs_header. This means any manipulation of the data requires adding sizeof(struct btrfs_header) to the offsets we pull from the items. Moving the items themselves is easier as the helpers are absolute offsets, however we of course have to call the helpers to get the offsets for the item numbers. This makes for copy_extent_buffer/memmove_extent_buffer calls that are kind of hard to reason about what's happening. Fix this by pushing this logic into helpers. For data we'll only use the item provided offsets, and the helpers will use the BTRFS_LEAF_DATA_OFFSET addition for the offsets. Additionally for the item manipulation simply pass in the item numbers, and then the helpers will call the offset helper to get the actual offset into the leaf. The diffstat makes this look like more code, but that's simply because I added comments for the helpers, it's net negative for the amount of code, and is easier to reason. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e23efd8e |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add eb to btrfs_node_key_ptr_offset This is a change needed for extent tree v2, as we will be growing the header size. This exists in btrfs-progs currently, and not having it makes syncing accessors.[ch] more problematic. So make this change to set us up for extent tree v2 and match what btrfs-progs does to make syncing easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42c9419a |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass the extent buffer for the btrfs_item_nr helpers This is actually a change for extent tree v2, but it exists in btrfs-progs but not in the kernel. This makes it annoying to sync accessors.h with btrfs-progs, and since this is the way I need it for extent-tree v2 simply update these helpers to take the extent buffer in order to make syncing possible now, and make the extent tree v2 stuff easier moving forward. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6bfd0ffa |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move file_extent_item helpers into file-item.h These helpers use functions that are in multiple places, which makes it tricky to sync them into btrfs-progs. Move them to file-item.h and then include file-item.h in places that use these helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3a3178c7 |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move leaf_data_end into ctree.c This is only used in ctree.c, with the exception of zero'ing out extent buffers we're getting ready to write out. In theory we shouldn't have an extent buffer with 0 items that we're writing out, however I'd rather be safe than sorry so open code it in extent_io.c, and then copy the helper into ctree.c. This will make it easier to sync accessors.[ch] into btrfs-progs, as this requires a helper that isn't defined in accessors.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
789d6a3a |
|
13-Sep-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: concentrate all tree block parentness check parameters into one structure There are several different tree block parentness check parameters used across several helpers: - level Mandatory - transid Under most cases it's mandatory, but there are several backref cases which skips this check. - owner_root - first_key Utilized by most top-down tree search routine. Otherwise can be skipped. Those four members are not always mandatory checks, and some of them are the same u64, which means if some arguments got swapped compiler will not catch it. Furthermore if we're going to further expand the parentness check, we need to modify quite some helpers just to add one more parameter. This patch will concentrate all these members into a structure called btrfs_tree_parent_check, and pass that structure for the following helpers: - btrfs_read_extent_buffer() - read_tree_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67707479 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move relocation prototypes into relocation.h Move these out of ctree.h into relocation.h to cut down on code in ctree.h Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43dd529a |
|
27-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0231804 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move extent-tree helpers into their own header file Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad1ac501 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_map_token to accessors This is specific to the item-accessor code, move it out of ctree.h into accessor.h/.c and then update the users to include the new header file. This un-inlines btrfs_init_map_token, however this is only called once per function so it's not critical to be inlined. This also saves 904 bytes of code on a release build. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ec8eb376 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h We're going to use fs.h to hold fs wide related helpers and definitions, move the FS_STATE enum and related helpers to fs.h, and then update all files that need these definitions to include fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b569ea0 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the printk helpers out of ctree.h We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
33cff222 |
|
14-Oct-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove gfp_t flag from btrfs_tree_mod_log_insert_key() All callers of btrfs_tree_mod_log_insert_key() are now passing a GFP_NOFS flag to it, so remove the flag from it and from alloc_tree_mod_elem() and use it directly within alloc_tree_mod_elem(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
879b2221 |
|
13-Oct-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: switch GFP_ATOMIC to GFP_NOFS when fixing up low keys When fixing up the first key of each node above the current level, at fixup_low_keys(), we are doing a GFP_ATOMIC allocation for inserting an operation record for the tree mod log. However we can do just fine with GFP_NOFS nowadays. The need for GFP_ATOMIC was for the old days when we had custom locks with spinning behaviour for extent buffers and we were in spinning mode while at fixup_low_keys(). Now we use rw semaphores for extent buffer locks, so we can safely use GFP_NOFS. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
890d2b1a |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_next_old_item into ctree.c This uses btrfs_header_nritems, which I will be moving out of ctree.h. In order to avoid needing to include the relevant header in ctree.h, simply move this helper function into ctree.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename parameters ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
226463d7 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_path_cachep out of ctree.h This is local to the ctree code, remove it from ctree.h and inode.c, create new init/exit functions for the cachep, and move it locally to ctree.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bdcdd86c |
|
10-Nov-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix assertion failure and blocking during nowait buffered write When doing a nowait buffered write we can trigger the following assertion: [11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658 [11138.438251] ------------[ cut here ]------------ [11138.438254] kernel BUG at fs/btrfs/messages.c:259! [11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1 [11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs] [11138.443583] Code: 5b 41 5a 41 (...) [11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246 [11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000 [11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff [11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001 [11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000 [11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000 [11138.452225] FS: 00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000 [11138.452949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0 [11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [11138.455779] Call Trace: [11138.456211] <TASK> [11138.456598] btrfs_next_old_leaf.cold+0x18/0x1d [btrfs] [11138.457827] ? kmem_cache_alloc+0x18d/0x2a0 [11138.458516] btrfs_lookup_csums_range+0x149/0x4d0 [btrfs] [11138.459407] csum_exist_in_range+0x56/0x110 [btrfs] [11138.460271] can_nocow_file_extent+0x27c/0x310 [btrfs] [11138.461155] can_nocow_extent+0x1ec/0x2e0 [btrfs] [11138.461672] btrfs_check_nocow_lock+0x114/0x1c0 [btrfs] [11138.462951] btrfs_buffered_write+0x44c/0x8e0 [btrfs] [11138.463482] btrfs_do_write_iter+0x42b/0x5f0 [btrfs] [11138.463982] ? lock_release+0x153/0x4a0 [11138.464347] io_write+0x11b/0x570 [11138.464660] ? lock_release+0x153/0x4a0 [11138.465213] ? lock_is_held_type+0xe8/0x140 [11138.466003] io_issue_sqe+0x63/0x4a0 [11138.466339] io_submit_sqes+0x238/0x770 [11138.466741] __do_sys_io_uring_enter+0x37b/0xb10 [11138.467206] ? lock_is_held_type+0xe8/0x140 [11138.467879] ? syscall_enter_from_user_mode+0x1d/0x50 [11138.468688] do_syscall_64+0x38/0x90 [11138.469265] entry_SYSCALL_64_after_hwframe+0x63/0xcd [11138.470017] RIP: 0033:0x7f1438c539e6 This is because to check if we can NOCOW, we check that if we can NOCOW into an extent (it's prealloc extent or the inode has NOCOW attribute), and then check if there are csums for the extent's range in the csum tree. The search may leave us beyond the last slot of a leaf, and then when we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a time_seq of 0. This triggers a failure of the first assertion at btrfs_next_old_leaf(), since we have a nowait path. With assertions disabled, we simply don't respect the NOWAIT semantics, allowing the write to block on locks or blocking on IO for reading an extent buffer from disk. Fix this by: 1) Triggering the assertion only if time_seq is not 0, which means that search is being done by a tree mod log user, and in the buffered and direct IO write paths we don't use the tree mod log; 2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). Any failure to lock an extent buffer should return immediately and not retry the search, as well as if we need to do IO to read an extent buffer from disk. Fixes: c922b016f353 ("btrfs: assert nowait mode is not used for some btree search functions") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8bb808c6 |
|
03-Nov-2022 |
David Sterba <dsterba@suse.com> |
btrfs: don't print stack trace when transaction is aborted due to ENOMEM Add ENOMEM among the error codes that don't print stack trace on transaction abort. We've got several reports from syzbot that detects stacks as errors but caused by limiting memory. As this is an artificial condition we don't need to know where exactly the error happens, the abort and error cleanup will continue like e.g. for EIO. As the transaction aborts code needs to be inline in a lot of code, the implementation cases about minimal bloat. The error codes are in a separate function and the WARN uses the condition directly. This increases the code size by 571 bytes on release build. Alternatives considered: add -ENOMEM among the errors, this increases size by 2340 bytes, various attempts to combine the WARN and helper calls, increase by 700 or more bytes. Example syzbot reports (error -12): - https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e - https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417 Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c922b016 |
|
12-Sep-2022 |
Stefan Roesch <shr@fb.com> |
btrfs: assert nowait mode is not used for some btree search functions Adds nowait asserts to btree search functions which are not used by buffered IO and direct IO paths. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
857bc13f |
|
12-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: implement a nowait option for tree searches For NOWAIT IOCBs we'll need a way to tell search to not wait on locks or anything. Accomplish this by adding a path->nowait flag that will use trylocks and skip reading of metadata, returning -EAGAIN in either of these cases. For now we only need this for reads, so only the read side is handled. Add an ASSERT() to catch anybody trying to use this for writes so they know they'll have to implement the write side. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b40130b2 |
|
26-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix lockdep splat with reloc root extent buffers We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); *** DEADLOCK *** 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has a owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merged. This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fe6a5a1 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: sink parameter is_data to btrfs_set_disk_extent_flags The parameter has been added in 2009 in the infamous monster commit 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") but not used ever since. We can sink it and allow further simplifications. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
88c602ab |
|
15-Mar-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: tree-checker: check extent buffer owner against owner rootid Btrfs doesn't check whether the tree block respects the root owner. This means, if a tree block referred by a parent in extent tree, but has owner of 5, btrfs can still continue reading the tree block, as long as it doesn't trigger other sanity checks. Normally this is fine, but combined with the empty tree check in check_leaf(), if we hit an empty extent tree, but the root node has csum tree owner, we can let such extent buffer to sneak in. Shrink the hole by: - Do extra eb owner check at tree read time - Make sure the root owner extent buffer exactly matches the root id. Unfortunately we can't yet completely patch the hole, there are several call sites can't pass all info we need: - For reloc/log trees Their owner is key::offset, not key::objectid. We need the full root key to do that accurate check. For now, we just skip the ownership check for those trees. - For add_data_references() of relocation That call site doesn't have any parent/ownership info, as all the bytenrs are all from btrfs_find_all_leafs(). - For direct backref items walk Direct backref items records the parent bytenr directly, thus unlike indirect backref item, we don't do a full tree search. Thus in that case, we don't have full parent owner to check. For the later two cases, they all pass 0 as @owner_root, thus we can skip those cases if @owner_root is 0. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62142be3 |
|
09-Mar-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: introduce btrfs_for_each_slot iterator macro There is a common pattern when searching for a key in btrfs: * Call btrfs_search_slot to find the slot for the key * Enter an endless loop: * If the found slot is larger than the no. of items in the current leaf, check the next leaf * If it's still not found in the next leaf, terminate the loop * Otherwise do something with the found key * Increment the current slot and continue To reduce code duplication, we can replace this code pattern with an iterator macro, similar to the existing for_each_X macros found elsewhere in the kernel. This also makes the code easier to understand for newcomers by putting a name to the encapsulated functionality. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a2e9dc4 |
|
11-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove trivial wrapper btrfs_read_buffer() The function btrfs_read_buffer() is useless, it just calls btree_read_extent_buffer_pages() with exactly the same arguments. So remove it and rename btree_read_extent_buffer_pages() to btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_" prefix (since it's used outside disk-io.c) and the name is clear enough about what it does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
376a21d7 |
|
11-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update outdated comment for read_block_for_search() The comment at the top of read_block_for_search() is very outdated, as it refers to the blocking versus spinning path locking modes. We no longer have these two locking modes after we switched the btree locks from custom code to rw semaphores. So update the comment to stop referring to the blocking mode and put it more up to date. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b246666e |
|
11-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: release upper nodes when reading stale btree node from disk When reading a btree node (or leaf), at read_block_for_search(), if we can't find its extent buffer in the cache (the fs_info->buffer_radix radix tree), then we unlock all upper level nodes before reading the btree node/leaf from disk, to prevent blocking other tasks for too long. However if we find that the extent buffer is in the cache but it is not up to date, we don't unlock upper level nodes before reading it from disk, potentially blocking other tasks on upper level nodes for too long. Fix this inconsistent behaviour by unlocking upper level nodes if we need to read a node/leaf from disk because its in-memory extent buffer is not up to date. If we unlocked upper level nodes then we must return -EAGAIN to the caller, just like the case where the extent buffer is not cached in memory. And like that case, we determine if upper level nodes are locked by checking only if the parent node is locked - if it isn't, then no other upper level nodes are locked. This is actually a rare case, as if we have an extent buffer in memory, it typically has the uptodate flag set and passes all the checks done by btrfs_buffer_uptodate(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4bb59055 |
|
11-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid unnecessary btree search restarts when reading node When reading a btree node, at read_block_for_search(), if we don't find the node's (or leaf) extent buffer in the cache, we will read it from disk. Since that requires waiting on IO, we release all upper level nodes from our path before reading the target node/leaf, and then return -EAGAIN to the caller, which will make the caller restart the while btree search. However we are causing the restart of btree search even for cases where it is not necessary: 1) We have a path with ->skip_locking set to true, typically when doing a search on a commit root, so we are never holding locks on any node; 2) We are doing a read search (the "ins_len" argument passed to btrfs_search_slot() is 0), or we are doing a search to modify an existing key (the "cow" argument passed to btrfs_search_slot() has a value of 1 and "ins_len" is 0), in which case we never hold locks for upper level nodes; 3) We are doing a search to insert or delete a key, in which case we may or may not have upper level nodes locked. That depends on the current minimum write lock levels at btrfs_search_slot(), if we had to split or merge parent nodes, if we had to COW upper level nodes and if we ever visited slot 0 of an upper level node. It's still common to not have upper level nodes locked, but our current node must be at least at level 1, for insertions, or at least at level 2 for deletions. In these cases when we have locks on upper level nodes, they are always write locks. These cases where we are not holding locks on upper level nodes far outweigh the cases where we are holding locks, so it's completely wasteful to retry the whole search when we have no upper nodes locked. So change the logic to not return -EAGAIN, and make the caller retry the search, when we don't have the parent node locked - when it's not locked it means no other upper level nodes are locked as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a4ffa1b |
|
22-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: unify the error handling of btrfs_read_buffer() There is one oddball error handling of btrfs_read_buffer(): ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key); if (!ret) { *eb_ret = tmp; return 0; } free_extent_buffer(tmp); btrfs_release_path(p); return -EIO; While all other call sites check the error first. Unify the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4eb150d6 |
|
22-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: unify the error handling pattern for read_tree_block() We had an error handling pattern for read_tree_block() like this: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here * Normally ended up with return or goto out. */ } else if (!extent_buffer_uptodate(eb)) { /* * Different error handling here * Normally also ended up with return or goto out; */ } This is fine, but if we want to add extra check for each read_tree_block(), the existing if-else-if is not that expandable and will take reader some seconds to figure out there is no extra branch. Here we change it to a more common way, without the extra else: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here */ return eb or goto out; } if (!extent_buffer_uptodate(eb)) { /* * Different error handling here */ return eb or goto out; } This also removes some oddball call sites which uses some creative way to check error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0cae23b6 |
|
03-Feb-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid unnecessary computation when deleting items from a leaf When deleting items from a leaf, we always compute the sum of the data sizes of the items that are going to be deleted. However we only use that sum when the last item to delete is behind the last item in the leaf. This unnecessarily wastes CPU time when we are deleting either the whole leaf or from some slot > 0 up to the last item in the leaf, and both of these cases are common (e.g. truncation operation, either as a result of truncate(2) or when logging inodes, deleting checksums after removing a large enough extent, etc). So compute only the sum of the data sizes if the last item to be deleted does not match the last item in the leaf. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c4063d1 |
|
03-Feb-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid unnecessary COW of leaves when deleting items from a leaf When we delete items from a leaf, if we end up with more than two thirds of unused leaf space, we try to delete the leaf by moving all its items into its left and right neighbour leaves. Sometimes that is not possible because there is not enough free space in the left and right leaves, and in that case we end up not deleting our leaf. The way we are doing this is not ideal and can be improved in the following ways: 1) When we call push_leaf_left(), we pass a value of 1 byte to the data size parameter of push_leaf_left(). This is not realistic value because no item can have a size less than 25 bytes, which is the size of struct btrfs_item. This means that means that if the left leaf has not enough free space to push any item, we end up COWing it even if we end up not changing its content at all. COWing that leaf means allocating a new metadata extent, marking it dirty and doing more IO when committing a transaction or when syncing a log tree. For a log tree case, it's particularly more important to avoid the useless COW operation, as more IO can imply a higher latency for an fsync operation. So instead of passing 1 as the minimum data size for push_leaf_left(), pass the size of the first item in our leaf, as we don't want to COW the left leaf if we can't at least push the first item of our leaf; 2) When we call push_leaf_right(), we also pass a value of 1 byte as the data size parameter of push_leaf_right(). Like the previous case, it will also result in COWing the right leaf even if we are not able to move any items into it, since there can't be any item with a size smaller than 25 bytes (the size of struct btrfs_item). So instead of passing 1 as the minimum data size to push_leaf_right(), pass a size that corresponds to the sum of the size of all the remaining items in our leaf. We are not interested in moving less than that, because if we do, we are not able to delete our leaf and we have COWed the right leaf for nothing. Plus, moving only some of the items of our leaf, it means an even less balanced tree. Just like the previous case, we want to avoid the useless COW of the right leaf, this way we don't have to spend time allocating one new metadata extent, and doing more IO when committing a transaction or syncing a log tree. For the log tree case it's specially more important because more IO can result in a higher latency for a fsync operation. So adjust the minimum data size passed to push_leaf_left() and push_leaf_right() as mentioned above. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path Not being able to delete a leaf that became less than 1/3 full after deleting items from it is actually common. For example, for the fio test mentioned in the changelog of patch 6/6, we are only able to delete a leaf at btrfs_del_items() about 5.3% of the time, due to its left and right neighbour leaves not having enough free space to push all the remaining items into them. The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4e098a9 |
|
03-Feb-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove unnecessary leaf free space checks when pushing items When trying to push items from a leaf into its left and right neighbours, we lock the left or right leaf, check if it has the required minimum free space, COW the leaf and then check again if it has the minimum required free space. This second check is pointless: 1) Most and foremost because it's not needed. We have a write lock on the leaf and on its parent node, so no one can come in and change either the pre-COW or post-COW version of the leaf for the whole duration of the push_leaf_left() and push_leaf_right() calls; 2) The call to btrfs_leaf_free_space() is not trivial, it has a fair amount of arithmetic operations and access to fields in the leaf's header and items, so it's not very cheap. So remove the duplicated free space checks. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1227996 |
|
14-Dec-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: refactor unlock_up The purpose of this function is to unlock all nodes in a btrfs path which are above 'lowest_unlock' and whose slot used is different than 0. As such it used slightly awkward structure of 'if' as well as somewhat cryptic "no_skip" control variable which denotes whether we should check the current level of skipability or no. This patch does the following (cosmetic) refactorings: * Renames 'no_skip' to 'check_skip' and makes it a boolean. This variable controls whether we are below the lowest_unlock/skip_level levels. * Consolidates the 2 conditions which warrant checking whether the current level should be skipped under 1 common if (check_skip) branch, this increase indentation level but is not critical. * Consolidates the 'skip_level < i && i >= lowest_unlock' and 'i >= lowest_unlock && i > skip_level' condition into a common branch since those are identical. * Eliminates the local extent_buffer variable as in this case it doesn't bring anything to function readability. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
727e6060 |
|
02-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove stale comment about locking at btrfs_search_slot() The comment refers to the old extent buffer locking code, where we used to have custom locks that had blocking and spinning behaviour modes. That is not the case anymore, since we have transitioned to rw semaphores, so the comment does not offer any value anymore. Remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bb8e9a60 |
|
02-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove BUG_ON() after splitting leaf After calling split_leaf() we BUG_ON() if the returned value is greater than zero. However split_leaf() only returns 0, in case of success, or a negative value in case of an error. The reason for the BUG_ON() is that if we ever get a positive return value from split_leaf(), we can not simply propagate it to the callers of btrfs_search_slot(), as that would be interpreted as "key not found" and not as an error. That means it could result in callers ending up causing some potential silent corruption. So change the BUG_ON() to an ASSERT(), and in case assertions are disabled, produce a warning and set the return value to an error, to make it not possible to get into a silent corruption and having the error not noticed. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
109324cf |
|
02-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move leaf search logic out of btrfs_search_slot() There's quite a significant amount of code for doing the key search for a leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus btrfs_search_slot() is already big enough. So move the logic that does the key search on a leaf into a new helper function. This makes it better organized, removing the need for the labels and the gotos, as well as reducing the indentation level and the size of btrfs_search_slot(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5e1c174 |
|
02-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove useless condition check before splitting leaf When inserting a key, we check if the write_lock_level is less than 1, and if so we set it to 1, release the path and retry the tree traversal. However that is unnecessary, because when ins_len is greater than 0, we know that write_lock_level can never be less than 1. The logic to retry is also buggy, because in case ins_len was decremented, due to an exact key match and the search is not meant for item extension (path->search_for_extension is 0), we retry without incrementing ins_len, which would make the next retry decrement it again by the same amount. So remove the check for write_lock_level being less than 1 and add an assertion to assert it's always >= 1. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e2e58d0f |
|
02-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: try to unlock parent nodes earlier when inserting a key When inserting a new key, we release the write lock on the leaf's parent only after doing the binary search on the leaf. This is because if the key ends up at slot 0, we will have to update the key at slot 0 of the parent node. The same reasoning applies to any other upper level nodes when their slot is 0. We also need to keep the parent locked in case the leaf does not have enough free space to insert the new key/item, because in that case we will split the leaf and we will need to add a new key to the parent due to a new leaf resulting from the split operation. However if the leaf has enough space for the new key and the key does not end up at slot 0 of the leaf we could release our write lock on the parent before doing the binary search on the leaf to figure out the destination slot. That leads to reducing the amount of time other tasks are blocked waiting to lock the parent, therefore increasing parallelism when there are other tasks that are trying to access other leaves accessible through the same parent. This also applies to other upper nodes besides the immediate parent, when their slot is 0, since we keep locks on them until we figure out if the leaf slot is slot 0 or not. In fact, having the key ending at up slot 0 when is rare. Typically it only happens when the key is less than or equals to the smallest, the "left most", key of the entire btree, during a split attempt when we try to push to the right sibling leaf or when the caller just wants to update the item of an existing key. It's also very common that a leaf has enough space to insert a new key, since after a split we move about half of the keys from one into the new leaf. So unlock the parent, and any other upper level nodes, when during a key insertion we notice the key is greater then the first key in the leaf and the leaf has enough free space. After unlocking the upper level nodes, do the binary search using a low boundary of slot 1 and not slot 0, to figure out the slot where the key will be inserted (or where the key already is in case it exists and the caller wants to modify its item data). This extra comparison, with the first key, is cheap and the key is very likely already in a cache line because it immediately follows the header of the extent buffer and we have recently read the level field of the header (which in fact is the last field of the header). The following fs_mark test was run on a non-debug kernel (debian's default kernel config), with a 12 cores intel CPU, and using a NVMe device: $ cat run-fsmark.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 165273.6 5958381 0 2400000 0 190938.3 6284477 0 3600000 0 181429.1 6044059 0 4800000 0 173979.2 6223418 0 6000000 0 139288.0 6384560 0 7200000 0 163000.4 6520083 1 8400000 0 57799.2 5388544 1 9600000 0 66461.6 5552969 2 10800000 0 49593.5 5163675 2 12000000 0 57672.1 4889398 After this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 167987.3 (+1.6%) 6272730 0 2400000 0 198563.9 (+4.0%) 6048847 0 3600000 0 197436.6 (+8.8%) 6163637 0 4800000 0 202880.7 (+16.6%) 6371771 1 6000000 0 167275.9 (+20.1%) 6556733 1 7200000 0 204051.2 (+25.2%) 6817091 1 8400000 0 69622.8 (+20.5%) 5525675 1 9600000 0 69384.5 (+4.4%) 5700723 1 10800000 0 61454.1 (+23.9%) 5363754 3 12000000 0 61908.7 (+7.3%) 5370196 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb81212c |
|
02-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: allow generic_bin_search() to take low boundary as an argument Right now generic_bin_search() always uses a low boundary slot of 0, but in the next patch we'll want to often skip slot 0 when searching for a key. So make generic_bin_search() have the low boundary slot specified as an argument, and move the check for the extent buffer level from btrfs_bin_search() to generic_bin_search() to avoid adding another wrapper around generic_bin_search(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
120de408 |
|
24-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: check the root node for uptodate before returning it Now that we clear the extent buffer uptodate if we fail to write it out we need to check to see if our root node is uptodate before we search down it. Otherwise we could return stale data (or potentially corrupt data that was caught by the write verification step) and think that the path is OK to search down. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d96b3424 |
|
21-Nov-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make send work with concurrent block group relocation We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (specially metadata) or about to read a metadata extent and expecting it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with a different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495e0e0 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations. Relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystem we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it or send can end up delaying the block group relocation for too long. In the future we might also get the automatic block group relocation for non zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way: 1) For all tree searches, send acquires a read lock on the commit root semaphore; 2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path); 3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock; 4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers. The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove of update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send; 5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits. This leaves some room for optimizations for send to have less path releases and re searching the trees when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dc2e724e |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename btrfs_item_end_nr to btrfs_item_data_end The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually the offset of the end of the data the item points to. In fact all of the helpers that we use btrfs_item_end_nr() use data in their name, like BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end() to make it clear what this helper is giving us. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3212fa14 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop the _nr from the item helpers Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
74794207 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce item_nr token variant helpers The last remaining place where we have the pattern of item = btrfs_item_nr(slot) <do something with the item> are the token helpers. Handle this by introducing token helpers that will do the btrfs_item_nr() work inside of the helper itself, and then convert all users of the btrfs_item token helpers to the new _nr() variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c91666b1 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add btrfs_set_item_*_nr() helpers We have the pattern of item = btrfs_item_nr(slot); btrfs_set_item_*(leaf, item); in a bunch of places in our code. Fix this by adding btrfs_set_item_*_nr() helpers which will do the appropriate work, and replace those calls with btrfs_set_item_*_nr(leaf, slot); Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
227f3cd0 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhere We have this pattern in a lot of places item = btrfs_item_nr(slot); btrfs_item_size(leaf, item); when we could simply use btrfs_item_size(leaf, slot); Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the _nr variation of the helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a163608 |
|
13-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix invalid delayed ref after subvolume creation failure When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f0641656 |
|
23-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: unexport setup_items_for_insert() Since setup_items_for_insert() is not used anymore outside of ctree.c, make it static and remove its prototype from ctree.h. This also requires to move the definition of setup_item_for_insert() from ctree.h to ctree.c and move down btrfs_duplicate_item() so that it's defined after setup_items_for_insert(). Further, since setup_item_for_insert() is used outside ctree.c, rename it to btrfs_setup_item_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 2/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b7ef5f3a |
|
23-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: loop only once over data sizes array when inserting an item batch When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49d0c642 |
|
22-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: assert that extent buffers are write locked instead of only locked We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and that checks that we hold a lock either in read mode or write mode. However in all contexts we use btrfs_assert_tree_locked(), we actually want to check if we are holding a write lock on the extent buffer's rw semaphore - it would be a bug if in any of those contexts we were holding a read lock instead. So change btrfs_assert_tree_locked() to use lockdep_assert_held_write() instead and, to make it more explicit, rename btrfs_assert_tree_locked() to btrfs_assert_tree_write_locked(), so that it's clear we want to check we are holding a write lock. For now there are no contexts where we want to assert that we must have a read lock, but in case that is needed in the future, we can add a new helper function that just calls out lockdep_assert_held_read(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e41d12f5 |
|
20-Sep-2021 |
Christoph Hellwig <hch@lst.de> |
mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h> There is no need to pull blk-cgroup.h and thus blkdev.h in here, so break the include chain. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0ff40a91 |
|
29-Jul-2021 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: introduce btrfs_search_backwards function It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
809d6902 |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: make btrfs_next_leaf static inline btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it to header to avoid the function call overhead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
069a2e37 |
|
20-Jul-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: continue readahead of siblings even if target node is in memory At reada_for_search(), when attempting to readahead a node or leaf's siblings, we skip the readahead of the siblings if the node/leaf is already in memory. That is probably fine for the READA_FORWARD and READA_BACK readahead types, as they are used on contexts where we end up reading some consecutive leaves, but usually not the whole btree. However for a READA_FORWARD_ALWAYS mode, currently only used for full send operations, it does not make sense to skip the readahead if the target node or leaf is already loaded in memory, since we know the caller is visiting every node and leaf of the btree in ascending order. So change the behaviour to not skip the readahead when the target node is already in memory and the readahead mode is READA_FORWARD_ALWAYS. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk, with 32GiB of RAM and using a non-debug kernel config (Debian's default config). $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } file_count=2000000 add_files $file_count 0 4 echo echo "Creating snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The duration of the full send operations, in seconds, were the following: Before this change: 85 seconds After this change: 76 seconds (-11.2%) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67d5e289 |
|
06-Jul-2021 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: remove max argument from generic_bin_search Both callers use btrfs_header_nritems to feed the max argument. Remove the argument and let generic_bin_search call it itself. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
79bd3712 |
|
29-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K. However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common. This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way we do not have chunk space reserved until the second phase completes. The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree. For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree. Changes to the chunk btree are by large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do. The diff of this change is huge, but more than half of it is just addition of comments describing both how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change. CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5963ffca |
|
20-May-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: always abort the transaction if we abort a trans handle While stress testing our error handling I noticed that sometimes we would still commit the transaction even though we had aborted the transaction. Currently we track if a trans handle has dirtied any metadata, and if it hasn't we mark the filesystem as having an error (so no new transactions can be started), but we will allow the current transaction to complete as we do not mark the transaction itself as having been aborted. This sounds good in theory, but we were not properly tracking IO errors in btrfs_finish_ordered_io, and thus committing the transaction with bogus free space data. This isn't necessarily a problem per-se with the free space cache, as the other guards in place would have kept us from accepting the free space cache as valid, but highlights a real world case where we had a bug and could have corrupted the filesystem because of it. This "skip abort on empty trans handle" is nice in theory, but assumes we have perfect error handling everywhere, which we clearly do not. Also we do not allow further transactions to be started, so all this does is save the last transaction that was happening, which doesn't necessarily gain us anything other than the potential for real corruption. Remove this particular bit of code, if we decide we need to abort the transaction then abort the current one and keep us from doing real harm to the file system, regardless of whether this specific trans handle dirtied anything or not. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ace75066 |
|
31-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: improve btree readahead for full send operations Currently a full send operation uses the standard btree readahead when iterating over the subvolume/snapshot btree, which despite bringing good performance benefits, it could be improved in a few aspects for use cases such as full send operations, which are guaranteed to visit every node and leaf of a btree, in ascending and sequential order. The limitations of that standard btree readahead implementation are the following: 1) It only triggers readahead for leaves that are physically close to the leaf being read, within a 64K range; 2) It only triggers readahead for the next or previous leaves if the leaf being read is not currently in memory; 3) It never triggers readahead for nodes. So add a new readahead mode that addresses all these points and use it for full send operations. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The durations of the full send operation in seconds were the following: Before this change: 217 seconds After this change: 205 seconds (-5.7%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
406808ab |
|
11-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use booleans where appropriate for the tree mod log functions Several functions of the tree modification log use integers as booleans, so change them to use booleans instead, making their use more clear. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f3a84ccd |
|
11-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move the tree mod log code into its own file The tree modification log, which records modifications done to btrees, is quite large and currently spread all over ctree.c, which is a huge file already. To make things better organized, move all that code into its own separate source and header files. Functions and definitions that are used outside of the module (mostly by ctree.c) are renamed so that they start with a "btrfs_" prefix. Everything else remains unchanged. This makes it easier to go over the tree modification log code every time I need to go read it to fix a bug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dbcc7d57 |
|
11-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race when cloning extent buffer during rewind of an old root While resolving backreferences, as part of a logical ino ioctl call or fiemap, we can end up hitting a BUG_ON() when replaying tree mod log operations of a root, triggering a stack trace like the following: ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:1210! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G W 5.11.0-2d11c0084b02-misc-next+ #89 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0 Code: 05 48 8d 74 10 (...) RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6 RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20 R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c FS: 00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0 Call Trace: btrfs_search_old_slot+0x265/0x10d0 ? lock_acquired+0xbb/0x600 ? btrfs_search_slot+0x1090/0x1090 ? free_extent_buffer.part.61+0xd7/0x140 ? free_extent_buffer+0x13/0x20 resolve_indirect_refs+0x3e9/0xfc0 ? lock_downgrade+0x3d0/0x3d0 ? __kasan_check_read+0x11/0x20 ? add_prelim_ref.part.11+0x150/0x150 ? lock_downgrade+0x3d0/0x3d0 ? __kasan_check_read+0x11/0x20 ? lock_acquired+0xbb/0x600 ? __kasan_check_write+0x14/0x20 ? do_raw_spin_unlock+0xa8/0x140 ? rb_insert_color+0x30/0x360 ? prelim_ref_insert+0x12d/0x430 find_parent_nodes+0x5c3/0x1830 ? resolve_indirect_refs+0xfc0/0xfc0 ? lock_release+0xc8/0x620 ? fs_reclaim_acquire+0x67/0xf0 ? lock_acquire+0xc7/0x510 ? lock_downgrade+0x3d0/0x3d0 ? lockdep_hardirqs_on_prepare+0x160/0x210 ? lock_release+0xc8/0x620 ? fs_reclaim_acquire+0x67/0xf0 ? lock_acquire+0xc7/0x510 ? poison_range+0x38/0x40 ? unpoison_range+0x14/0x40 ? trace_hardirqs_on+0x55/0x120 btrfs_find_all_roots_safe+0x142/0x1e0 ? find_parent_nodes+0x1830/0x1830 ? btrfs_inode_flags_to_xflags+0x50/0x50 iterate_extent_inodes+0x20e/0x580 ? tree_backref_for_extent+0x230/0x230 ? lock_downgrade+0x3d0/0x3d0 ? read_extent_buffer+0xdd/0x110 ? lock_downgrade+0x3d0/0x3d0 ? __kasan_check_read+0x11/0x20 ? lock_acquired+0xbb/0x600 ? __kasan_check_write+0x14/0x20 ? _raw_spin_unlock+0x22/0x30 ? __kasan_check_write+0x14/0x20 iterate_inodes_from_logical+0x129/0x170 ? iterate_inodes_from_logical+0x129/0x170 ? btrfs_inode_flags_to_xflags+0x50/0x50 ? iterate_extent_inodes+0x580/0x580 ? __vmalloc_node+0x92/0xb0 ? init_data_container+0x34/0xb0 ? init_data_container+0x34/0xb0 ? kvmalloc_node+0x60/0x80 btrfs_ioctl_logical_to_ino+0x158/0x230 btrfs_ioctl+0x205e/0x4040 ? __might_sleep+0x71/0xe0 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? getrusage+0x4b6/0x9c0 ? __kasan_check_read+0x11/0x20 ? lock_release+0xc8/0x620 ? __might_fault+0x64/0xd0 ? lock_acquire+0xc7/0x510 ? lock_downgrade+0x3d0/0x3d0 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? __kasan_check_read+0x11/0x20 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? lock_downgrade+0x3d0/0x3d0 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? __kasan_check_read+0x11/0x20 ? lock_release+0xc8/0x620 ? __task_pid_nr_ns+0xd3/0x250 ? lock_acquire+0xc7/0x510 ? __fget_files+0x160/0x230 ? __fget_light+0xf2/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fd1976e2427 Code: 00 00 90 48 8b 05 (...) RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427 RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004 RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120 R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004 R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8 Modules linked in: ---[ end trace ec8931a1c36e57be ]--- (gdb) l *(__tree_mod_log_rewind+0x3b1) 0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210). 1205 * the modification. as we're going backwards, we do the 1206 * opposite of each operation here. 1207 */ 1208 switch (tm->op) { 1209 case MOD_LOG_KEY_REMOVE_WHILE_FREEING: 1210 BUG_ON(tm->slot < n); 1211 fallthrough; 1212 case MOD_LOG_KEY_REMOVE_WHILE_MOVING: 1213 case MOD_LOG_KEY_REMOVE: 1214 btrfs_set_node_key(eb, &tm->key, tm->slot); Here's what happens to hit that BUG_ON(): 1) We have one tree mod log user (through fiemap or the logical ino ioctl), with a sequence number of 1, so we have fs_info->tree_mod_seq == 1; 2) Another task is at ctree.c:balance_level() and we have eb X currently as the root of the tree, and we promote its single child, eb Y, as the new root. Then, at ctree.c:balance_level(), we call: tree_mod_log_insert_root(eb X, eb Y, 1); 3) At tree_mod_log_insert_root() we create tree mod log elements for each slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each with a ->logical pointing to ebX->start. These are placed in an array named tm_list. Lets assume there are N elements (N pointers in eb X); 4) Then, still at tree_mod_log_insert_root(), we create a tree mod log element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set to the level of eb X and ->generation set to the generation of eb X; 5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with tm_list as argument. After that, tree_mod_log_free_eb() calls __tree_mod_log_insert() for each member of tm_list in reverse order, from highest slot in eb X, slot N - 1, to slot 0 of eb X; 6) __tree_mod_log_insert() sets the sequence number of each given tree mod log operation - it increments fs_info->tree_mod_seq and sets fs_info->tree_mod_seq as the sequence number of the given tree mod log operation. This means that for the tm_list created at tree_mod_log_insert_root(), the element corresponding to slot 0 of eb X has the highest sequence number (1 + N), and the element corresponding to the last slot has the lowest sequence number (2); 7) Then, after inserting tm_list's elements into the tree mod log rbtree, the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest sequence number, which is N + 2; 8) Back to ctree.c:balance_level(), we free eb X by calling btrfs_free_tree_block() on it. Because eb X was created in the current transaction, has no other references and writeback did not happen for it, we add it back to the free space cache/tree; 9) Later some other task T allocates the metadata extent from eb X, since it is marked as free space in the space cache/tree, and uses it as a node for some other btree; 10) The tree mod log user task calls btrfs_search_old_slot(), which calls get_old_root(), and finally that calls __tree_mod_log_oldest_root() with time_seq == 1 and eb_root == eb Y; 11) First iteration of the while loop finds the tree mod log element with sequence number N + 2, for the logical address of eb Y and of type MOD_LOG_ROOT_REPLACE; 12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out of the loop, and set root_logical to point to tm->old_root.logical which corresponds to the logical address of eb X; 13) On the next iteration of the while loop, the call to tree_mod_log_search_oldest() returns the smallest tree mod log element for the logical address of eb X, which has a sequence number of 2, an operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to the old slot N - 1 of eb X (eb X had N items in it before being freed); 14) We then break out of the while loop and return the tree mod log operation of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of eb X, to get_old_root(); 15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation and set "logical" to the logical address of eb X, which was the old root. We then call tree_mod_log_search() passing it the logical address of eb X and time_seq == 1; 16) Then before calling tree_mod_log_search(), task T adds a key to eb X, which results in adding a tree mod log operation of type MOD_LOG_KEY_ADD to the tree mod log - this is done at ctree.c:insert_ptr() - but after adding the tree mod log operation and before updating the number of items in eb X from 0 to 1... 17) The task at get_old_root() calls tree_mod_log_search() and gets the tree mod log operation of type MOD_LOG_KEY_ADD just added by task T. Then it enters the following if branch: if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) { (...) } (...) Calls read_tree_block() for eb X, which gets a reference on eb X but does not lock it - task T has it locked. Then it clones eb X while it has nritems set to 0 in its header, before task T sets nritems to 1 in eb X's header. From hereupon we use the clone of eb X which no other task has access to; 18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD mod log operation we just got from tree_mod_log_search() in the previous step and the cloned version of eb X; 19) At __tree_mod_log_rewind(), we set the local variable "n" to the number of items set in eb X's clone, which is 0. Then we enter the while loop, and in its first iteration we process the MOD_LOG_KEY_ADD operation, which just decrements "n" from 0 to (u32)-1, since "n" is declared with a type of u32. At the end of this iteration we call rb_next() to find the next tree mod log operation for eb X, that gives us the mod log operation of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence number of N + 1 (steps 3 to 6); 20) Then we go back to the top of the while loop and trigger the following BUG_ON(): (...) switch (tm->op) { case MOD_LOG_KEY_REMOVE_WHILE_FREEING: BUG_ON(tm->slot < n); fallthrough; (...) Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0. Fix this by taking a read lock on the extent buffer before cloning it at ctree.c:get_old_root(). This should be done regardless of the extent buffer having been freed and reused, as a concurrent task might be modifying it (while holding a write lock on it). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/ Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
72c9925f |
|
04-Feb-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix extent buffer leak on failure to copy root At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up returning without unlocking and releasing our reference on the extent buffer named "cow" we previously allocated with btrfs_alloc_tree_block(). So fix that by unlocking the extent buffer and dropping our reference on it before returning. Fixes: be20aa9dbadc8c ("Btrfs: Add mount option to turn off data cow") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
867ed321 |
|
14-Jan-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root While testing my error handling patches, I added a error injection site at btrfs_inc_extent_ref, to validate the error handling I added was doing the correct thing. However I hit a pretty ugly corruption while doing this check, with the following error injection stack trace: btrfs_inc_extent_ref btrfs_copy_root create_reloc_root btrfs_init_reloc_root btrfs_record_root_in_trans btrfs_start_transaction btrfs_update_inode btrfs_update_time touch_atime file_accessed btrfs_file_mmap This is because we do not catch the error from btrfs_inc_extent_ref, which in practice would be ENOMEM, which means we lose the extent references for a root that has already been allocated and inserted, which is the problem. Fix this by aborting the transaction if we fail to do the reference modification. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f75e2b79 |
|
16-Dec-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block The following patches are going to address error handling in relocation, in order to test those patches I need to be able to inject errors in btrfs_search_slot and btrfs_cow_block, as we call both of these pretty often in different cases during relocation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a664971 |
|
01-Dec-2020 |
ethanwu <ethanwu@synology.com> |
btrfs: correctly calculate item size used when item key collision happens Item key collision is allowed for some item types, like dir item and inode refs, but the overall item size is limited by the nodesize. item size(ins_len) passed from btrfs_insert_empty_items to btrfs_search_slot already contains size of btrfs_item. When btrfs_search_slot reaches leaf, we'll see if we need to split leaf. The check incorrectly reports that split leaf is required, because it treats the space required by the newly inserted item as btrfs_item + item data. But in item key collision case, only item data is actually needed, the newly inserted item could merge into the existing one. No new btrfs_item will be inserted. And split_leaf return EOVERFLOW from following code: if (extend && data_size + btrfs_item_size_nr(l, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info)) return -EOVERFLOW; In most cases, when callers receive EOVERFLOW, they either return this error or handle in different ways. For example, in normal dir item creation the userspace will get errno EOVERFLOW; in inode ref case INODE_EXTREF is used instead. However, this is not the case for rename. To avoid the unrecoverable situation in rename, btrfs_check_dir_item_collision is called in early phase of rename. In this function, when item key collision is detected leaf space is checked: data_size = sizeof(*di) + name_len; if (data_size + btrfs_item_size_nr(leaf, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info)) the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here refers to existing item size, the condition here correctly calculates the needed size for collision case rather than the wrong case above. The consequence of inconsistent condition check between btrfs_check_dir_item_collision and btrfs_search_slot when item key collision happens is that we might pass check here but fail later at btrfs_search_slot. Rename fails and volume is forced readonly [436149.586170] ------------[ cut here ]------------ [436149.586173] BTRFS: Transaction aborted (error -75) [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1 [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286 [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006 [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0 [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472 [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08 [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe [436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000 [436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0 [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [436149.586296] Call Trace: [436149.586311] vfs_rename+0x383/0x920 [436149.586313] ? vfs_rename+0x383/0x920 [436149.586315] do_renameat2+0x4ca/0x590 [436149.586317] __x64_sys_rename+0x20/0x30 [436149.586324] do_syscall_64+0x5a/0x120 [436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [436149.586332] RIP: 0033:0x7fa3133b1d37 [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37 [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0 [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0 [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00 Thanks to Hans van Kranenburg for information about crc32 hash collision tools, I was able to reproduce the dir item collision with following python script. https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run it under a btrfs volume will trigger the abort transaction. It simply creates files and rename them to forged names that leads to hash collision. There are two ways to fix this. One is to simply revert the patch 878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the condition consistent although that patch is correct about the size. The other way is to handle the leaf space check correctly when collision happens. I prefer the second one since it correct leaf space check in collision case. This fix will not account sizeof(struct btrfs_item) when the item already exists. There are two places where ins_len doesn't contain sizeof(struct btrfs_item), however. 1. extent-tree.c: lookup_inline_extent_backref 2. file-item.c: btrfs_csum_file_blocks to make the logic of btrfs_search_slot more clear, we add a flag search_for_extension in btrfs_path. This flag indicates that ins_len passed to btrfs_search_slot doesn't contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot will use the actual size needed to calculate the required leaf space. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
884b07d0 |
|
01-Dec-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors To support sectorsize < PAGE_SIZE case, we need to take extra care of extent buffer accessors. Since sectorsize is smaller than PAGE_SIZE, one page can contain multiple tree blocks, we must use eb->start to determine the real offset to read/write for extent buffer accessors. This patch introduces two helpers to do this: - get_eb_page_index() This is to calculate the index to access extent_buffer::pages. It's just a simple wrapper around "start >> PAGE_SHIFT". For sectorsize == PAGE_SIZE case, nothing is changed. For sectorsize < PAGE_SIZE case, we always get index as 0, and the existing page shift also works. - get_eb_offset_in_page() This is to calculate the offset to access extent_buffer::pages. This needs to take extent_buffer::start into consideration. For sectorsize == PAGE_SIZE case, extent_buffer::start is always aligned to PAGE_SIZE, thus adding extent_buffer::start to offset_in_page() won't change the result. For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives us the correct offset to access. This patch will touch the following parts to cover all extent buffer accessors: - BTRFS_SETGET_HEADER_FUNCS() - read_extent_buffer() - read_extent_buffer_to_user() - memcmp_extent_buffer() - write_extent_buffer_chunk_tree_uuid() - write_extent_buffer_fsid() - write_extent_buffer() - memzero_extent_buffer() - copy_extent_buffer_full() - copy_extent_buffer() - memcpy_extent_buffer() - memmove_extent_buffer() - btrfs_get_token_##bits() - btrfs_get_##bits() - btrfs_set_token_##bits() - btrfs_set_##bits() - generic_bin_search() Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95b982de |
|
13-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: simplify return values in setup_nodes_for_search The function is needlessly convoluted. Fix that by: * removing redundant sret variable definition in both if arms * replace the again/done labels with direct return statements, the function is short enough and doesn't do anything special upon exit * remove BUG_ON on split_node returning a positive number - it can't happen as split_node returns either 0 or a negative error code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d5286a92 |
|
12-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove useless return value statement in split_node At the point when we set 'ret = 0' it's guaranteed that the function is going to return 0 so directly return 0. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fe596ca3 |
|
06-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_tree_read_lock in btrfs_search_slot We no longer use recursion, so __btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock. Replace this call with the simple helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1bb96598 |
|
06-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: merge back btrfs_read_lock_root_node helpers We no longer have recursive locking and there's no need for separate helpers that allowed the transition to rwsem with minimal code changes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f5239dc |
|
06-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove btrfs_path::recurse With my async free space cache loading patches ("btrfs: load free space cache asynchronously") we no longer have a user of path->recurse and can remove it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e46318d |
|
06-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: unlock to current level in btrfs_next_old_leaf Filipe reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc2-btrfs-next-71 #1 Not tainted ------------------------------------------------------ find/324157 is trying to acquire lock: ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] but task is already holding lock: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: lock_acquire+0xd8/0x490 down_write_nested+0x44/0x120 __btrfs_tree_lock+0x27/0x120 [btrfs] btrfs_search_slot+0x2a3/0xc50 [btrfs] btrfs_insert_empty_items+0x58/0xa0 [btrfs] insert_with_overflow+0x44/0x110 [btrfs] btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs] btrfs_setxattr+0xd6/0x4c0 [btrfs] btrfs_setxattr_trans+0x68/0x100 [btrfs] __vfs_setxattr+0x66/0x80 __vfs_setxattr_noperm+0x70/0x200 vfs_setxattr+0x6b/0x120 setxattr+0x125/0x240 path_setxattr+0xba/0xd0 __x64_sys_setxattr+0x27/0x30 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (btrfs-tree-01#2/3){++++}-{3:3}: check_prev_add+0x91/0xc60 __lock_acquire+0x1689/0x3130 lock_acquire+0xd8/0x490 down_read_nested+0x45/0x220 __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] btrfs_next_old_leaf+0x27d/0x580 [btrfs] btrfs_real_readdir+0x1e3/0x4b0 [btrfs] iterate_dir+0x170/0x1c0 __x64_sys_getdents64+0x83/0x140 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(btrfs-tree-01#2/3); lock(btrfs-tree-00); lock(btrfs-tree-01#2/3); *** DEADLOCK *** 5 locks held by find/324157: #0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60 #1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0 #2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] #3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] #4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] stack backtrace: CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 check_noncircular+0xff/0x110 ? mark_lock.part.0+0x468/0xe90 check_prev_add+0x91/0xc60 __lock_acquire+0x1689/0x3130 ? kvm_clock_read+0x14/0x30 ? kvm_sched_clock_read+0x5/0x10 lock_acquire+0xd8/0x490 ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] down_read_nested+0x45/0x220 ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] btrfs_next_old_leaf+0x27d/0x580 [btrfs] btrfs_real_readdir+0x1e3/0x4b0 [btrfs] iterate_dir+0x170/0x1c0 __x64_sys_getdents64+0x83/0x140 ? filldir+0x1d0/0x1d0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens because btrfs_next_old_leaf searches down to our current key, and then walks up the path until we can move to the next slot, and then reads back down the path so we get the next leaf. However it doesn't unlock any lower levels until it replaces them with the new extent buffer. This is technically fine, but of course causes lockdep to complain, because we could be holding locks on lower levels while locking upper levels. Fix this by dropping all nodes below the level that we use as our new starting point before we start reading back down the path. This also allows us to drop the nested/recursive locking magic, because we're no longer locking two nodes at the same level anymore. Reported-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ffeb03cf |
|
06-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: cleanup the locking in btrfs_next_old_leaf We are carrying around this next_rw_lock from when we would do spinning vs blocking read locks. Now that we have the rwsem locking we can simply use the read lock flag unconditionally and the read lock helpers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b7ec85e |
|
05-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass root owner to read_tree_block In order to properly set the lockdep class of a newly allocated block we need to know the owner of the block. For non-refcounted trees this is straightforward, we always know in advance what tree we're reading from. For refcounted trees we don't necessarily know, however all refcounted trees share the same lockdep class name, tree-<level>. Fix all the callers of read_tree_block() to pass in the root objectid we're using. In places like relocation and backref we could probably unconditionally use 0, but just in case use the root when we have it, otherwise use 0 in the cases we don't have the root as it's going to be a refcounted tree anyway. This is a preparation patch for further changes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
206983b7 |
|
05-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_read_node_slot in btrfs_realloc_node We have this open-coded nightmare in btrfs_realloc_node that does the same thing that the normal read path does, which is to see if we have the eb in memory already, and if not read it, and verify the eb is uptodate. Delete this open coding and simply use btrfs_read_node_slot. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bfb484d9 |
|
05-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: cleanup extent buffer readahead We're going to pass around more information when we allocate extent buffers, in order to make that cleaner how we do readahead. Most of the callers have the parent node that we're getting our blockptr from, with the sole exception of relocation which simply has the bytenr it wants to read. Add a helper that takes the current arguments that we need (bytenr and gen), and add another helper for simply reading the slot out of a node. In followup patches the helper that takes all the extra arguments will be expanded, and the simpler helper won't need to have it's arguments adjusted. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b9729ce0 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: locking: rip out path->leave_spinning We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac5887c8 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: locking: remove all the blocking helpers Now that we're using a rw_semaphore we no longer need to indicate if a lock is blocking or not, nor do we need to flip the entire path from blocking to spinning. Remove these helpers and all the places they are called. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
572c83ac |
|
29-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: cleanup cow block on error In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead to a system lockup. It gets stuck trying to write back inodes, and the write back thread was trying to lock an extent buffer: $ cat /proc/2143497/stack [<0>] __btrfs_tree_lock+0x108/0x250 [<0>] lock_extent_buffer_for_io+0x35e/0x3a0 [<0>] btree_write_cache_pages+0x15a/0x3b0 [<0>] do_writepages+0x28/0xb0 [<0>] __writeback_single_inode+0x54/0x5c0 [<0>] writeback_sb_inodes+0x1e8/0x510 [<0>] wb_writeback+0xcc/0x440 [<0>] wb_workfn+0xd7/0x650 [<0>] process_one_work+0x236/0x560 [<0>] worker_thread+0x55/0x3c0 [<0>] kthread+0x13a/0x150 [<0>] ret_from_fork+0x1f/0x30 This is because we got an error while COWing a block, specifically here if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) { ret = btrfs_reloc_cow_block(trans, root, buf, cow); if (ret) { btrfs_abort_transaction(trans, ret); return ret; } } [16402.241552] BTRFS: Transaction aborted (error -2) [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540 [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8 [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540 [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282 [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388 [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380 [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000 [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0 [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000 [16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000 [16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0 [16402.257867] Call Trace: [16402.258072] btrfs_cow_block+0x109/0x230 [16402.258356] btrfs_search_slot+0x530/0x9d0 [16402.258655] btrfs_lookup_file_extent+0x37/0x40 [16402.259155] __btrfs_drop_extents+0x13c/0xd60 [16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0 [16402.259949] btrfs_replace_file_extents+0x190/0x820 [16402.260873] btrfs_clone+0x9ae/0xc00 [16402.261139] btrfs_extent_same_range+0x66/0x90 [16402.261771] btrfs_remap_file_range+0x353/0x3b1 [16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140 [16402.262821] vfs_dedupe_file_range+0x189/0x220 [16402.263150] do_vfs_ioctl+0x552/0x700 [16402.263662] __x64_sys_ioctl+0x62/0xb0 [16402.264023] do_syscall_64+0x33/0x40 [16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [16402.264862] RIP: 0033:0x7f542c7d15cb [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003 [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64 [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80 [16402.271498] irq event stamp: 0 [16402.271846] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0 [16402.273343] softirqs last enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0 [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0 [16402.274338] ---[ end trace 737874a5a41a8236 ]--- [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry [16402.276179] BTRFS info (device dm-9): forced readonly [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30 The problem here is that as soon as we allocate the new block it is locked and marked dirty in the btree inode. This means that we could attempt to writeback this block and need to lock the extent buffer. However we're not unlocking it here and thus we deadlock. Fix this by unlocking the cow block if we have any errors inside of __btrfs_cow_block, and also free it so we do not leak it. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7269ddd2 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: improve error message in setup_items_for_insert Reword and update formats to match variable types. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update formats ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da9ffb24 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: add kerneldoc for setup_items_for_insert Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc0d82e1 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: sink total_data parameter in setup_items_for_insert That parameter can easily be derived based on the "data_size" and "nr" parameters exploit this fact to simply the function's signature. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3dc9dc89 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: eliminate total_size parameter from setup_items_for_insert The value of this argument can be derived from the total_data as it's simply the value of the data size + size of btrfs_items being touched. Move the parameter calculation inside the function. This results in a simpler interface and also a minor size reduction: ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34) Function old new delta btrfs_duplicate_item 260 259 -1 setup_items_for_insert 1200 1190 -10 btrfs_insert_empty_items 177 154 -23 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc0716c2 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: re-arrange statements in setup_items_for_insert Rearrange statements calculating the offset of the newly added items so that the calculation has to be done only once. No functional change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ca9d473a |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use BTRFS_NESTED_NEW_ROOT for double splits I've made this change separate since it requires both of the newly added NESTED flags and I didn't want to slip it into one of those changes. If we do a double split of a node we can end up doing a BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it appears as a double lock. Since we're maxed out on subclasses, use BTRFS_NESTED_NEW_ROOT if we had to do a double split. This is OK because we won't have to do a double split if we had to insert a new root, and the new root would be at a higher level anyway. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf6f34aa |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots The way we add new roots is confusing from a locking perspective for lockdep. We generally have the rule that we lock things in order from highest level to lowest, but in the case of adding a new level to the tree we actually allocate a new block for the root, which makes the locking go in reverse. A similar issue exists for snapshotting, we cow the original root for the root of a new tree, however they're at the same level. Address this by using BTRFS_NESTING_NEW_ROOT for these operations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4dff97e6 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_SPLIT for split blocks If we are splitting a leaf/node, we could do something like the following lock(leaf) BTRFS_NESTING_NORMAL lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW push from leaf -> left reset path to point to left split left allocate new block, lock block BTRFS_NESTING_SPLIT at the new block point we need to have a different nesting level, because we have already used either BTRFS_NESTING_LEFT or BTRFS_NESTING_RIGHT when pushing items from the original leaf into the adjacent leaves. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf59a5a2 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW For similar reasons as BTRFS_NESTING_COW, we need BTRFS_NESTING_LEFT/RIGHT_COW. The pattern is this lock leaf -> BTRFS_NESTING_NORMAL cow leaf -> BTRFS_NESTING_COW split leaf lock left -> BTRFS_NESTING_LEFT cow left -> BTRFS_NESTING_LEFT_COW We need this in order to indicate to lockdep that these locks are discrete and are being taken in a safe order. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf77467a |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT Our lockdep maps are based on rootid+level, however in some cases we will lock adjacent blocks on the same level, namely in searching forward or in split/balance. Because of this lockdep will complain, so we need a separate subclass to indicate to lockdep that these are different locks. lock leaf -> BTRFS_NESTING_NORMAL cow leaf -> BTRFS_NESTING_COW split leaf lock left -> BTRFS_NESTING_LEFT lock right -> BTRFS_NESTING_RIGHT The above graph illustrates the need for this new nesting subclass. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9631e4cc |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks When we COW a block we are holding a lock on the original block, and then we lock the new COW block. Because our lockdep maps are based on root + level, this will make lockdep complain. We need a way to indicate a subclass for locking the COW'ed block, so plumb through our btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer, and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks. The reason I've added all this extra infrastructure is because there will be need of different nesting classes in follow up patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd7ba1c1 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add nesting tags to the locking helpers We will need these when we switch to an rwsem, so plumb in the infrastructure here to use later on. I violate the 80 character limit some here because it'll be cleaned up later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
51899412 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce btrfs_path::recurse Our current tree locking stuff allows us to recurse with read locks if we're already holding the write lock. This is necessary for the space cache inode, as we could be holding a lock on the root_tree root when we need to cache a block group, and thus need to be able to read down the root_tree to read in the inode cache. We can get away with this in our current locking, but we won't be able to with a rwsem. Handle this by purposefully annotating the places where we require recursion, so that in the future we can maybe come up with a way to avoid the recursion. In the case of the free space inode, this will be superseded by the free space tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d16c702f |
|
19-Aug-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: ctree: check key order before merging tree blocks [BUG] With a crafted image, btrfs can panic at btrfs_del_csums(): kernel BUG at fs/btrfs/ctree.c:3188! invalid opcode: 0000 [#1] SMP PTI CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9 RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180 RSP: 0018:ffff976141257ab8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000 RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0 Call Trace: truncate_one_csum+0xac/0xf0 btrfs_del_csums+0x24f/0x3a0 __btrfs_free_extent.isra.72+0x5a7/0xbe0 __btrfs_run_delayed_refs+0x539/0x1120 btrfs_run_delayed_refs+0xdb/0x1b0 btrfs_commit_transaction+0x52/0x950 ? start_transaction+0x94/0x450 transaction_kthread+0x163/0x190 kthread+0x105/0x140 ? btrfs_cleanup_transaction+0x560/0x560 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 Modules linked in: ---[ end trace 93bf9db00e6c374e ]--- [CAUSE] This crafted image has a tricky key order corruption: checksum tree key (CSUM_TREE ROOT_ITEM 0) node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE ... key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19 key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19 ... leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672 range start 73785344 end 75497472 length 1712128 item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4 range start 75497472 end 75501568 length 4096 item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740 range start 75501568 end 77283328 length 1781760 item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4 range start 77283328 end 77287424 length 4096 item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<< range start 4120596480 end 4120903680 length 307200 leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672 range start 77594624 end 79306752 length 1712128 ... Note the item 4 key of leaf 29757440, which is obviously too large, and even larger than the first key of the next leaf. However it still follows the key order in that tree block, thus tree checker is unable to detect it at read time, since tree checker can only work inside one leaf, thus such complex corruption can't be detected in advance. [FIX] The next time to detect such problem is at tree block merge time, which is in push_node_left(), balance_node_right(), push_leaf_left() or push_leaf_right(). Now we check if the key order of the right-most key of the left node is larger than the left-most key of the right node. By this we don't need to call the full tree-checker, while still keeping the key order correct as key order in each node is already checked by tree checker thus we only need to check the above two slots. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
260db43c |
|
04-Aug-2020 |
Randy Dunlap <rdunlap@infradead.org> |
btrfs: delete duplicated words + other fixes in comments Delete repeated words in fs/btrfs/. {to, the, a, and old} and change "into 2 part" to "into 2 parts". Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3beaa25 |
|
10-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: set the lockdep class for log tree extent buffers These are special extent buffers that get rewound in order to lookup the state of the tree at a specific point in time. As such they do not go through the normal initialization paths that set their lockdep class, so handle them appropriately when they are created and before they are locked. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
604997b4 |
|
27-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use the correct const function attribute for btrfs_get_num_csums The build robot reports compiler: h8300-linux-gcc (GCC) 9.3.0 In file included from fs/btrfs/tests/extent-map-tests.c:8: >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers] 2166 | size_t __const btrfs_get_num_csums(void); | ^~~~~~~ The function attribute for const does not follow the expected scheme and in this case is confused with a const type qualifier. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ce6ef5ab |
|
08-Jun-2020 |
David Sterba <dsterba@suse.com> |
btrfs: add little-endian optimized key helpers The CPU and on-disk keys are mapped to two different structures because of the endianness. There's an intermediate buffer used to do the conversion, but this is not necessary when CPU and on-disk endianness match. Add optimized versions of helpers that take disk_key and use the buffer directly for CPU keys or drop the intermediate buffer and conversion. This saves a lot of stack space accross many functions and removes about 6K of generated binary code: text data bss dec hex filename 1090439 17468 14912 1122819 112203 pre/btrfs.ko 1084613 17456 14912 1116981 110b35 post/btrfs.ko Delta: -5826 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c730ae0c |
|
16-Jun-2020 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: convert comments to fallthrough annotations Convert fall through comments to the pseudo-keyword which is now the preferred way. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
213ff4b7 |
|
27-May-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove redundant local variable in read_block_for_search The local 'b' variable is only used to directly read values from passed extent buffer. So eliminate it and directly use the input parameter. Furthermore this shrinks the size of the following functions: ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-73 (-73) Function old new delta read_block_for_search.isra 876 871 -5 push_node_left 1112 1044 -68 Total: Before=50348, After=50275, chg -0.14% Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
995e9a16 |
|
27-May-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: open code key_search This function wraps the optimisation implemented by d7396f07358a ("Btrfs: optimize key searches in btrfs_search_slot") however this optimisation is really used in only one place - btrfs_search_slot. Just open code the optimisation and also add a comment explaining how it works since it's not clear just by looking at the code - the key point here is it depends on an internal invariant that BTRFS' btree provides, namely intermediate pointers always contain the key at slot0 at the child node. So in the case of exact match we can safely assume that the given key will always be in slot 0 on lower levels. Furthermore this results in a reduction of btrfs_search_slot's size: ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-75 (-75) Function old new delta btrfs_search_slot 2783 2708 -75 Total: Before=50423, After=50348, chg -0.15% Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92a7cc42 |
|
15-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE The name BTRFS_ROOT_REF_COWS is not very clear about the meaning. In fact, that bit can only be set to those trees: - Subvolume roots - Data reloc root - Reloc roots for above roots All other trees won't get this bit set. So just by the result, it is obvious that, roots with this bit set can have tree blocks shared with other trees. Either shared by snapshots, or by reloc roots (an special snapshot created by relocation). This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and update all comment mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5cd17f34 |
|
29-Apr-2020 |
David Sterba <dsterba@suse.com> |
btrfs: speed up and simplify generic_bin_search The bin search jumps over the extent buffer item keys, comparing directly the bytes if the key is in one page, or storing it in a temporary buffer in case it spans two pages. The mapping start and length are obtained from map_private_extent_buffer, which is heavy weight compared to what we need. We know the key size and can find out the eb page in a simple way. For keys spanning two pages the fallback read_extent_buffer is used. The temporary variables are reduced and moved to the scope of use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a31356b9 |
|
29-Apr-2020 |
David Sterba <dsterba@suse.com> |
btrfs: don't use set/get token in leaf_space_used The token is supposed to cache the last page used by the set/get helpers. In leaf_space_used the first and last items are accessed, it's not likely they'd be on the same page so there's some overhead caused updating the token address but not using it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc4c13d5 |
|
28-Apr-2020 |
David Sterba <dsterba@suse.com> |
btrfs: drop eb parameter from set/get token helpers Now that all set/get helpers use the eb from the token, we don't need to pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some stack space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3b83361 |
|
17-Apr-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove the redundant parameter level in btrfs_bin_search() All callers pass the eb::level so we can get read it directly inside the btrfs_bin_search and key_search. This is inspired by the work of Marek in U-boot. CC: Marek Behun <marek.behun@nic.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
59a0fcdb |
|
27-Feb-2020 |
David Sterba <dsterba@suse.com> |
btrfs: inline checksum name and driver definitions There's an unnecessary indirection in the checksum definition table, pointer and the string itself. The strings are short and the overall size of one entry is now 24 bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42c9d0b5 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: simplify parameters of btrfs_set_disk_extent_flags All callers pass extent buffer start and length so the extent buffer itself should work fine. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b908c334 |
|
05-Feb-2020 |
David Sterba <dsterba@suse.com> |
btrfs: move root node locking helpers to locking.c The helpers are related to locking so move them there, update comments. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42836cf4 |
|
21-Jan-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: don't iterate mod seq list when putting a tree mod seq Each new element added to the mod seq list is always appended to the list, and each one gets a sequence number coming from a counter which gets incremented everytime a new element is added to the list (or a new node is added to the tree mod log rbtree). Therefore the element with the lowest sequence number is always the first element in the list. So just remove the list iteration at btrfs_put_tree_mod_seq() that computes the minimum sequence number in the list and replace it with a check for the first element's sequence number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7227ff4d |
|
21-Jan-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between adding and putting tree mod seq elements and nodes There is a race between adding and removing elements to the tree mod log list and rbtree that can lead to use-after-free problems. Consider the following example that explains how/why the problems happens: 1) Task A has mod log element with sequence number 200. It currently is the only element in the mod log list; 2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to access the tree mod log. When it enters the function, it initializes 'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock' before checking if there are other elements in the mod seq list. Since the list it empty, 'min_seq' remains set to (u64)-1. Then it unlocks the lock 'tree_mod_seq_lock'; 3) Before task A acquires the lock 'tree_mod_log_lock', task B adds itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a sequence number of 201; 4) Some other task, name it task C, modifies a btree and because there elements in the mod seq list, it adds a tree mod elem to the tree mod log rbtree. That node added to the mod log rbtree is assigned a sequence number of 202; 5) Task B, which is doing fiemap and resolving indirect back references, calls btrfs get_old_root(), with 'time_seq' == 201, which in turn calls tree_mod_log_search() - the search returns the mod log node from the rbtree with sequence number 202, created by task C; 6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating the mod log rbtree and finds the node with sequence number 202. Since 202 is less than the previously computed 'min_seq', (u64)-1, it removes the node and frees it; 7) Task B still has a pointer to the node with sequence number 202, and it dereferences the pointer itself and through the call to __tree_mod_log_rewind(), resulting in a use-after-free problem. This issue can be triggered sporadically with the test case generic/561 from fstests, and it happens more frequently with a higher number of duperemove processes. When it happens to me, it either freezes the VM or it produces a trace like the following before crashing: [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1 [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 1245.321287] RIP: 0010:rb_next+0x16/0x50 [ 1245.321307] Code: .... [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202 [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80 [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000 [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038 [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8 [ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000 [ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0 [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1245.321706] Call Trace: [ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs] [ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs] [ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs] [ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs] [ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322066] do_vfs_ioctl+0x45a/0x750 [ 1245.322081] ksys_ioctl+0x70/0x80 [ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 1245.322113] __x64_sys_ioctl+0x16/0x20 [ 1245.322126] do_syscall_64+0x5c/0x280 [ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1245.322155] RIP: 0033:0x7fdee3942dd7 [ 1245.322177] Code: .... [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7 [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004 [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44 [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48 [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50 [ 1245.322423] Modules linked in: .... [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]--- Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum sequence number and iterates the rbtree while holding the lock 'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock' lock, since it is now redundant. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6609fee8 |
|
05-Dec-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues When a tree mod log user no longer needs to use the tree it calls btrfs_put_tree_mod_seq() to remove itself from the list of users and delete all no longer used elements of the tree's red black tree, which should be all elements with a sequence number less then our equals to the caller's sequence number. However the logic is broken because it can delete and free elements from the red black tree that have a sequence number greater then the caller's sequence number: 1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the tree mod log; 2) The task which got assigned the sequence number 1 calls btrfs_put_tree_mod_seq(); 3) Sequence number 1 is deleted from the list of sequence numbers; 4) The current minimum sequence number is computed to be the sequence number 2; 5) A task using sequence number 2 is at tree_mod_log_rewind() and gets a pointer to one of its elements from the red black tree through a call to tree_mod_log_search(); 6) The task with sequence number 1 iterates the red black tree of tree modification elements and deletes (and frees) all elements with a sequence number less then or equals to 2 (the computed minimum sequence number) - it ends up only leaving elements with sequence numbers of 3 and 4; 7) The task with sequence number 2 now uses the pointer to its element, already freed by the other task, at __tree_mod_log_rewind(), resulting in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces a trace like the following: [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1 [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [16804.548666] RIP: 0010:rb_next+0x16/0x50 (...) [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202 [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600 [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000 [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8 [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000 [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0 [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [16804.557583] Call Trace: [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs] [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs] [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs] [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs] [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs] [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs] [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs] [16804.563112] ? __might_fault+0x11/0x90 [16804.563706] do_vfs_ioctl+0x45a/0x700 [16804.564299] ksys_ioctl+0x70/0x80 [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20 [16804.565461] __x64_sys_ioctl+0x16/0x20 [16804.566020] do_syscall_64+0x5c/0x250 [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe [16804.567153] RIP: 0033:0x7f4b1ba2add7 (...) [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7 [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003 [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44 [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48 [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50 (...) [16804.575623] ---[ end trace 87317359aad4ba50 ]--- Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that have a sequence number equals to the computed minimum sequence number, and not just elements with a sequence number greater then that minimum. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
352ae07b |
|
07-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add blake2b to checksumming algorithms Add blake2b (with 256 bit digest) to the list of possible checksumming algorithms used by BTRFS. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4e967be |
|
08-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add member for a specific checksum driver Currently all the checksum algorithms generate a fixed size digest size and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes and BLAKE2b produces digest of 512 bits by default. We can't do that and will use the blake2b-256, this needs to be passed to the crypto API. Separate that from the base algorithm name and add a member to request specific driver, in this case with the digest size. The only place that uses the driver name is the crypto API setup. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7cea56c |
|
07-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: sysfs: export supported checksums Export supported checksum algorithms via sysfs in the list of static features: /sys/fs/btrfs/features/supported_checksums Space spearated list of checksum algorithm names. Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3831bf00 |
|
07-Oct-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: add sha256 to checksumming algorithm Add sha256 to the list of possible checksumming algorithms used by BTRFS. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3951e7f0 |
|
07-Oct-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: add xxhash64 to checksumming algorithms Add xxhash64 to the list of possible checksumming algorithms used by BTRFS. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67439dad |
|
08-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: opencode extent_buffer_get The helper is trivial and we can understand what the atomic_inc on something named refs does. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e1f60a65 |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add __pure attribute to functions The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure . The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1f95ec01 |
|
24-Sep-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move btrfs_unlock_up_safe to other locking functions The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed2b1d36 |
|
24-Sep-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move btrfs_set_path_blocking to other locking functions The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34ffafdb |
|
10-Sep-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: ctree: Remove stray comment of setting up path lock The following comment shows up in btrfs_search_slot() with out much sense: /* * setup the path here so we can release it under lock * contention with the cow code */ if (cow) { /* code touching path->lock[] is far away from here */ } This comment hasn't been cleaned up after the relevant code has been removed. The original code is introduced in commit 65b51a009e29 ("btrfs_search_slot: reduce lock contention by cowing in two stages"): + + /* + * setup the path here so we can release it under lock + * contention with the cow code + */ + p->nodes[level] = b; + if (!p->skip_locking) + p->locks[level] = 1; + But in current code, we have different timing for modifying path lock, so just remove the comment. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
abe9339d |
|
10-Sep-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: ctree: Reduce one indent level for btrfs_search_old_slot() Similar to btrfs_search_slot() done in previous patch, make a shortcut for the level 0 case and allow to reduce indentation for the remaining case. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f624d976 |
|
10-Sep-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: ctree: Reduce one indent level for btrfs_search_slot() In btrfs_search_slot(), we something like: if (level != 0) { /* Do search inside tree nodes*/ } else { /* Do search inside tree leaves */ goto done; } This caused extra indent for tree node search code. Change it to something like: if (level == 0) { /* Do search inside tree leaves */ goto done' } /* Do search inside tree nodes */ So we have more space to maneuver our code, this is especially useful as the tree nodes search code is more complex than the leaves search code. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
65e99c43 |
|
04-Sep-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic Those function are simple boolean predicates there is no need to assign their return values to interim variables. Use them directly as predicates. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
af024ed2 |
|
30-Aug-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: create structure to encode checksum type and length Create a structure to encode the type and length for the known on-disk checksums. This makes it easier to add new checksums later. The structure and helpers are moved from ctree.h so they don't occupy space in all headers including ctree.h. This save some space in the final object. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c82f823c |
|
09-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: tie extent buffer and it's token together Further simplifaction of the get/set helpers is possible when the token is uniquely tied to an extent buffer. A condition and an assignment can be avoided. The initializations are moved closer to the first use when the extent buffer is valid. There's one exception in __push_leaf_left where the token is reused. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
18d0f5c6 |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move functions for tree compare to send.c Send is the only user of tree_compare, we can move it there along with the other helpers and definitions. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4b231ae4 |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename and export read_node_slot Preparatory work for code that will be moved out of ctree and uses this function. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efad8a85 |
|
12-Aug-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix use-after-free when using the tree modification log At ctree.c:get_old_root(), we are accessing a root's header owner field after we have freed the respective extent buffer. This results in an use-after-free that can lead to crashes, and when CONFIG_DEBUG_PAGEALLOC is set, results in a stack trace like the following: [ 3876.799331] stack segment: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3876.799363] CPU: 0 PID: 15436 Comm: pool Not tainted 5.3.0-rc3-btrfs-next-54 #1 [ 3876.799385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 3876.799433] RIP: 0010:btrfs_search_old_slot+0x652/0xd80 [btrfs] (...) [ 3876.799502] RSP: 0018:ffff9f08c1a2f9f0 EFLAGS: 00010286 [ 3876.799518] RAX: ffff8dd300000000 RBX: ffff8dd85a7a9348 RCX: 000000038da26000 [ 3876.799538] RDX: 0000000000000000 RSI: ffffe522ce368980 RDI: 0000000000000246 [ 3876.799559] RBP: dae1922adadad000 R08: 0000000008020000 R09: ffffe522c0000000 [ 3876.799579] R10: ffff8dd57fd788c8 R11: 000000007511b030 R12: ffff8dd781ddc000 [ 3876.799599] R13: ffff8dd9e6240578 R14: ffff8dd6896f7a88 R15: ffff8dd688cf90b8 [ 3876.799620] FS: 00007f23ddd97700(0000) GS:ffff8dda20200000(0000) knlGS:0000000000000000 [ 3876.799643] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3876.799660] CR2: 00007f23d4024000 CR3: 0000000710bb0005 CR4: 00000000003606f0 [ 3876.799682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3876.799703] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3876.799723] Call Trace: [ 3876.799735] ? do_raw_spin_unlock+0x49/0xc0 [ 3876.799749] ? _raw_spin_unlock+0x24/0x30 [ 3876.799779] resolve_indirect_refs+0x1eb/0xc80 [btrfs] [ 3876.799810] find_parent_nodes+0x38d/0x1180 [btrfs] [ 3876.799841] btrfs_check_shared+0x11a/0x1d0 [btrfs] [ 3876.799870] ? extent_fiemap+0x598/0x6e0 [btrfs] [ 3876.799895] extent_fiemap+0x598/0x6e0 [btrfs] [ 3876.799913] do_vfs_ioctl+0x45a/0x700 [ 3876.799926] ksys_ioctl+0x70/0x80 [ 3876.799938] ? trace_hardirqs_off_thunk+0x1a/0x20 [ 3876.799953] __x64_sys_ioctl+0x16/0x20 [ 3876.799965] do_syscall_64+0x62/0x220 [ 3876.799977] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3876.799993] RIP: 0033:0x7f23e0013dd7 (...) [ 3876.800056] RSP: 002b:00007f23ddd96ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 3876.800078] RAX: ffffffffffffffda RBX: 00007f23d80210f8 RCX: 00007f23e0013dd7 [ 3876.800099] RDX: 00007f23d80210f8 RSI: 00000000c020660b RDI: 0000000000000003 [ 3876.800626] RBP: 000055fa2a2a2440 R08: 0000000000000000 R09: 00007f23ddd96d7c [ 3876.801143] R10: 00007f23d8022000 R11: 0000000000000246 R12: 00007f23ddd96d80 [ 3876.801662] R13: 00007f23ddd96d78 R14: 00007f23d80210f0 R15: 00007f23ddd96d80 (...) [ 3876.805107] ---[ end trace e53161e179ef04f9 ]--- Fix that by saving the root's header owner field into a local variable before freeing the root's extent buffer, and then use that local variable when needed. Fixes: 30b0463a9394d9 ("Btrfs: fix accessing the root pointer in tree mod log functions") CC: stable@vger.kernel.org # 3.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a9fb468 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make caching_thread use btrfs_find_next_key extent-tree.c has a find_next_key that just walks up the path to find the next key, but it is used for both the caching stuff and the snapshot delete stuff. The snapshot deletion stuff is special so it can't really use btrfs_find_next_key, but the caching thread stuff can. We just need to fix btrfs_find_next_key to deal with ->skip_locking and then it works exactly the same as the private find_next_key helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
73e82fe4 |
|
27-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: assert tree mod log lock in __tree_mod_log_insert The tree is going to be modified so it must be the exclusive lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c15d410 |
|
24-Apr-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safe We have a long standing problem with reversed keys that's detected by btrfs_set_item_key_safe. This is hard to reproduce so we'd like to capture more information for later analysis. Let's dump the leaf content before triggering BUG_ON() so that we can have some clue on what's going wrong. The output of tree locks should help us to debug such problem. Sample stacktrace: generic/522 [00:07:05] [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05 [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192! [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G W 5.1.0-rc5-default+ #562 [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014 [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs] [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286 [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000 [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7 [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000 [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70 [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000 [27161.493845] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000 [27161.495608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0 [27161.498100] Call Trace: [27161.498771] __btrfs_drop_extents+0x6ec/0xdf0 [btrfs] [27161.499872] btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs] [27161.501114] btrfs_log_inode+0x7ff/0xdc0 [btrfs] [27161.502114] ? __mutex_unlock_slowpath+0x4b/0x2b0 [27161.503172] btrfs_log_inode_parent+0x237/0x9c0 [btrfs] [27161.504348] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [27161.505374] btrfs_sync_file+0x1b7/0x480 [btrfs] [27161.506371] __x64_sys_msync+0x180/0x210 [27161.507208] do_syscall_64+0x54/0x180 [27161.507932] entry_SYSCALL_64_after_hwframe+0x49/0xbe [27161.508839] RIP: 0033:0x7f8ea5aa9c61 [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61 [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000 [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000 [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938 [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000 [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug] [27161.522272] ---[ end trace d5afec7ccac6a252 ]--- [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs] [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286 [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000 [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7 [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000 [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70 [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000 [27161.534405] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000 [27161.536048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5c8daa5 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flags Signed-off-by: David Sterba <dsterba@suse.com>
|
#
179d1e6a |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from from tree_advance Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7da9597 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from tree_move_down Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c71dd880 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from btrfs_extend_item Signed-off-by: David Sterba <dsterba@suse.com>
|
#
78ac4f9e |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from btrfs_truncate_item Signed-off-by: David Sterba <dsterba@suse.com>
|
#
25263cd7 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from split_item Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8087c193 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in __push_leaf_left We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f72f0010 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in __push_leaf_right We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
94f94ad9 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in copy_for_split We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ad3cf6d |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in insert_ptr We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
55d32ed8 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in balance_node_right We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d30a668f |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in push_node_left We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e064d5e9 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in btrfs_verify_level_key We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d0d20b0f |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in read_node_slot We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e902baac |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in btrfs_leaf_free_space We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a884d7d |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in clean_tree_block We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed874f0d |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in tree_mod_log_eb_copy We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8f881e8c |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in leaf_data_end We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
290342f6 |
|
25-Mar-2019 |
Arnd Bergmann <arnd@arndb.de> |
btrfs: use BUG() instead of BUG_ON(1) BUG_ON(1) leads to bogus warnings from clang when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] BUG_ON(1); ^~~~~~~~~ include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^~~~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' # define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x))) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here max_chunk_size); ^~~~~~~~~~~~~~ include/linux/kernel.h:860:36: note: expanded from macro 'min' #define min(x, y) __careful_cmp(x, y, <) ^ include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp' __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op)) ^ include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once' typeof(y) unique_y = (y); \ ^ fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true BUG_ON(1); ^ include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^ fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning u64 max_chunk_size; ^ = 0 Change it to BUG() so clang can see that this code path can never continue. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
448de471 |
|
12-Mar-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Check the first key and level for cached extent buffer [BUG] When reading a file from a fuzzed image, kernel can panic like: BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1 assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3500! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs] Call Trace: btrfs_lookup_csum+0x52/0x150 [btrfs] __btrfs_lookup_bio_sums+0x209/0x640 [btrfs] btrfs_submit_bio_hook+0x103/0x170 [btrfs] submit_one_bio+0x59/0x80 [btrfs] extent_read_full_page+0x58/0x80 [btrfs] generic_file_read_iter+0x2f6/0x9d0 __vfs_read+0x14d/0x1a0 vfs_read+0x8d/0x140 ksys_read+0x52/0xc0 do_syscall_64+0x60/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe [CAUSE] The fuzzed image has a corrupted leaf whose first key doesn't match its parent: checksum tree key (CSUM_TREE ROOT_ITEM 0) node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6 chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c ... key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19 leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE leaf 29761536 flags 0x1(WRITTEN) backref revision 1 fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6 chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244 range start 8798638964736 end 8798641262592 length 2297856 When reading the above tree block, we have extent_buffer->refs = 2 in the context: - initial one from __alloc_extent_buffer() alloc_extent_buffer() |- __alloc_extent_buffer() |- atomic_set(&eb->refs, 1) - one being added to fs_info->buffer_radix alloc_extent_buffer() |- check_buffer_tree_ref() |- atomic_inc(&eb->refs) So if even we call free_extent_buffer() in read_tree_block or other similar situation, we only decrease the refs by 1, it doesn't reach 0 and won't be freed right now. The staled eb and its corrupted content will still be kept cached. Furthermore, we have several extra cases where we either don't do first key check or the check is not proper for all callers: - scrub We just don't have first key in this context. - shared tree block One tree block can be shared by several snapshot/subvolume trees. In that case, the first key check for one subvolume doesn't apply to another. So for the above reasons, a corrupted extent buffer can sneak into the buffer cache. [FIX] Call verify_level_key in read_block_for_search to do another verification. For that purpose the function is exported. Due to above reasons, although we can free corrupted extent buffer from cache, we still need the check in read_block_for_search(), for scrub and shared tree blocks. Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755 Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757 Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759 Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761 Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767 Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769 Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
253002f2 |
|
20-Feb-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove assertion when searching for a key in a node/leaf At ctree.c:key_search(), the assertion that verifies the first key on a child extent buffer corresponds to the key at a specific slot in the parent has a disadvantage: we effectively hit a BUG_ON() which requires rebooting the machine later. It also does not tell any information about which extent buffer is affected, from which root, the expected and found keys, etc. However as of commit 581c1760415c48 ("btrfs: Validate child tree block's level and first key"), that assertion is not needed since at the time we read an extent buffer from disk we validate that its first key matches the key, at the respective slot, in the parent extent buffer. Therefore just remove the assertion at key_search(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cbca7d59 |
|
18-Feb-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: add missing error handling after doing leaf/node binary search The function map_private_extent_buffer() can return an -EINVAL error, and it is called by generic_bin_search() which will return back the error. The btrfs_bin_search() function in turn calls generic_bin_search() and the key_search() function calls btrfs_bin_search(), so both can return the -EINVAL error coming from the map_private_extent_buffer() function. Some callers of these functions were ignoring that these functions can return an error, so fix them to deal with error return values. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
766ece54 |
|
23-Jan-2019 |
David Sterba <dsterba@suse.com> |
btrfs: merge btrfs_set_lock_blocking_rw with it's caller The last caller that does not have a fixed value of lock is btrfs_set_path_blocking, that actually does the same conditional swtich by the lock type so we can merge the branches together and remove the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8bead258 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: open code now trivial btrfs_set_lock_blocking btrfs_set_lock_blocking is now only a simple wrapper around btrfs_set_lock_blocking_write. The name does not bring any semantic value that could not be inferred from the new function so there's no point keeping it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
300aa896 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers We can use the right helper where the lock type is a fixed parameter. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f616f5cd |
|
23-Jan-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Use delayed subtree rescan for balance Before this patch, qgroup code traces the whole subtree of subvolume and reloc trees unconditionally. This makes qgroup numbers consistent, but it could cause tons of unnecessary extent tracing, which causes a lot of overhead. However for subtree swap of balance, just swap both subtrees because they contain the same contents and tree structure, so qgroup numbers won't change. It's the race window between subtree swap and transaction commit could cause qgroup number change. This patch will delay the qgroup subtree scan until COW happens for the subtree root. So if there is no other operations for the fs, balance won't cause extra qgroup overhead. (best case scenario) Depending on the workload, most of the subtree scan can still be avoided. Only for worst case scenario, it will fall back to old subtree swap overhead. (scan all swapped subtrees) [[Benchmark]] Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. And after file system population, there is no other activity, so it should be the best case scenario. | v4.20-rc1 | w/ patchset | diff ----------------------------------------------------------------------- relocated extents | 22615 | 22457 | -0.1% qgroup dirty extents | 163457 | 121606 | -25.6% time (sys) | 22.884s | 18.842s | -17.6% time (real) | 27.724s | 22.884s | -17.5% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a6279470 |
|
25-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock when allocating tree block during leaf/node split When splitting a leaf or node from one of the trees that are modified when flushing pending block groups (extent, chunk, device and free space trees), we need to allocate a new tree block, which in turn can result in the need to allocate a new block group. After allocating the new block group we may need to flush new block groups that were previously allocated during the course of the current transaction, which is what may cause a deadlock due to attempts to write lock twice the same leaf or node, as when splitting a leaf or node we are holding a write lock on it and its parent node. The same type of deadlock can also happen when increasing the tree's height, since we are holding a lock on the existing root while allocating the tree block to use as the new root node. An example trace when the deadlock happens during the leaf split path is: [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G W 4.19.16 #1 [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018 [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] (...) [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246 [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001 [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48 [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000 [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000 [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18 [27175.303601] FS: 0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000 [27175.304468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0 [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [27175.307940] Call Trace: [27175.308802] btrfs_search_slot+0x779/0x9a0 [btrfs] [27175.309669] ? update_space_info+0xba/0xe0 [btrfs] [27175.310534] btrfs_insert_empty_items+0x67/0xc0 [btrfs] [27175.311397] btrfs_insert_item+0x60/0xd0 [btrfs] [27175.312253] btrfs_create_pending_block_groups+0xee/0x210 [btrfs] [27175.313116] do_chunk_alloc+0x25f/0x300 [btrfs] [27175.313984] find_free_extent+0x706/0x10d0 [btrfs] [27175.314855] btrfs_reserve_extent+0x9b/0x1d0 [btrfs] [27175.315707] btrfs_alloc_tree_block+0x100/0x5b0 [btrfs] [27175.316548] split_leaf+0x130/0x610 [btrfs] [27175.317390] btrfs_search_slot+0x94d/0x9a0 [btrfs] [27175.318235] btrfs_insert_empty_items+0x67/0xc0 [btrfs] [27175.319087] alloc_reserved_file_extent+0x84/0x2c0 [btrfs] [27175.319938] __btrfs_run_delayed_refs+0x596/0x1150 [btrfs] [27175.320792] btrfs_run_delayed_refs+0xed/0x1b0 [btrfs] [27175.321643] delayed_ref_async_start+0x81/0x90 [btrfs] [27175.322491] normal_work_helper+0xd0/0x320 [btrfs] [27175.323328] ? move_linked_works+0x6e/0xa0 [27175.324160] process_one_work+0x191/0x370 [27175.324976] worker_thread+0x4f/0x3b0 [27175.325763] kthread+0xf8/0x130 [27175.326531] ? rescuer_thread+0x320/0x320 [27175.327284] ? kthread_create_worker_on_cpu+0x50/0x50 [27175.328027] ret_from_fork+0x35/0x40 [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]--- Fix this by preventing the flushing of new blocks groups when splitting a leaf/node and when inserting a new root node for one of the trees modified by the flushing operation, similar to what is done when COWing a node/leaf from on of these trees. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383 Reported-by: Eli V <eliventer@gmail.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a6d8654d |
|
08-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock when using free space tree due to block group creation When modifying the free space tree we can end up COWing one of its extent buffers which in turn might result in allocating a new chunk, which in turn can result in flushing (finish creation) of pending block groups. If that happens we can deadlock because creating a pending block group needs to update the free space tree, and if any of the updates tries to modify the same extent buffer that we are COWing, we end up in a deadlock since we try to write lock twice the same extent buffer. So fix this by skipping pending block group creation if we are COWing an extent buffer from the free space tree. This is a case missed by commit 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches"). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173 Fixes: 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches") CC: stable@vger.kernel.org # 4.18+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
52042d8e |
|
27-Nov-2018 |
Andrea Gelmini <andrea.gelmini@gelma.net> |
btrfs: Fix typos in comments and strings The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
be6821f8 |
|
11-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: send, fix race with transaction commits that create snapshots If we create a snapshot of a snapshot currently being used by a send operation, we can end up with send failing unexpectedly (returning -ENOENT error to user space for example). The following diagram shows how this happens. CPU 1 CPU2 CPU3 btrfs_ioctl_send() (...) create_snapshot() -> creates snapshot of a root used by the send task btrfs_commit_transaction() create_pending_snapshot() __get_inode_info() btrfs_search_slot() btrfs_search_slot_get_root() down_read commit_root_sem get reference on eb of the commit root -> eb with bytenr == X up_read commit_root_sem btrfs_cow_block(root node) btrfs_free_tree_block() -> creates delayed ref to free the extent btrfs_run_delayed_refs() -> runs the delayed ref, adds extent to fs_info->pinned_extents btrfs_finish_extent_commit() unpin_extent_range() -> marks extent as free in the free space cache transaction commit finishes btrfs_start_transaction() (...) btrfs_cow_block() btrfs_alloc_tree_block() btrfs_reserve_extent() -> allocates extent at bytenr == X btrfs_init_new_buffer(bytenr X) btrfs_find_create_tree_block() alloc_extent_buffer(bytenr X) find_extent_buffer(bytenr X) -> returns existing eb, which the send task got (...) -> modifies content of the eb with bytenr == X -> uses an eb that now belongs to some other tree and no more matches the commit root of the snapshot, resuts will be unpredictable The consequences of this race can be various, and can lead to searches in the commit root performed by the send task failing unexpectedly (unable to find inode items, returning -ENOENT to user space, for example) or not failing because an inode item with the same number was added to the tree that reused the metadata extent, in which case send can behave incorrectly in the worst case or just fail later for some reason. Fix this by performing a copy of the commit root's extent buffer when doing a search in the context of a send operation. CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e2e9: Btrfs: move get root out of btrfs_search_slot to a helper CC: stable@vger.kernel.org # 4.4.x: f9ddfd0592a: Btrfs: remove unused check of skip_locking CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
83354f07 |
|
30-Nov-2018 |
Josef Bacik <jbacik@fb.com> |
btrfs: catch cow on deleting snapshots When debugging some weird extent reference bug I suspected that we were changing a snapshot while we were deleting it, which could explain my bug. This was indeed what was happening, and this patch helped me verify my theory. It is never correct to modify the snapshot once it's being deleted, so mark the root when we are deleting it and make sure we complain about it when it happens. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de37aa51 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fsid/metadata_fsid fields from btrfs_info Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7239ff4b |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Introduce support for FSID change without metadata rewrite This field is going to be used when the user wants to change the UUID of the filesystem without having to rewrite all metadata blocks. This field adds another level of indirection such that when the FSID is changed what really happens is the current UUID (the one with which the fs was created) is copied to the 'metadata_uuid' field in the superblock as well as a new incompat flag is set METADATA_UUID. When the kernel detects this flag is set it knows that the superblock in fact has 2 UUIDs: 1. Is the UUID which is user-visible, currently known as FSID. 2. Metadata UUID - this is the UUID which is stamped into all on-disk datastructures belonging to this file system. When the new incompat flag is present device scanning checks whether both fsid/metadata_uuid of the scanned device match any of the registered filesystems. When the flag is not set then both UUIDs are equal and only the FSID is retained on disk, metadata_uuid is set only in-memory during mount. Additionally a new metadata_uuid field is also added to the fs_info struct. It's initialised either with the FSID in case METADATA_UUID incompat flag is not set or with the metdata_uuid of the superblock otherwise. This commit introduces the new fields as well as the new incompat flag and switches all users of the fsid to the new logic. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates in comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8c7eeb65 |
|
15-Aug-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extra reference count bumps in btrfs_compare_trees When the 2 comparison trees roots are initialised they are private to the function and already have reference counts of 1 each. There is no need to further increment the reference count since the cloned buffers are already accessed via struct btrfs_path. Eventually the 2 paths used for comparison are going to be released, effectively disposing of the cloned buffers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
24cee18a |
|
15-Aug-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind When a rewound buffer is created it already has a ref count of 1 and the dummy flag set. Then another ref is taken bumping the count to 2. Finally when this buffer is released from btrfs_release_path the extra reference is decremented by the special handling code in free_extent_buffer. However, this special code is in fact redundant sinca ref count of 1 is still correct since the buffer is only accessed via btrfs_path struct. This paves the way forward of removing the special handling in free_extent_buffer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6c122e2a |
|
15-Aug-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove redundant extent_buffer_get in get_old_root get_old_root used used only by btrfs_search_old_slot to initialise the path structure. The old root is always a cloned buffer (either via alloc dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1 from allocation, 1 from extent_buffer_get call in get_old_root. This latter explicit ref count acquire operation is in fact unnecessary since the semantic is such that the newly allocated buffer is handed over to the btrfs_path for lifetime management. Considering this just remove the extra extent_buffer_get in get_old_root. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ce55557 |
|
12-Oct-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock when writing out free space caches When writing out a block group free space cache we can end deadlocking with ourselves on an extent buffer lock resulting in a warning like the following: [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G W I 4.16.8 #1 [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246 [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001 [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20 [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000 [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630 [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000 [245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000 [245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0 [245043.408670] Call Trace: [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [245043.425538] ? iput+0x72/0x1e0 [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] [245043.433341] kthread+0xf0/0x130 [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 [245043.437236] ret_from_fork+0x1f/0x30 [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- This is because at write_one_cache_group() when we are COWing a leaf from the extent tree we end up allocating a new block group (chunk) and, because we have hit a threshold on the number of bytes reserved for system chunks, we attempt to finalize the creation of new block groups from the current transaction, by calling btrfs_create_pending_block_groups(). However here we also need to modify the extent tree in order to insert a block group item, and if the location for this new block group item happens to be in the same leaf that we were COWing earlier, we deadlock since btrfs_search_slot() tries to write lock the extent buffer that we locked before at write_one_cache_group(). We have already hit similar cases in the past and commit d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") fixed some of those cases by delaying the creation of pending block groups at the known specific spots that could lead to a deadlock. This change reworks that commit to be more generic so that we don't have to add similar logic to every possible path that can lead to a deadlock. This is done by making __btrfs_cow_block() disallowing the creation of new block groups (setting the transaction's can_flush_pending_bgs to false) before it attempts to allocate a new extent buffer for either the extent, chunk or device trees, since those are the trees that pending block creation modifies. Once the new extent buffer is allocated, it allows creation of pending block groups to happen again. This change depends on a recent patch from Josef which is not yet in Linus' tree, named "btrfs: make sure we create all new block groups" in order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") CC: stable@vger.kernel.org # 4.4+ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753 Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/ Reported-by: E V <eliventer@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
52398340 |
|
21-Aug-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: kill btrfs_clear_path_blocking Btrfs's btree locking has two modes, spinning mode and blocking mode, while searching btree, locking is always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking(). When acquiring locks, both of reader and writer need to wait for blocking readers and writers to complete before doing read_lock()/write_lock(). The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode at first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual clearing blocking job. When switching to blocking mode from spinning mode, it consists of step 1) bumping up blocking readers counter and step 2) read_unlock()/write_unlock(), this has caused serious ping-pong effect if there're a great amount of concurrent readers/writers, as waiters will be woken up and go to sleep immediately. 1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script, MNT=/mnt/btrfs mkfs.btrfs -f /dev/sdf mount /dev/def $MNT time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \ -d $MNT/0 -d $MNT/1 \ -d $MNT/2 -d $MNT/3 \ -d $MNT/4 -d $MNT/5 \ -d $MNT/6 -d $MNT/7 \ -d $MNT/8 -d $MNT/9 \ -d $MNT/10 -d $MNT/11 \ -d $MNT/12 -d $MNT/13 \ -d $MNT/14 -d $MNT/15 w/o patch: real 2m27.307s user 0m12.839s sys 13m42.831s w/ patch: real 1m2.273s user 0m15.802s sys 8m16.495s 1.1) latency histogram from funclatency[1] Overall with the patch, there're ~50% less write lock acquisition and the 95% max latency that write lock takes also reduces to ~100ms from >500ms. -------------------------------------------- w/o patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 2385222 |****************************************| 2 -> 3 : 37147 | | 4 -> 7 : 20452 | | 8 -> 15 : 13131 | | 16 -> 31 : 3877 | | 32 -> 63 : 3900 | | 64 -> 127 : 2612 | | 128 -> 255 : 974 | | 256 -> 511 : 165 | | 512 -> 1023 : 13 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6743860 |****************************************| 2 -> 3 : 2146 | | 4 -> 7 : 190 | | 8 -> 15 : 38 | | 16 -> 31 : 4 | | -------------------------------------------- w/ patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 1318454 |****************************************| 2 -> 3 : 6800 | | 4 -> 7 : 3664 | | 8 -> 15 : 2145 | | 16 -> 31 : 809 | | 32 -> 63 : 219 | | 64 -> 127 : 10 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6854317 |****************************************| 2 -> 3 : 2383 | | 4 -> 7 : 601 | | 8 -> 15 : 92 | | 2) dbench also proves the improvement, dbench -t 120 -D /mnt/btrfs 16 w/o patch: Throughput 158.363 MB/sec w/ patch: Throughput 449.52 MB/sec 3) xfstests didn't show any additional failures. One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode. [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
315bed43 |
|
13-Sep-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: handle error of get_old_root In btrfs_search_old_slot get_old_root is always used with the assumption it cannot fail. However, this is not true in rare circumstance it can fail and return null. This will lead to null point dereference when the header is read. Fix this by checking the return value and properly handling NULL by setting ret to -EIO and returning gracefully. Coverity-id: 1087503 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
98e6b1eb |
|
11-Sep-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: remove unnecessary level check in balance_level In the callchain: btrfs_search_slot() if (level != 0) setup_nodes_for_search() balance_level() It is just impossible to have level=0 in balance_level, we can drop the check. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4b6f8e96 |
|
13-Aug-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: do not unnecessarily pass write_lock_level when processing leaf As we're going to return right after the call, it's not necessary to get update the new write_lock_level from unlock_up. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4fd786e6 |
|
05-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Remove 'objectid' member from struct btrfs_root There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a79865c6 |
|
21-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove V0 extent support The v0 compat code was introduced in commit 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9 years ago, which was merged in 2.6.31. This means that the code is there to support filesystems which are _VERY_ old and if you are using btrfs on such an old kernel, you have much bigger problems. This coupled with the fact that no one is likely testing/maintining this code likely means it has bugs lurking. All things considered I think 43 kernel releases later it's high time this remnant of the past got removed. This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case we have a v0 with no support intact as a sort of safety-net. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc877d28 |
|
18-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Deduplicate extent_buffer init code When a new extent buffer is allocated there are a few mandatory fields which need to be set in order for the buffer to be sane: level, generation, bytenr, backref_rev, owner and FSID/UUID. Currently this is open coded in the callers of btrfs_alloc_tree_block, meaning it's fairly high in the abstraction hierarchy of operations. This patch solves this by simply moving this init code in btrfs_init_new_buffer, since this is the function which initializes a newly allocated extent buffer. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b167fa91 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from fixup_low_keys This argument is unused. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f9ddfd05 |
|
29-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: remove unused check of skip_locking The check is superfluous since all callers who set search_for_commit also have skip_locking set. ASSERT() is put in place to ensure skip_locking is set by new callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d80bb3f9 |
|
17-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: remove always true check in unlock_up As unlock_up() is written as for () { if (!path->locks[i]) break; ... if (... && path->locks[i]) { } } Apparently, @path->locks[i] is always true at this 'if'. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
662c653b |
|
17-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: grab write lock directly if write_lock_level is the max level Typically, when acquiring root node's lock, btrfs tries its best to get read lock and trade for write lock if @write_lock_level implies to do so. In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's write lock directly. In this particular case, the dance of acquiring read lock and then trading for write lock can be saved. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1fc28d8e |
|
17-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: move get root out of btrfs_search_slot to a helper It's good to have a helper instead of having all get-root details open-coded. The new helper locks (if necessary) and sets root node of the path. Also invert the checks to make the code flow easier to read. There is no functional change in this commit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e6a1d6fd |
|
17-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: use more straightforward extent_buffer_uptodate check If parent_transid "0" is passed to btrfs_buffer_uptodate(), btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but extent_buffer_uptodate() is preferred since we don't have to look into verify_parent_transid(). Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ca19b4a6 |
|
17-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: remove superfluous free_extent_buffer in read_block_for_search read_block_for_search() can be simplified as: tmp = find_extent_buffer(); if (tmp) return; ... free_extent_buffer(); read_tree_block(); Apparently, @tmp must be NULL at this point, free_extent_buffer() is not needed. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02a3307a |
|
15-May-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
btrfs: fix reading stale metadata blocks after degraded raid1 mounts If a btree block, aka. extent buffer, is not available in the extent buffer cache, it'll be read out from the disk instead, i.e. btrfs_search_slot() read_block_for_search() # hold parent and its lock, go to read child btrfs_release_path() read_tree_block() # read child Unfortunately, the parent lock got released before reading child, so commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had used 0 as parent transid to read the child block. It forces read_tree_block() not to check if parent transid is different with the generation id of the child that it reads out from disk. A simple PoC is included in btrfs/124, 0. A two-disk raid1 btrfs, 1. Right after mkfs.btrfs, block A is allocated to be device tree's root. 2. Mount this filesystem and put it in use, after a while, device tree's root got COW but block A hasn't been allocated/overwritten yet. 3. Umount it and reload the btrfs module to remove both disks from the global @fs_devices list. 4. mount -odegraded dev1 and write some data, so now block A is allocated to be a leaf in checksum tree. Note that only dev1 has the latest metadata of this filesystem. 5. Umount it and mount it again normally (with both disks), since raid1 can pick up one disk by the writer task's pid, if btrfs_search_slot() needs to read block A, dev2 which does NOT have the latest metadata might be read for block A, then we got a stale block A. 6. As parent transid is not checked, block A is marked as uptodate and put into the extent buffer cache, so the future search won't bother to read disk again, which means it'll make changes on this stale one and make it dirty and flush it onto disk. To avoid the problem, parent transid needs to be passed to read_tree_block(). In order to get a valid parent transid, we need to hold the parent's lock until finishing reading child. This patch needs to be slightly adapted for stable kernels, the &first_key parameter added to read_tree_block() is from 4.16+ (581c1760415c4). The fix is to replace 0 by 'gen'. Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f2f0b39 |
|
13-May-2018 |
Robbie Ko <robbieko@synology.com> |
Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting [BUG] btrfs incremental send BUG happens when creating a snapshot of snapshot that is being used by send. [REASON] The problem can happen if while we are doing a send one of the snapshots used (parent or send) is snapshotted, because snapshoting implies COWing the root of the source subvolume/snapshot. 1. When doing an incremental send, the send process will get the commit roots from the parent and send snapshots, and add references to them through extent_buffer_get(). 2. When a snapshot/subvolume is snapshotted, its root node is COWed (transaction.c:create_pending_snapshot()). 3. COWing releases the space used by the node immediately, through: __btrfs_cow_block() --btrfs_free_tree_block() ----btrfs_add_free_space(bytenr of node) 4. Because send doesn't hold a transaction open, it's possible that the transaction used to create the snapshot commits, switches the commit root and the old space used by the previous root node gets assigned to some other node allocation. Allocation of a new node will use the existing extent buffer found in memory, which we previously got a reference through extent_buffer_get(), and allow the extent buffer's content (pages) to be modified: btrfs_alloc_tree_block --btrfs_reserve_extent ----find_free_extent (get bytenr of old node) --btrfs_init_new_buffer (use bytenr of old node) ----btrfs_find_create_tree_block ------alloc_extent_buffer --------find_extent_buffer (get old node) 5. So send can access invalid memory content and have unpredictable behaviour. [FIX] So we fix the problem by copying the commit roots of the send and parent snapshots and use those copies. CallTrace looks like this: ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:1861! invalid opcode: 0000 [#1] SMP CPU: 6 PID: 24235 Comm: btrfs Tainted: P O 3.10.105 #23721 ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000 RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs] RSP: 0018:ffff88041b723b68 EFLAGS: 00010246 RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000 RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000 RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66 R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890 R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001 FS: 00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000) CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880 ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50 ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400 Call Trace: [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs] [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs] [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs] [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs] [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs] [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs] [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990 [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20 [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0 [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0 [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500 [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90 [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990 [<ffffffff81034f83>] ? do_fork+0x113/0x3b0 [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0 [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b ---[ end trace 29576629ee80b2e1 ]--- Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function") CC: stable@vger.kernel.org # 3.6+ Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1d7c514 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- sources Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d1980131 |
|
15-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: update barrier in should_cow_block Once there was a simple int force_cow that was used with the plain barriers, and then converted to a bit, so we should use the appropriate barrier helper. Other variables in the complex if condition do not depend on a barrier, so we should be fine in case the atomic barrier becomes a no-op. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
581c1760 |
|
28-Mar-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Validate child tree block's level and first key We have several reports about node pointer points to incorrect child tree blocks, which could have even wrong owner and level but still with valid generation and checksum. Although btrfs check could handle it and print error message like: leaf parent key incorrect 60670574592 Kernel doesn't have enough check on this type of corruption correctly. At least add such check to read_tree_block() and btrfs_read_buffer(), where we need two new parameters @level and @first_key to verify the child tree block. The new @level check is mandatory and all call sites are already modified to extract expected level from its call chain. While @first_key is optional, the following call sites are skipping such check: 1) Root node/leaf As ROOT_ITEM doesn't contain the first key, skip @first_key check. 2) Direct backref Only parent bytenr and level is known and we need to resolve the key all by ourselves, skip @first_key check. Another note of this verification is, it needs extra info from nodeptr or ROOT_ITEM, so it can't fit into current tree-checker framework, which is limited to node/leaf boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d9d19a01 |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: kill tree_mod_log_set_root_pointer helper A useless wrapper around tree_mod_log_insert_root that hides missing error handling. Move it to the callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e82bcfe |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: kill tree_mod_log_set_node_key helper A trivial wrapper that can be simply opencoded and makes the GFP allocation request more visible. The error handling is now moved to the callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf1d3425 |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: kill trivial wrapper tree_mod_log_eb_move The wrapper is effectively an alias for tree_mod_log_insert_move but also hides the missing error handling. To make that more visible, lift the BUG_ON to the callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b1a09f1e |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: remove trivial locking wrappers of tree mod log The wrappers are trivial and do not bring any extra value on top of the plain locking primitives. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bcd24dab |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop fs_info parameter from __tree_mod_log_oldest_root It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b6dfa35b |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: embed tree_mod_move structure to tree_mod_elem The tree_mod_move is not used anywhere and can be embedded as anonymous structure. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a446a979 |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused fs_info parameter from tree_mod_log_eb_move Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95b757c1 |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop fs_info parameter from tree_mod_log_free_eb It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
db7279a2 |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop fs_info parameter from tree_mod_log_free_eb It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e09c2efe |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop fs_info parameter from tree_mod_log_insert_key It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6074d45f |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop fs_info parameter from tree_mod_log_insert_move It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3ac6de1a |
|
05-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop fs_info parameter from tree_mod_log_set_node_key It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
448f3a17 |
|
27-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove redundant comment from btrfs_search_forward This function always sets keep_locks to 1 and saves the old value of keep_locks which is restored at the end. So there is no way it can be called without keep_locks being set. Remove comment imposing redundant requirement on callers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4271ecea |
|
13-Dec-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Improve btrfs_search_slot description Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a74b35ec |
|
08-Dec-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Rename bin_search -> btrfs_bin_search Currently there are 2 function doing binary search on btrfs nodes: bin_search and btrfs_bin_search. The latter being a simple wrapper for the former. So eliminate the wrapper and just rename bin_search to btrfs_bin_search. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ea2c7c9 |
|
12-Dec-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix out of bounds access in btrfs_search_slot When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then the level variable is going to be 7 (this is the max height of the tree). On the other hand btrfs_cow_block is always called with "level + 1" as an index into the nodes and slots arrays. This leads to an out of bounds access. Admittdely this will be benign since an OOB access of the nodes array will likely read the 0th element from the slots array, which in this case is going to be 0 (since we start CoW at the top of the tree). The OOB access into the slots array in turn will read the 0th and 1st values of the locks array, which would both be 0 at the time. However, this benign behavior relies on the fact that the path being passed hasn't been initialised, if it has already been used to query a btree then it could potentially have populated the nodes/slots arrays. Fix it by explicitly checking if we are at level 7 (the maximum allowed index in nodes/slots arrays) and explicitly call the CoW routine with NULL for parent's node/slot. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Fixes-coverity-id: 711515 Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
692826b2 |
|
21-Nov-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: handle errors while updating refcounts in update_ref_for_cow Since commit fb235dc06fa (btrfs: qgroup: Move half of the qgroup accounting time out of commit trans) the assumption that btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has been false. The qgroup operations call into btrfs_search_slot and friends and can now return the full spectrum of error codes. Fortunately, the fix here is easy since update_ref_for_cow failing is already handled so we just need to bail early with the error code. Fixes: fb235dc06fa (btrfs: qgroup: Move half of the qgroup accounting ...) Cc: <stable@vger.kernel.org> # v4.11+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Edmund Nadolski <enadolski@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
84f7d8e6 |
|
29-Sep-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass root to various extent ref mod functions We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee8c494f |
|
20-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused arguments from btrfs_changed_cb_t btrfs_changed_cb_t represents the signature of the callback being passed to btrfs_compare_trees. Currently there is only one such callback, namely changed_cb in send.c. This function doesn't really uses the first 2 parameters, i.e. the roots. Since there are not other functions implementing the btrfs_changed_cb_t let's remove the unused parameters from the prototype and implementation. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4f78750 |
|
29-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in btrfs_print_leaf, remove argument Signed-off-by: David Sterba <dsterba@suse.com>
|
#
adf02123 |
|
31-May-2017 |
David Sterba <dsterba@suse.com> |
btrfs: adjust includes after vmalloc removal As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now use the proper header that defines kvmalloc. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d9ec8c4 |
|
29-May-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface for large blocksizes") refactored btrfs_leaf_data function to take extent_buffer rather than struct btrfs_leaf. However, as it turns out the parameter being passed is never used. Furthermore this function no longer returns the leaf data but rather the offset to it. So rename the function to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_* helpers and turn it into a macro. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ removed () from the macro ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
752ade68 |
|
08-May-2017 |
Michal Hocko <mhocko@suse.com> |
treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bcc8e07f |
|
28-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: sink GFP flags parameter to tree_mod_log_insert_root All (1) callers pass the same value. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
176ef8f5 |
|
28-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: sink GFP flags parameter to tree_mod_log_insert_move All (1) callers pass the same value. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
047e5e17 |
|
14-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove BUG_ON from __tree_mod_log_insert All callers dereference the 'tm' parameter before it gets to this function, the NULL check does not make much sense here. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
263d3995 |
|
17-Feb-2017 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: try harder to migrate items to left sibling before splitting a leaf Before attempting to split a leaf we try to migrate items from the leaf to its right and left siblings. We start by trying to move items into the rigth sibling and, if the new item is meant to be inserted at the end of our leaf, we try to free from our leaf an amount of bytes equal to the number of bytes used by the new item, by setting the variable space_needed to the byte size of that new item. However if we fail to move enough items to the right sibling due to lack of space in that sibling, we then try to move items into the left sibling, and in that case we try to free an amount equal to the size of the new item from our leaf, when we need only to free an amount corresponding to the size of the new item minus the current free space of our leaf. So make sure that before we try to move items to the left sibling we do set the variable space_needed with a value corresponding to the new item's size minus the leaf's current free space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
|
#
f1e30261 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from tree_move_next_or_upnext Not needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab6a43e1 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from tree_move_down Never needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
66cb7ddb |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from __push_leaf_left Unused since long ago. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1e47eef2 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from __push_leaf_right Unused since long ago. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4961e293 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from split_item Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c302b49 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from clean_tree_block Added but never needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cda79c54 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from read_block_for_search Never used in that function. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d07b8528 |
|
30-Jan-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove unused trans in read_block_for_search @trans is not used at all, this removes it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
310712b2 |
|
18-Jan-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: constify struct btrfs_{,disk_}key wherever possible In a lot of places, it's unclear when it's safe to reuse a struct btrfs_key after it has been passed to a helper function. Constify these arguments wherever possible to make it obvious. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b4df8b6 |
|
19-Dec-2016 |
Geliang Tang <geliangtang@gmail.com> |
btrfs: use rb_entry() instead of container_of To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2ff7e61e |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: take an fs_info directly when the root is not used otherwise There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b246afa |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, add fs_info convenience variables In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da17066c |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: pull node/sector/stripe sizes out of root and into fs_info We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
58e8012c |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: add optimized version of eb to eb copy Using copy_extent_buffer is suitable for copying betwenn buffers from an arbitrary offset and deals with page boundaries. This is not necessary when doing a full extent_buffer-to-extent_buffer copy. We can utilize the copy_page helper as well. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b159fa28 |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: remove constant parameter to memset_extent_buffer and rename it The only memset we do is to 0, so sink the parameter to the function and simplify all calls. Rename the function to reflect the behaviour. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d24ee97b |
|
09-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: use new helpers to set uuids in eb Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62d1f9fe |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: remove trivial helper btrfs_find_tree_block During the time, the function has been shrunk to the point that it just calls find_extent_buffer, just passing the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
196e0249 |
|
07-Sep-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf When we're not able to get enough space through splitting leaf, we'd create a new sibling leaf instead, and it's possible that we return a zero-nritem sibling leaf and mark it dirty before it's in a consistent state. With CONFIG_BTRFS_FS_CHECK_INTEGRITY=y, the integrity check of check_leaf will report panic due to this zero-nritem non-root leaf. This removes the unnecessary btrfs_mark_buffer_dirty. Reported-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
851cd173 |
|
23-Sep-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: memset to avoid stale content in btree leaf This is an additional patch to "Btrfs: memset to avoid stale content in btree node block". This uses memset to initialize the unused space in a leaf to avoid potential stale content, which may be incurred by pushing items between sibling leaves. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0f5053eb |
|
22-Sep-2016 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: parent_start initialization cleanup Code cleanup. parent_start is initialized multiple times when it is not necessary to do so. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62e85577 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert printk(KERN_* to use pr_* calls This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5d163e0e |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: unsplit printed strings CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e2c89907 |
|
12-Sep-2016 |
Masahiro Yamada <yamada.masahiro@socionext.com> |
btrfs: squash lines for simple wrapper functions Remove unneeded variables and assignments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2309e796 |
|
23-Aug-2016 |
Luis Henriques <luis.henriques@canonical.com> |
btrfs: Fix warning "variable ‘gen’ set but not used" Variable 'gen' in reada_for_search() is not used since commit 58dc4ce43251 ("btrfs: remove unused parameter from readahead_tree_block"). This patch simply removes this variable. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
66642832 |
|
10-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_abort_transaction, drop root parameter __btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5ee5c9a |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root Now that we have a dummy fs_info associated with each test that uses a root, we don't need the DUMMY_ROOT bit anymore. This lets us make choices without needing an actual root like in e.g. btrfs_find_create_tree_block. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb770ae4 |
|
05-Jul-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix read_node_slot to return errors We use read_node_slot() to read btree node and it has two cases, a) slot is out of range, which means 'no such entry' b) we fail to read the block, due to checksum fails or corrupted content or not with uptodate flag. But we're returning NULL in both cases, this makes it return -ENOENT in case a) and return -EIO in case b), and this fixes its callers as well as btrfs_search_forward() 's caller to catch the new errors. The problem is reported by Peter Becker, and I can manage to hit the same BUG_ON by mounting my fuzz image. Reported-by: Peter Becker <floyd.net@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5e24e9af |
|
23-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: error out if generic_bin_search get invalid arguments With btrfs-corrupt-block, one can set btree node/leaf's field, if we assign a negative value to node/leaf, we can get various hangs, eg. if extent_root's nritems is -2ULL, then we get stuck in btrfs_read_block_groups() because it has a while loop and btrfs_search_slot() on extent_root will always return the first child. This lets us know what's happening and returns a EINVAL to callers instead of returning the first item. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
415b35a5 |
|
17-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix error handling in map_private_extent_buffer map_private_extent_buffer() can return -EINVAL in two different cases, 1. when the requested contents span two pages if nodesize is larger than pagesize, 2. when it detects something insane. The 2nd one used to be only a WARN_ON(1), and we decided to return a error to callers, but we didn't fix up all its callers, which will be addressed by this patch. Without this, btrfs may end up with 'general protection', ie. reading invalid memory. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
64c12921 |
|
07-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: account for non-CoW'd blocks in btrfs_abort_transaction The test for !trans->blocks_used in btrfs_abort_transaction is insufficient to determine whether it's safe to drop the transaction handle on the floor. btrfs_cow_block, informed by should_cow_block, can return blocks that have already been CoW'd in the current transaction. trans->blocks_used is only incremented for new block allocations. If an operation overlaps the blocks in the current transaction entirely and must abort the transaction, we'll happily let it clean up the trans handle even though it may have modified the blocks and will commit an incomplete operation. In the long-term, I'd like to do closer tracking of when the fs is actually modified so we can still recover as gracefully as possible, but that approach will need some discussion. In the short term, since this is the only code using trans->blocks_used, let's just switch it to a bool indicating whether any blocks were used and set it when should_cow_block returns false. Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c871b0f2 |
|
06-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: check if extent buffer is aligned to sectorsize Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b9ef22de |
|
01-Jun-2016 |
Feifei Xu <xufeifei@linux.vnet.ibm.com> |
Btrfs: self-tests: Support non-4k page size self-tests code assumes 4k as the sectorsize and nodesize. This commit fix hardcoded 4K. Enables the self-tests code to be executed on non-4k page sized systems (e.g. ppc64). Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
01327610 |
|
19-May-2016 |
Nicholas D Steeves <nsteeves@gmail.com> |
btrfs: fix string and comment grammatical issues and typos Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34d97007 |
|
16-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_std_error to btrfs_handle_fs_error btrfs_std_error() handles errors, puts FS into readonly mode (as of now). So its good idea to rename it to btrfs_handle_fs_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8f282f71 |
|
30-Mar-2016 |
David Sterba <dsterba@suse.com> |
btrfs: fallback to vmalloc in btrfs_compare_tree The allocation of node could fail if the memory is too fragmented for a given node size, practically observed with 64k. http://article.gmane.org/gmane.comp.file-systems.btrfs/54689 Reported-and-tested-by: Jean-Denis Girard <jd.girard@sysnux.pf> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e780b0d1 |
|
18-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: send: use GFP_KERNEL everywhere The send operation is not on the critical writeback path we don't need to use GFP_NOFS for allocations. All error paths are handled and the whole operation is restartable. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
298cfd36 |
|
21-Jan-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
Btrfs: Use (eb->start, seq) as search key for tree modification log In subpagesize-blocksize a page can map multiple extent buffers and hence using (page index, seq) as the search key is incorrect. For example, searching through tree modification log tree can return an entry associated with the first extent buffer mapped by the page (if such an entry exists), when we are actually searching for entries associated with extent buffers that are mapped at position 2 or more in the page. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4058b54 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup, use enum values for btrfs_path reada Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee22184b |
|
14-Dec-2015 |
Byongho Lee <bhlee.kernel@gmail.com> |
Btrfs: use linux/sizes.h to represent constants We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ce0eac2a |
|
23-Aug-2015 |
Alexandru Moise <00moses.alexander00@gmail.com> |
btrfs: Fixed dsize and last_off declarations The return values of btrfs_item_offset_nr and btrfs_item_size_nr are of type u32. To avoid mixing signed and unsigned integers we should also declare dsize and last_off to be of type u32. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4553fef |
|
25-Sep-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: consolidate btrfs_error() to btrfs_std_error() btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
93314e3b |
|
06-Aug-2015 |
Zhaolei <zhaolei@cn.fujitsu.com> |
btrfs: abort transaction on btrfs_reloc_cow_block() When btrfs_reloc_cow_block() failed in __btrfs_cow_block(), current code just return a err-value to caller, but leave new_created extent buffer exist and locked. Then subsequent code (in relocate) try to lock above eb again, and caused deadlock without any dmesg. (eb lock use wait_event(), so no lockdep message) It is hard to do recover work in __btrfs_cow_block() at this error point, but we can abort transaction to avoid deadlock and operate on unstable state.a It also helps developer to find wrong place quickly. (better than a frozen fs without any dmesg before patch) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
64c043de |
|
25-May-2015 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix up read_tree_block to return proper error The return value of read_tree_block() can confuse callers as it always returns NULL for either -ENOMEM or -EIO, so it's likely that callers parse it to a wrong error, for instance, in btrfs_read_tree_root(). This fixes the above issue. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
31e818fe |
|
20-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup, use kmalloc_array/kcalloc array helpers Convert kmalloc(nr * size, ..) to kmalloc_array that does additional overflow checks, the zeroing variant is kcalloc. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
5dfe2be7 |
|
23-Feb-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix off-by-one logic error in btrfs_realloc_node The end_slot variable actually matches the number of pointers in the node and not the last slot (which is 'nritems - 1'). Therefore in order to check that the current slot in the for loop doesn't match the last one, the correct logic is to check if 'i' is less than 'end_slot - 1' and not 'end_slot - 2'. Fix this and set end_slot to be 'nritems - 1', as it's less confusing since the variable name implies it's inclusive rather then exclusive. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
01d58472 |
|
21-Nov-2014 |
Daniel Dressler <danieru.dressler@gmail.com> |
Btrfs: disk-io: replace root args iff only fs_info used This is the 3rd independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
b7a0365e |
|
11-Nov-2014 |
Daniel Dressler <danieru.dressler@gmail.com> |
Btrfs: ctree: reduce args where only fs_info used This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. This causes programmers to ponder which root can be passed. Since only the fs_info is read affected functions can accept any root, except this is only obvious upon inspection. This patch reduces the specificty of such functions to accept the fs_info directly. This patch does not address the two functions in ctree.c (insert_ptr, and split_item) which only use root for BUG_ONs in ctree.c This patch affects the following functions: 1) fixup_low_keys 2) btrfs_set_item_key_safe Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
95449a16 |
|
14-Jan-2015 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: insert_new_root: Fix lock type of the extent buffer. btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has been taken. Hence assign the appropriate value to path->locks[level]. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
a8df6fe6 |
|
19-Jan-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix setup_leaf_for_split() to avoid leaf corruption We were incorrectly detecting when the target key didn't exist anymore after releasing the path and re-searching the tree. This could make us split or duplicate (btrfs_split_item() and btrfs_duplicate_item() are its only callers at the moment) an item when we should not. For the case of duplicating an item, we currently only duplicate checksum items (csum tree) and file extent items (fs/subvol trees). For the checksum items we end up overriding the item completely, but for file extent items we update only some of their fields in the copy (done in __btrfs_drop_extents), which means we can end up having a logical corruption for some values. Also for the case where we duplicate a file extent item it will make us produce a leaf with a wrong key order, as btrfs_duplicate_item() advances us to the next slot and then its caller sets a smaller key on the new item at that slot (like in __btrfs_drop_extents() e.g.). Alternatively if the tree search in setup_leaf_for_split() leaves with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end up accessing beyond the leaf's end (when we check if the item's size has changed) and make our caller insert an item at the invalid slot btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory access if the leaf is full or nearly full. This issue has been present since the introduction of this function in 2009: Btrfs: Add btrfs_duplicate_item commit ad48fd754676bfae4139be1a897b1ea58f9aaf21 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e7070be1 |
|
16-Dec-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: change how we track dirty roots I've been overloading root->dirty_list to keep track of dirty roots and which roots need to have their commit roots switched at transaction commit time. This could cause us to lose an update to the root which could corrupt the file system. To fix this use a state bit to know if the root is dirty, and if it isn't set we go ahead and move the root to the dirty list. This way if we re-dirty the root after adding it to the switch_commit list we make sure to update it. This also makes it so that the extent root is always the last root on the dirty list to try and keep the amount of churn down at this point in the commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1d4c08e0 |
|
02-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: expand btrfs_find_item if found_key is NULL If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper for simple btrfs_search_slot. After we've removed all such callers, passing a NULL key is not valid anymore. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
381cf658 |
|
02-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: fix leak of path in btrfs_find_item If btrfs_find_item is called with NULL path it allocates one locally but does not free it. Affected paths are inserting an orphan item for a file and for a subvol root. Move the path allocation to the callers. CC: <stable@vger.kernel.org> # 3.14+ Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality") Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
3f556f78 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: unify extent buffer allocation api Make the extent buffer allocation interface consistent. Cloned eb will set a valid fs_info. For dummy eb, we can drop the length parameter and set it from fs_info. The built-in sanity checks may pass a NULL fs_info that's queried for nodesize, but we know it's 4096. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
d3e46fea |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: sink blocksize parameter to readahead_tree_block All callers pass nodesize. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
5f5bc6b1 |
|
09-Nov-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: make xattr replace operations atomic Replacing a xattr consists of doing a lookup for its existing value, delete the current value from the respective leaf, release the search path and then finally insert the new value. This leaves a time window where readers (getxattr, listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs, so this has security implications. This change also fixes 2 other existing issues which were: *) Deleting the old xattr value without verifying first if the new xattr will fit in the existing leaf item (in case multiple xattrs are packed in the same item due to name hash collision); *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't exist but we have have an existing item that packs muliple xattrs with the same name hash as the input xattr. In this case we should return ENOSPC. A test case for xfstests follows soon. Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace implementation. Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f82c458a |
|
19-Nov-2014 |
Chris Mason <clm@fb.com> |
btrfs: fix lockups from btrfs_clear_path_blocking The fair reader/writer locks mean that btrfs_clear_path_blocking needs to strictly follow lock ordering rules even when we already have blocking locks on a given path. Before we can clear a blocking lock on the path, we need to make sure all of the locks have been converted to blocking. This will remove lock inversions against anyone spinning in write_lock() against the buffers we're trying to get read locks on. These inversions didn't exist before the fair read/writer locks, but now we need to be more careful. We papered over this deadlock in the past by changing btrfs_try_read_lock() to be a true trylock against both the spinlock and the blocking lock. This was slower, and not sufficient to fix all the deadlocks. This patch adds a btrfs_tree_read_lock_atomic(), which basically means get the spinlock but trylock on the blocking lock. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: Patrick Schmid <schmid@phys.ethz.ch> cc: stable@vger.kernel.org #v3.15+
|
#
b99d9a6a |
|
25-Sep-2014 |
Fabian Frederick <fabf@skynet.be> |
btrfs: fix shadow warning on cmp cmp was declared twice in btrfs_compare_trees resulting in a shadow warning. This patch renames second internal variable. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fccb84c9 |
|
29-Sep-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: move checks for DUMMY_ROOT into a helper Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
7ec20afb |
|
24-Jul-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: new define for the inline extent data start Use a common definition for the inline data start so we don't have to open-code it and introduce bugs like "Btrfs: fix wrong max inline data size limit" fixed. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
4d75f8a9 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove blocksize from btrfs_alloc_free_block and rename Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free + _tree_block family. The parameter blocksize was set to the metadata block size, directly or indirectly. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
0308af44 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove unused parameter blocksize from btrfs_find_tree_block Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
ce86cd59 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove parameter blocksize from read_tree_block We know the tree block size, no need to pass it around. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
58dc4ce4 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove unused parameter from readahead_tree_block The parent_transid parameter has been unused since its introduction in ca7a79ad8dbe2466 ("Pass down the expected generation number when reading tree blocks"). In reada_tree_block, it was even wrongly set to leafsize. Transid check is done in the proper read and readahead ignores errors. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
f98de9b9 |
|
04-Aug-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: make btrfs_search_forward return with nodes unlocked None of the uses of btrfs_search_forward() need to have the path nodes (level >= 1) read locked, only the leaf needs to be locked while the caller processes it. Therefore make it return a path with all nodes unlocked, except for the leaf. This change is motivated by the observation that during a file fsync we repeatdly call btrfs_search_forward() and process the returned leaf while upper nodes of the returned path (level >= 1) are read locked, which unnecessarily blocks other tasks that want to write to the same fs/subvol btree. Therefore instead of modifying the fsync code to unlock all nodes with level >= 1 immediately after calling btrfs_search_forward(), change btrfs_search_forward() to do it, so that it benefits all callers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
160f4089 |
|
28-Jul-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: avoid unnecessary switch of path locks to blocking mode If we need to cow a node, increase the write lock level and retry the tree search, there's no point of changing the node locks in our path to blocking mode, as we only waste time and unnecessarily wake up other tasks waiting on the spinning locks (just to block them again shortly after) because we release our path before repeating the tree search. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
24cdc847 |
|
28-Jul-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: unlock nodes earlier when inserting items in a btree In ctree.c:setup_items_for_insert(), we can unlock all nodes in our path before we process the leaf (shift items and data, adjust data offsets, etc). This allows for better btree concurrency, as we're often holding a write lock on at least the node at level 1. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
707e8a07 |
|
04-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: use nodesize everywhere, kill leafsize The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e339a6b0 |
|
02-Jul-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: __btrfs_mod_ref should always use no_quota Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't understand how snapshot delete was using it and assumed that we needed the quota operations there. With Mark's work this has turned out to be not the case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the argument and make __btrfs_mod_ref call it's process function with no_quota set always. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0b43e04f |
|
08-Jun-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix leaf corruption after __btrfs_drop_extents Several reports about leaf corruption has been floating on the list, one of them points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted after __btrfs_drop_extents(), it's really a rare case but it does exist. The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents(). So in btrfs_next_leaf(), we release the current path to re-search the last key of the leaf for locating next leaf, and we've taken it into account that there might be balance operations between leafs during this 'unlock and re-lock' dance, so we check the path again and advance it if there are now more items available. But things are a bit different if that last key happens to be removed and balance gets a bigger key as the last one, and btrfs_search_slot will return it with ret > 0, IOW, nothing change in this leaf except the new last key, then we think we're okay because there is no more item balanced in, fine, we thinks we can go to the next leaf. However, we should return that bigger key, otherwise we deserve leaf corruption, for example, in endio, skipping that key means that __btrfs_drop_extents() thinks it has dropped all extent matched the required range and finish_ordered_io can safely insert a new extent, but it actually doesn't and ends up a leaf corruption. One may be asking that why our locking on extent io tree doesn't work as expected, ie. it should avoid this kind of race situation. But in __btrfs_drop_extents(), we don't always find extents which are included within our locking range, IOW, extents can start before our searching start, in this case locking on extent io tree doesn't protect us from the race. This takes the special case into account. Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
337c6f68 |
|
09-Jun-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item We might have had an item with the previous key in the tree right before we released our path. And after we released our path, that item might have been pushed to the first slot (0) of the leaf we were holding due to a tree balance. Alternatively, an item with the previous key can exist as the only element of a leaf (big fat item). Therefore account for these 2 cases, so that our callers (like btrfs_previous_item) don't miss an existing item with a key matching the previous key we computed above. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
faa2dbf0 |
|
07-May-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add sanity tests for new qgroup accounting code This exercises the various parts of the new qgroup accounting code. We do some basic stuff and do some things with the shared refs to make sure all that code works. I had to add a bunch of infrastructure because I needed to be able to insert items into a fake tree without having to do all the hard work myself, hopefully this will be usefull in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fcebe456 |
|
13-May-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: rework qgroup accounting Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
27cdeb70 |
|
02-Apr-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3f8a18cc |
|
28-Mar-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: hold the commit_root_sem when getting the commit root during send We currently rely too heavily on roots being read-only to save us from just accessing root->commit_root. We can easily balance blocks out from underneath a read only root, so to save us from getting screwed make sure we only access root->commit_root under the commit root sem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9e351cc8 |
|
13-Mar-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: remove transaction from send Lets try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks, Reportedy-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6baa4293 |
|
20-Feb-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: correctly determine if blocks are shared in btrfs_compare_trees Just comparing the pointers (logical disk addresses) of the btree nodes is not completely bullet proof, we have to check if their generation numbers match too. It is guaranteed that a COW operation will result in a block with a different logical disk address than the original block's address, but over time we can reuse that former logical disk address. For example, creating a 2Gb filesystem on a loop device, and having a script running in a loop always updating the access timestamp of a file, resulted in the same logical disk address being reused for the same fs btree block in about only 4 minutes. This could make us skip entire subtrees when doing an incremental send (which is currently the only user of btrfs_compare_trees). However the odds of getting 2 blocks at the same tree level, with the same logical disk address, equal first slot keys and different generations, should hopefully be very low. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
23c6bf6a |
|
11-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix btrfs_search_slot_for_read backwards iteration If the current path's leaf slot is 0, we do search for the previous leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a value corresponding to the number of items - 1 of the former leaf. Fix this by using the slot set by btrfs_prev_leaf, decrementing it by 1 if it's equal to the leaf's number of items. Use of btrfs_search_slot_for_read() for backward iteration is used in particular by the send feature, which could miss items when the input leaf has less items than its previous leaf. This could be reproduced by running btrfs/007 from xfstests in a loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ade2e0b3 |
|
12-Jan-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: fix to search previous metadata extent item since skinny metadata There is a bug that using btrfs_previous_item() to search metadata extent item. This is because in btrfs_previous_item(), we need type match, however, since skinny metada was introduced by josef, we may mix this two types. So just use btrfs_previous_item() is not working right. To keep btrfs_previous_item() like normal tree search, i introduce another function btrfs_previous_extent_item(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
eb653de1 |
|
23-Dec-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: reduce btree node locking duration on item update If we do a btree search with the goal of updating an existing item without changing its size (ins_len == 0 and cow == 1), then we never need to hold locks on upper level nodes (even when slot == 0) after we COW their child nodes/leaves, as we won't have node splits or merges in this scenario (that is, no key additions, removals or shifts on any nodes or leaves). Therefore release the locks immediately after COWing the child nodes/leaves while navigating the btree, even if their parent slot is 0, instead of returning a path to the caller with those nodes locked, which would get released only when the caller releases or frees the path (or if it calls btrfs_unlock_up_safe). This is a common scenario, for example when updating inode items in fs trees and block group items in the extent tree. The following benchmarks were performed on a quad core machine with 32Gb of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more quickly). sysbench --test=fileio --file-num=131072 --file-total-size=8G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare|run] Before this change: 49.85Mb/s (average of 5 runs) After this change: 50.38Mb/s (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
efe120a0 |
|
20-Dec-2013 |
Frank Holton <fholton@gmail.com> |
Btrfs: convert printk to btrfs_ and fix BTRFS prefix Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5de865ee |
|
20-Dec-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix tree mod logging While running the test btrfs/004 from xfstests in a loop, it failed about 1 time out of 20 runs in my desktop. The failure happened in the backref walking part of the test, and the test's error message was like this: btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad) --- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000 +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000 @@ -1,3 +1,8 @@ QA output created by 004 *** test backref walking -*** done +unexpected output from + /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got: + ... (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff) Ran: btrfs/004 Failures: btrfs/004 Failed 1 of 1 tests But immediately after the test finished, the btrfs inspect-internal command returned the expected output: $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 inode 405 offset 454656 root 258 inode 405 offset 454656 root 5 It turned out this was because the btrfs_search_old_slot() calls performed during backref walking (backref.c:__resolve_indirect_ref) were not finding anything. The reason for this turned out to be that the tree mod logging code was not logging some node multi-step operations atomically, therefore btrfs_search_old_slot() callers iterated often over an incomplete tree that wasn't fully consistent with any tree state from the past. Besides missing items, this often (but not always) resulted in -EIO errors during old slot searches, reported in dmesg like this: [ 4299.933936] ------------[ cut here ]------------ [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]() [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007 [ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70 [ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000 [ 4299.933987] Call Trace: [ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76 [ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0 [ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20 [ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs] [ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50 [ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs] [ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs] [ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs] [ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs] [ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs] [ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs] [ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00 [ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40 [ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934104] ---[ end trace 48f0cfc902491414 ]--- [ 4299.934378] btrfs bad fsid on block 0 These tree mod log operations that must be performed atomically, tree_mod_log_free_eb, tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to be performed atomically before the following commit: c8cc6341653721b54760480b0d0d9b5f09b46741 (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations) That change removed the atomicity of such operations. This patch restores the atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem structures, so it has to do the allocations using GFP_NOFS before acquiring the mod log lock. This issue has been experienced by several users recently, such as for example: http://www.spinics.net/lists/linux-btrfs/msg28574.html After running the btrfs/004 test for 679 consecutive iterations with this patch applied, I didn't ran into the issue anymore. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
78357766 |
|
12-Dec-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: return immediately if tree log mod is not necessary In ctree.c:tree_mod_log_set_node_key() we were calling __tree_mod_log_insert_key() even when the modification doesn't need to be logged. This would allocate a tree_mod_elem structure, fill it and pass it to __tree_mod_log_insert(), which would just acquire the tree mod log write lock and then free the tree_mod_elem structure and return (that is, a no-op). Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert() which just returns immediately if the modification doesn't need to be logged (without allocating the structure, fill it, acquire write lock, free structure). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2ef1fed2 |
|
04-Dec-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: more efficient push_leaf_right Currently when finding the leaf to insert a key into a btree, if the leaf doesn't have enough space to store the item we attempt to move off some items from our leaf to its right neighbor leaf, and if this fails to create enough free space in our leaf, we try to move off more items to the left neighbor leaf as well. When trying to move off items to the right neighbor leaf, if it has enough room to store the new key but not not enough room to move off at least one item from our target leaf, __push_leaf_right returns 1 and we have to attempt to move items to the left neighbor (push_leaf_left function) without touching the right neighbor leaf. For the case where the right leaf has enough room to store at least 1 item from our leaf, we end up modifying (and dirtying) both our leaf and the right leaf. This is non-optimal for the case where the new key is greater than any key in our target leaf because it can be inserted at slot 0 of the right neighbor leaf and we don't need to touch our leaf at all nor to attempt to move off items to the left neighbor leaf. Therefore this change just selects the right neighbor leaf as our new target leaf if it has enough room for the new key without modifying our initial target leaf - we do this only if the new key is higher than any key in the initial target leaf. While running the following test, push_leaf_right was called by split_leaf 4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this special case (right leaf has enough room and new key is higher than any key in the initial target leaf). Test: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare|run] Results: sequential writes Throughput before this change: 65.71Mb/sec (average of 10 runs) Throughput after this change: 66.58Mb/sec (average of 10 runs) random writes Throughput before this change: 10.75Mb/sec (average of 10 runs) Throughput after this change: 11.56Mb/sec (average of 10 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5a4267ca |
|
24-Nov-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: try harder to avoid btree node splits When attempting to move items from our target leaf to its neighbor leaves (right and left), we only need to free data_size - free_space bytes from our leaf in order to add the new item (which has size of data_size bytes). Therefore attempt to move items to the right and left leaves if they have at least data_size - free_space bytes free, instead of data_size bytes free. After 5 runs of the following test, I got a smaller number of btree node splits overall: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=8192 --max-requests=100000 --file-io-mode=sync Before this change: * 6171 splits (average of 5 test runs) * 61.508Mb/sec of throughput (average of 5 test runs) After this change: * 6036 splits (average of 5 test runs) * 63.533Mb/sec of throughput (average of 5 test runs) An ideal test would not just have multiple threads/processes writing to a file (insertion of file extent items) but also do other operations that result in insertion of items with varied sizes, like file/directory creations, creation of links, symlinks, xattrs, etc. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3f870c28 |
|
04-Nov-2013 |
Kelley Nielsen <kelleynnn@gmail.com> |
btrfs: expand btrfs_find_item() to include find_orphan_item functionality This is the third step in bootstrapping the btrfs_find_item interface. The function find_orphan_item(), in orphan.c, is similar to the two functions already replaced by the new interface. It uses two parameters, which are already present in the interface, and is nearly identical to the function brought in in the previous patch. Replace the two calls to find_orphan_item() with calls to btrfs_find_item(), with the defined objectid and type that was used internally by find_orphan_item(), a null path, and a null key. Add a test for a null path to btrfs_find_item, and if it passes, allocate and free the path. Finally, remove find_orphan_item(). Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
75ac2dd9 |
|
04-Nov-2013 |
Kelley Nielsen <kelleynnn@gmail.com> |
btrfs: expand btrfs_find_item() to include find_root_ref functionality This patch is the second step in bootstrapping the btrfs_find_item interface. The btrfs_find_root_ref() is similar to the former __inode_info(); it accepts four of its parameters, and duplicates the first half of its functionality. Replace the one former call to btrfs_find_root_ref() with a call to btrfs_find_item(), along with the defined key type that was used internally by btrfs_find_root ref, and a null found key. In btrfs_find_item(), add a test for the null key at the place where the functionality of btrfs_find_root_ref() ends; btrfs_find_item() then returns if the test passes. Finally, remove btrfs_find_root_ref(). Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e33d5c3d |
|
04-Nov-2013 |
Kelley Nielsen <kelleynnn@gmail.com> |
btrfs: bootstrap generic btrfs_find_item interface There are many btrfs functions that manually search the tree for an item. They all reimplement the same mechanism and differ in the conditions that they use to find the item. __inode_info() is one such example. Zach Brown proposed creating a new interface to take the place of these functions. This patch is the first step to creating the interface. A new function, btrfs_find_item, has been added to ctree.c and prototyped in ctree.h. It is identical to __inode_info, except that the order of the parameters has been rearranged to more closely those of similar functions elsewhere in the code (now, root and path come first, then the objectid, offset and type, and the key to be filled in last). __inode_info's callers have been set to call this new function instead, and __inode_info itself has been removed. Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
16e7549f |
|
21-Oct-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: incompatible format change to remove hole extents Btrfs has always had these filler extent data items for holes in inodes. This has made somethings very easy, like logging hole punches and sending hole punches. However for large holey files these extent data items are pure overhead. So add an incompatible feature to no longer add hole extents to reduce the amount of metadata used by these sort of files. This has a few changes for logging and send obviously since they will need to detect holes and log/send the holes if there are any. I've tested this thoroughly with xfstests and it doesn't cause any issues with and without the incompat format set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
67871254 |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Fix checkpatch.pl warning of spacing issues Fix spacing issues detected via checkpatch.pl in accordance with the kernel style guidelines. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
fae7f21c |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Use WARN_ON()'s return value in place of WARN_ON(1) Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source code that outputs a more descriptive warnings. Also fix the styling warning of redundant braces that came up as a result of this fix. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e8b0d724 |
|
14-Oct-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix btrfs_prev_leaf() previous key computation If we decrement the key type, we must reset its offset to the largest possible offset (u64)-1. If we decrement the key's objectid, then we must reset the key's type and offset to their largest possible values, (u8)-1 and (u64)-1 respectively. Not doing so can make us miss an items in the tree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
498456d3 |
|
08-Oct-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: kill unused code in btrfs_search_forward After commit de78b51a2852bddccd6535e9e12de65f92787a1e (btrfs: remove cache only arguments from defrag path), @blockptr is no more used. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6174d3cb |
|
01-Oct-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: remove unused max_key arg from btrfs_search_forward It is not used for anything. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0a4e5586 |
|
24-Sep-2013 |
Ross Kirk <ross.kirk@gmail.com> |
btrfs: remove unused parameter from btrfs_header_fsid Remove unused parameter, 'eb'. Unused since introduction in 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee Updated to be rebased against current upstream and correct diff supplied this time! Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d4b4087c |
|
24-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: do a full search everytime in btrfs_search_old_slot While running some snashot aware defrag tests I noticed I was panicing every once and a while in key_search. This is because of the optimization that says if we find a key at slot 0 it will be at slot 0 all the way down the rest of the tree. This isn't the case for btrfs_search_old_slot since it will likely replay changes to a buffer if something has changed since we took our sequence number. So short circuit this optimization by setting prev_cmp to -1 every time we call key_search so we will do our normal binary search. With this patch I am no longer seeing the panics I was seeing before. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
dd3cc16b |
|
16-Sep-2013 |
Ross Kirk <ross.kirk@gmail.com> |
btrfs: drop unused parameter from btrfs_item_nr Remove unused eb parameter from btrfs_item_nr Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
83d4cfd4 |
|
30-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fixup error handling in btrfs_reloc_cow If we failed to actually allocate the correct size of the extent to relocate we will end up in an infinite loop because we won't return an error, we'll just move on to the next extent. So fix this up by returning an error, and then fix all the callers to return an error up the stack rather than BUG_ON()'ing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d7396f07 |
|
30-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: optimize key searches in btrfs_search_slot When the binary search returns 0 (exact match), the target key will necessarily be at slot 0 of all nodes below the current one, so in this case the binary search is not needed because it will always return 0, and we waste time doing it, holding node locks for longer than necessary, etc. Below follow histograms with the times spent on the current approach of doing a binary search when the previous binary search returned 0, and times for the new approach, which directly picks the first item/child node in the leaf/node. Current approach: Count: 6682 Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429 Percentiles: 90th: 124.000; 95th: 145.000; 99th: 206.000 35.000 - 61.080: 1235 ################ 61.080 - 106.053: 4207 ##################################################### 106.053 - 183.606: 1122 ############## 183.606 - 317.341: 111 # 317.341 - 547.959: 6 | 547.959 - 8370.000: 1 | Approach proposed by this patch: Count: 6682 Range: 6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev: 7.160 Percentiles: 90th: 23.000; 95th: 27.000; 99th: 40.000 6.000 - 8.418: 58 # 8.418 - 11.670: 1149 ######################### 11.670 - 16.046: 2418 ##################################################### 16.046 - 21.934: 2098 ############################################## 21.934 - 29.854: 744 ################ 29.854 - 40.511: 154 ### 40.511 - 54.848: 41 # 54.848 - 74.136: 5 | 74.136 - 100.087: 9 | 100.087 - 135.000: 6 | These samples were captured during a run of the btrfs tests 001, 002 and 004 in the xfstests, with a leaf/node size of 4Kb. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b308bc2f |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
fba6aa75 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_header_fsid() return unsigned long Internally, btrfs_header_fsid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c1c9ff7c |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Remove superfluous casts from u64 to unsigned long long u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
35a3621b |
|
14-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: get rid of sparse warnings make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__ I tried to filter out the warnings for which patches have already been sent to the mailing list, pending for inclusion in btrfs-next. All these changes should be obviously safe. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ba5e8f2e |
|
16-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix send issues related to inode number reuse If you are sending a snapshot and specifying a parent snapshot we will walk the trees and figure out where they differ and send the differences only. The way we check for differences are if the leaves aren't the same and if the keys are not the same within the leaves. So if neither leaf is the same (ie the leaf has been cow'ed from the parent snapshot) we walk each item in the send root and check it against the parent root. If the items match exactly then we don't do anything. This doesn't quite work for inode refs, since they will just have the name and the parent objectid. If you move the file from a directory and then remove that directory and re-create a directory with the same inode number as the old directory and then move that file back into that directory we will assume that nothing changed and you will get errors when you try to receive. In order to fix this we need to do extra checking to see if the inode ref really is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the items match. Then if the key type is an inode ref we can do some extra checking, otherwise we just keep processing. The extra checking is to look up the generation of the directory in the parent volume and compare it to the generation of the send volume. If they match then they are the same directory and we are good to go. If they don't we have to add them to the changed refs list. This means we have to track the generation of the ref we're trying to lookup when we iterate all the refs for a particular inode. So in the case of looking for new refs we have to get the generation from the parent volume, and in the case of looking for deleted refs we have to get the generation from the send volume to compare with. There was also the issue of using a ulist to keep track of the directories we needed to check. Because we can get a deleted ref and a new ref for the same inode number the ulist won't work since it indexes based on the value. So instead just dup any directory ref we find and add it to a local list, and then process that list as normal and do away with using a ulist for this altogether. Before we would fail all of the tests in the far-progs that related to moving directories (test group 32). With this patch we now pass these tests, and all of the tests in the far-progs send testing suite. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9ec72677 |
|
07-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: stop using GFP_ATOMIC when allocating rewind ebs There is no reason we can't just set the path to blocking and then do normal GFP_NOFS allocations for these extent buffers. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
db7f3436 |
|
07-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: deal with enomem in the rewind path We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret) in this path. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c8cc6341 |
|
01-Jul-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: stop using GFP_ATOMIC for the tree mod log allocations Previously we held the tree mod lock when adding stuff because we use it to check and see if we truly do want to track tree modifications. This is admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty hard and often is not nice. So instead do our basic checks to see if we don't need to track modifications, and if those pass then do our allocation, and then when we go to insert the new modification check if we still care, and if we don't just free up our mod and return. Otherwise we're good to go and we can carry on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b5b9b5b3 |
|
03-Jul-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix extent buffer leak after backref walking commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks) takes an extra increment on the reference of allocated dummy extent buffer, so now we cannot free this dummy one, and end up with extent buffer leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
7fb7d76f |
|
01-Jul-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: only do the tree_mod_log_free_eb if this is our last ref There is another bug in the tree mod log stuff in that we're calling tree_mod_log_free_eb every single time a block is cow'ed. The problem with this is that if this block is shared by multiple snapshots we will call this multiple times per block, so if we go to rewind the mod log for this block we'll BUG_ON() in __tree_mod_log_rewind because we try to rewind a free twice. We only want to call tree_mod_log_free_eb if we are actually freeing the block. With this patch I no longer hit the panic in __tree_mod_log_rewind. Thanks, Cc: stable@vger.kernel.org Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
f1ca7e98 |
|
29-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: hold the tree mod lock in __tree_mod_log_rewind We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk forward in the tree mod entries, otherwise we'll end up with random entries and trip the BUG_ON() at the front of __tree_mod_log_rewind. This fixes the panics people were seeing when running find /whatever -type f -exec btrfs fi defrag {} \; Thansk, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
0b08851f |
|
17-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: optimize reada_for_balance This patch does two things. First we no longer explicitly read in the blocks we're trying to readahead. For things like balance_level we may never actually use the blocks so this just adds uneeded latency, and balance_level and split_node will both read in the blocks they care about explicitly so if the blocks need to be waited on it will be done there. Secondly we no longer drop the path if we do readahead, we just set the path blocking before we call reada_for_balance() and then we're good to go. Hopefully this will cut down on the number of re-searches. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
bdf7c00e |
|
17-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: optimize read_block_for_search This patch does two things, first it only does one call to btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then again with gen specified. The other thing is to call btrfs_read_buffer() on the buffer we've found instead of dropping it and then calling read_tree_block(). This will keep us from doing yet another radix tree lookup for a buffer we've already found. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
33157e05 |
|
21-May-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: check if leaf's parent exists before pushing items around During splitting a leaf, pushing items around to hopefully get some space only works when we have a parent, ie. we have at least one sibling leaf. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
fdd99c72 |
|
21-May-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: dont do log_removal in insert_new_root As for splitting a leaf, root is just the leaf, and tree mod log does not apply on leaf, so in this case, we don't do log_removal. As for splitting a node, the old root is kept as a normal node and we have nicely put records in tree mod log for moving keys and items, so in this case we don't do that either. As above, insert_new_root can get rid of log_removal. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
8f69dbd2 |
|
07-May-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: fix a comment The size parameter to btrfs_extend_item() is the number of bytes to add to the item, not the size of the item after the operation (like it is for btrfs_truncate_item(), there the size parameter is not the number of bytes to take away, but the total size of the item after truncation). Fix it in the comment. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
7e21f14d |
|
06-May-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
btrfs: fix btrfs_extend_item() comment The size parameter to btrfs_extend_item() is the number of bytes to add to the item, not the size of the item after the operation (like it is for btrfs_truncate_item(), there the size parameter is not the number of bytes to take away, but the total size of the item after truncation). Fix it in the comment. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
b1c79e09 |
|
09-May-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: handle running extent ops with skinny metadata Chris hit a bug where we weren't finding extent records when running extent ops. This is because we use the delayed_ref_head when running the extent op, which means we can't use the ->type checks to see if we are metadata. We also lose the level of the metadata we are working on. So to fix this we can just check the ->is_data section of the extent_op, and we can store the level of the buffer we were modifying in the extent_op. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
48a3b636 |
|
25-Apr-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: make static code static & remove dead code Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
fc36ed7e |
|
24-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log Sequence numbers for delayed refs have been introduced in the first version of the qgroup patch set. To solve the problem of find_all_roots on a busy file system, the tree mod log was introduced. The sequence numbers for that were simply shared between those two users. However, at one point in qgroup's quota accounting, there's a statement accessing the previous sequence number, that's still just doing (seq - 1) just as it would have to in the very first version. To satisfy that requirement, this patch makes the sequence number counter 64 bit and splits it into a major part (used for qgroup sequence number counting) and a minor part (incremented for each tree modification in the log). This enables us to go exactly one major step backwards, as required for qgroups, while still incrementing the sequence counter for tree mod log insertions to keep track of their order. Keeping them in a single variable means there's no need to change all the code dealing with comparisons of two sequence numbers. The sequence number is reset to 0 on commit (not new in this patch), which ensures we won't overflow the two 32 bit counters. Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs from the tree mod log code may happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
416bc658 |
|
23-Apr-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix all callers of read_tree_block We kept leaking extent buffers when mounting a broken file system and it turns out it's because not everybody uses read_tree_block properly. You need to check and make sure the extent_buffer is uptodate before you use it. This patch fixes everybody who calls read_tree_block directly to make sure they check that it is uptodate and free it and return an error if it is not. With this we no longer leak EB's when things go horribly wrong. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
4b90c680 |
|
15-Apr-2013 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: remove unused argument of btrfs_extend_item() Argument 'trans' is not used in btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
afe5fea7 |
|
15-Apr-2013 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: cleanup of function where fixup_low_keys() is called If argument 'trans' is unnecessary in the function where fixup_low_keys() is called, 'trans' is deleted. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
d6a0a126 |
|
15-Apr-2013 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: remove unused argument of fixup_low_keys() Argument 'trans' is not used in fixup_low_keys(). So, remove it. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
47fb091f |
|
13-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix unlock after free on rewinded tree blocks When tree_mod_log_rewind decides to make a copy of the current tree buffer for its modifications, it subsequently freed the buffer before unlocking it. Obviously, those operations are required in reverse order. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
30b0463a |
|
13-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix accessing the root pointer in tree mod log functions The tree mod log functions were accessing root->node->... directly, without use of btrfs_root_node() or explicit rcu locking. This could lead to an extent buffer reference being leaked and another reference being freed too early when preemtion was enabled. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
90f8d62e |
|
13-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix tree mod log regression on root split operations Commit d9abbf1c changed tree mod log locking around ROOT_REPLACE operations. When a tree root is split, however, we were logging removal of all elements from the root node before logging removal of half of the elements for the split operation. This leads to a BUG_ON when rewinding. This commit removes the erroneous logging of removal of all elements. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
09a2a8f9 |
|
05-Apr-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix bad extent logging A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
3173a18f |
|
07-Mar-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add a incompatible format change for smaller metadata extent refs We currently store the first key of the tree block inside the reference for the tree block in the extent tree. This takes up quite a bit of space. Make a new key type for metadata which holds the level as the offset and completely removes storing the btrfs_tree_block_info inside the extent ref. This reduces the size from 51 bytes to 33 bytes per extent reference for each tree block. In practice this results in a 30-35% decrease in the size of our extent tree, which means we COW less and can keep more of the extent tree in memory which makes our heavy metadata operations go much faster. This is not an automatic format change, you must enable it at mkfs time or with btrfstune. This patch deals with having metadata stored as either the old format or the new format so it is easy to convert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
d9abbf1c |
|
20-Mar-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log To resolve backrefs, ROOT_REPLACE operations in the tree mod log are required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation. Therefore, those operations must be enclosed by tree_mod_log_write_lock() and tree_mod_log_write_unlock() calls. Those calls are private to the tree_mod_log_* functions, which means that removal of the elements of an old root node must be logged from tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5 (Btrfs: fix a tree mod logging issue for root replacement operations). This fixes the brand-new version of xfstest 276 as of commit cfe73f71. Cc: stable@vger.kernel.org Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
de78b51a |
|
31-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: remove cache only arguments from defrag path The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
1c697d4a |
|
30-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: annotate intentional switch case fallthroughs This keeps static checkers happy. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2a745b14 |
|
13-Feb-2013 |
Arne Jansen <sensille@gmx.net> |
Btrfs: fix crash in log replay with qgroups enabled When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a sanity-check of the number of items against the maximum possible number. It calculates that number with the nodesize of fs_root. Unfortunately fs_root is not yet set at this stage. So instead use the nodesize from tree_root, which is already initialized. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
57ba86c0 |
|
18-Dec-2012 |
Chris Mason <chris.mason@fusionio.com> |
Revert "Btrfs: reorder tree mod log operations in deleting a pointer" This reverts commit 6a7a665d78c5dd8bc76a010648c4e7d84517ab5a. This was bug was fixed differently in 3.6, so this commit isn't needed. Conflicts: fs/btrfs/ctree.c Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
4c3e6969 |
|
18-Dec-2012 |
Chris Mason <chris.mason@fusionio.com> |
Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems" This reverts commit 95c80bb1f6b24b57058d971ed252b2c1c5121b51. The bug addressed by this commit was fixed differently back in 3.6 Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5124e00e |
|
07-Nov-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: only unlock and relock if we have to I noticed while doing fsync tests that we were always dropping the path and re-searching when we first cow the log root even though we've already gotten the write lock on the root. That's because we don't take into account that there might not be a parent node, so fix the check to make sure there is actually a parent node before we undo all of this work for nothing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
41be1f3b |
|
15-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: optimize leaf_space_used This gets called at least 4 times for every level while adding an object, and it involves 3 kmapping calls, which on my box take about 5us a piece. So instead use a token, which brings us down to 1 kmap call and makes this function take 1/3 of the time per call. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
70c8a91c |
|
11-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: log changed inodes based on the extent map tree We don't really need to copy extents from the source tree since we have all of the information already available to us in the extent_map tree. So instead just write the extents straight to the log tree and don't bother to copy the extent items from the source tree. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d6393786 |
|
12-Dec-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add path->really_keep_locks You'd think path->keep_locks would keep all the locks wouldn't you? You'd be wrong. It only keeps them if the slot is pointing to the last item in the node. This is for use with btrfs_next_leaf, which needs this sort of thing. But the horrible horrible things I'm going to do to the tree log means I really need everything held from root to leaf so I can add and delete items in the same search. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5f3ab90a |
|
07-Dec-2012 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename root_times_lock to root_item_lock Originally root_times_lock was introduced as part of send/receive code however newly developed patch to label the subvol reused the same lock, so renaming it for a meaningful name. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6c1500f2 |
|
03-Nov-2012 |
Julia Lawall <Julia.Lawall@lip6.fr> |
fs/btrfs: drop if around WARN_ON Just use WARN_ON rather than an if containing only WARN_ON(1). A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression e; @@ - if (e) WARN_ON(1); + WARN_ON(e); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
31b1a2bd |
|
03-Nov-2012 |
Julia Lawall <Julia.Lawall@lip6.fr> |
fs/btrfs: use WARN Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
32adf090 |
|
18-Oct-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: cleanup unused arguments 'disk_key' is not used at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0e411ece |
|
19-Oct-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: kill unnecessary arguments in del_ptr The argument 'tree_mod_log' is not necessary since all of callers enable it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6a7a665d |
|
19-Oct-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: reorder tree mod log operations in deleting a pointer Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
95c80bb1 |
|
19-Oct-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside an extent buffer node, and the node's number of items remains unchanged (unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that). So we don't need to increase node's number of items during rewinding, otherwise we may get an node larger than leafsize and cause general protection errors later. Here is the details, - If we do memory move for inserting a single pointer, we need to add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding. - If we do memory move for deleting a single pointer, we need to decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for deleting. - If we do memory move for balance left/right, we need to decrease node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
7bfdcf7f |
|
25-Oct-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix memory leak when cloning root's node After cloning root's node, we forgot to dec the src's ref which can lead to a memory leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
01763a2e |
|
23-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: comment for loop in tree_mod_log_insert_move Emphasis the way tree_mod_log_insert_move avoids adding MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of the move operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
d6381084 |
|
23-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix extent buffer reference for tree mod log roots In get_old_root we grab a lock on the extent buffer before we obtain a reference on that buffer. That order is changed now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
5b6602e7 |
|
23-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: determine level of old roots In btrfs_find_all_roots' termination condition, we compare the level of the old buffer we got from btrfs_search_old_slot to the level of the current root node. We'd better compare it to the level of the rewinded root node. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
834328a8 |
|
23-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: tree mod log's old roots could still be part of the tree Tree mod log treated old root buffers as always empty buffers when starting the rewind operations. However, the old root may still be part of the current tree at a lower level, with still some valid entries. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
ba1bfbd5 |
|
22-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix a tree mod logging issue for root replacement operations Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in two places. Where needed, we call tree_mod_log_free_eb explicitly now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
57911b8b |
|
19-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: don't put removals from push_node_left into tree mod log twice Independant of the check (push_items < src_items) tree_mod_log_eb_copy did log the removal of the old data entries from the source buffer. Therefore, we must not call tree_mod_log_eb_move if the check evaluates to true, as that would log the removal twice, finally resulting in (rewinded) buffers with wrong values for header_nritems. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
8d1a1317 |
|
29-Sep-2012 |
Robin Dong <sanbai@taobao.com> |
btrfs: remove unused function btrfs_insert_some_items() The function btrfs_insert_some_items() would not be called by any other functions, so remove it. Signed-off-by: Robin Dong <sanbai@taobao.com>
|
#
74dd17fb |
|
07-Aug-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: fix btrfs send for inline items and compression The btrfs send code was assuming the offset of the file item into the extent translated to bytes on disk. If we're compressed, this isn't true, and so it was off into extents owned by other files. It was also improperly handling inline extents. This solves a crash where we may have gone past the end of the file extent item by not testing early enough for an inline extent. It also solves problems where we have a whole between the end of the inline item and the start of the full extent. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b12a3b1e |
|
07-Aug-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: don't run __tree_mod_log_free_eb on leaves When we split a leaf, we may end up inserting a new root on top of that leaf. The reflog code was incorrectly assuming the old root was always a node. This makes sure we skip over leaves. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1fa11e26 |
|
06-Aug-2012 |
Arne Jansen <sensille@gmx.net> |
Btrfs: fix deadlock in wait_for_more_refs Commit a168650c introduced a waiting mechanism to prevent busy waiting in btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations, where a tree_mod_seq is held while waiting for the io to complete, while the end_io calls btrfs_run_delayed_refs. This whole mechanism is unnecessary. If not enough runnable refs are available to satisfy count, just return as count is more like a guideline than a strict requirement. In case we have to run all refs, commit transaction makes sure that no other threads are working in the transaction anymore, so we just assert here that no refs are blocked. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
7069830a |
|
05-Jun-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: add btrfs_compare_trees function This function is used to find the differences between two trees. The tree compare skips whole subtrees if it detects shared tree blocks and thus is pretty fast. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
|
#
e6793769 |
|
13-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: add helper for tree enumeration Often no exact match is wanted but just the next lower or higher item. There's a lot of duplicated code throughout btrfs to deal with the corner cases. This patch adds a helper function that can facilitate searching. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
e6466e35 |
|
04-Jul-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix buffer leak in btrfs_next_old_leaf When calling btrfs_next_old_leaf, we were leaking an extent buffer in the rare case of using the deadlock avoidance code needed for the tree mod log. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2f38b3e1 |
|
13-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: add helper for tree enumeration Often no exact match is wanted but just the next lower or higher item. There's a lot of duplicated code throughout btrfs to deal with the corner cases. This patch adds a helper function that can facilitate searching. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
097b8a7c |
|
21-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: join tree mod log code with the code holding back delayed refs We've got two mechanisms both required for reliable backref resolving (tree mod log and holding back delayed refs). You cannot make use of one without the other. So instead of requiring the user of this mechanism to setup both correctly, we join them into a single interface. Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list as we did before, which was of no value. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
cf538830 |
|
04-Jul-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix buffer leak in btrfs_next_old_leaf When calling btrfs_next_old_leaf, we were leaking an extent buffer in the rare case of using the deadlock avoidance code needed for the tree mod log. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
d42244a0 |
|
22-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: resolve tree mod log locking issue in btrfs_next_leaf With the tree mod log, we may end up with two roots (the current root and a rewinded version of it) both pointing to two leaves, l1 and l2, of which l2 had already been cow-ed in the current transaction. If we don't rewind any tree blocks, we cannot have two roots both pointing to an already cowed tree block. Now there is btrfs_next_leaf, which has a leaf locked and wants a lock on the next (right) leaf. And there is push_leaf_left, which has a (cowed!) leaf locked and wants a lock on the previous (left) leaf. In order to solve this dead lock situation, we use try_lock in btrfs_next_leaf (only in case it's called with a tree mod log time_seq paramter) and if we fail to get a lock on the next leaf, we give up our lock on the current leaf and retry from the very beginning. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
19956c7e |
|
22-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix tree mod log rewind of ADD operations When a MOD_LOG_KEY_ADD operation is rewinded, we remove the key from the tree block. If its not the last key, removal involves a move operation. This move operation was explicitly done before this commit. However, at insertion time, there's a move operation before the actual addition to make room for the new key, which is recorded in the tree mod log as well. This means, we must drop the move operation when rewinding the add operation, because the next operation we'll be rewinding will be the corresponding MOD_LOG_MOVE_KEYS operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
c3e06965 |
|
21-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: always put insert_ptr modifications into the tree mod log Several callers of insert_ptr set the tree_mod_log parameter to 0 to avoid addition to the tree mod log. In fact, we need all of those operations. This commit simply removes the additional parameter and makes addition to the tree mod log unconditional. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
28da9fb4 |
|
21-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix tree mod log for root replacements at leaf level For the tree mod log, we don't log any operations at leaf level. If the root is at the leaf level (i.e. the tree consists only of the root), then __tree_mod_log_oldest_root will find a ROOT_REPLACE operation in the log (because we always log that one no matter which level), but no other operations. With this patch __tree_mod_log_oldest_root exits cleanly instead of BUGging in this situation. get_old_root checks if its really a root at leaf level in case we don't have any operations and WARNs if this assumption breaks. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
4325edd0 |
|
15-Jun-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: init old_generation in get_old_root gcc was giving an uninit variable warning here. Strictly speaking we don't need to init it, but this will make things much less error prone. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3310c36e |
|
11-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix race in tree mod log addition When adding to the tree modification log, we grab two locks at different stages. We must not drop the outer lock until we're done with section protected by the inner lock. This moves the unlock call for the outer lock to the appropriate position. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
3d7806ec |
|
11-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add btrfs_next_old_leaf To make sense of the tree mod log, the backref walker not only needs btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was calling btrfs_search_slot. This obviously didn't give the correct result. This commit adds btrfs_next_old_leaf, a drop-in replacement for btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot with this time_seq parameter. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
a95236d9 |
|
05-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix return value for __tree_mod_log_oldest_root In __tree_mod_log_oldest_root() we must return the found operation even if it's not a ROOT_REPLACE operation. Otherwise, the caller assumes that there are no operations to be rewinded and returns immediately. The code in the caller is modified to improve readability. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
8ba97a15 |
|
04-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: use btrfs_read_lock_root_node in get_old_root get_old_root could race with root node updates because we weren't locking the node early enough. Use btrfs_read_lock_root_node to grab the root locked in the very beginning and release the lock as soon as possible (just like btrfs_search_slot does). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
4d5a0565 |
|
30-Apr-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: remove call to btrfs_header_nritems with no effect This is a leftover from cleanup patch 559af821. Before the cleanup, btrfs_header_nritems was called inside an if condition. As it has no side effects we need to preserve here, it should simply be dropped. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
c3193108 |
|
31-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix tree mod log rewinded level and rewinding of moved keys When we rewind REMOVE_WHILE_FREEING operations, there's code that allocates a fresh buffer instead of cloning the old one. Setting that buffer's level correctly was missing in this case. When rewinding a MOVE_KEYS operation, btrfs_node_key_ptr_offset(slot) was missing for memmove_extent_buffer()'s arguments. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
f395694c |
|
31-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix tree mod log del_ptr Logging for del_ptr when we're not deleting the last pointer was wrong. This fixes both, duplicate log entries and log sequence. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
e9b7fd4d |
|
31-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add tree_mod_dont_log helper Replace duplicate code by small inline helper function. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
926dd8a6 |
|
31-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add missing spin_lock for insertion into tree mod log tree_mod_alloc calls __get_tree_mod_seq and must acquire a spinlock before doing so. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
018642a1 |
|
29-May-2012 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: return value of btrfs_read_buffer is checked correctly btrfs_read_buffer() has the possibility of returning the error. Therefore, I add the code in which the return value of btrfs_read_buffer() is checked. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
|
#
5d9e75c4 |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add btrfs_search_old_slot The tree modification log together with the current state of the tree gives a consistent, old version of the tree. btrfs_search_old_slot is used to search through this old version and return old (dummy!) extent buffers. Naturally, this function cannot do any tree modifications. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
f3ea38da |
|
26-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add del_ptr and insert_ptr modifications to the tree mod log Record all relevant modifications to block pointers in the tree mod log so that we can rewind them later on for backref walking. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
f230475e |
|
26-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: put all block modifications into the tree mod log When running functions that can make changes to the internal trees (e.g. btrfs_search_slot), we check if somebody may be interested in the block we're currently modifying. If so, we record our modification to be able to rewind it later on. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
bd989ba3 |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add tree modification log functions The tree mod log will log modifications made fs-tree nodes. Most modifications are done by autobalance of the tree. Such changes are recorded as long as a block entry exists. When released, the log is cleaned. With the tree modification log, it's possible to reconstruct a consistent old state of the tree. This is required to do backref walking on a busy file system. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
5581a51a |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: don't set for_cow parameter for tree block functions Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
f775738f |
|
30-Mar-2012 |
Wang Sheng-Hui <shhuiw@gmail.com> |
btrfs/ctree.c: remove the unnecessary 'return -1;' at the end of bin_search The code path should not reach there. Remove it. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
|
#
b9fab919 |
|
06-May-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: avoid sleeping in verify_parent_transid while atomic verify_parent_transid needs to lock the extent range to make sure no IO is underway, and so it can safely clear the uptodate bits if our checks fail. But, a few callers are using it with spinlocks held. Most of the time, the generation numbers are going to match, and we don't want to switch to a blocking lock just for the error case. This adds an atomic flag to verify_parent_transid, and changes it to return EAGAIN if it needs to block to properly verifiy things. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e5846fc6 |
|
02-May-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add properly locking around add_root_to_dirty_list add_root_to_dirty_list happens once at the very beginning of the transaction, but it is still racey. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f7c79f30 |
|
19-Mar-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: adjust the write_lock_level as we unlock btrfs_search_slot sometimes needs write locks on high levels of the tree. It remembers the highest level that needs a write lock and will use that for all future searches through the tree in a given call. But, very often we'll just cow the top level or the level below and we won't really need write locks on the root again after that. This patch changes things to adjust the write lock requirement as it unlocks levels. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cfed81a0 |
|
03-Mar-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add the ability to cache a pointer into the eb This cuts down on the CPU time used by map_private_extent_buffer Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3083ee2e |
|
09-Mar-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: introduce free_extent_buffer_stale Because btrfs cow's we can end up with extent buffers that are no longer necessary just sitting around in memory. So instead of evicting these pages, we could end up evicting things we actually care about. Thus we have free_extent_buffer_stale for use when we are freeing tree blocks. This will make it so that the ref for the eb being in the radix tree is dropped as soon as possible and then is freed when the refcount hits 0 instead of waiting to be released by releasepage. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
79787eaa |
|
12-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: replace many BUG_ONs with proper error handling btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
305a26af |
|
01-Sep-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Go readonly on tree errors in balance_level balace_level() seems to deal with missing tree nodes by BUG_ON(). Instead, we can easily just set the file system readonly and bubble -EROFS back up the stack. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
|
#
b68dc2a9 |
|
29-Aug-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Don't BUG_ON errors from update_ref_for_cow() __btrfs_cow_block(), the only caller of update_ref_for_cow() will BUG_ON() any error return. Instead, we can go read-only fs as update_ref_for_cow() manipulates disk data in a way which doesn't look like it's easily rolled back. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
e5df9573 |
|
29-Aug-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Go readonly on bad extent refs in update_ref_for_cow() update_ref_for_cow() will BUG_ON() after it's call to btrfs_lookup_extent_info() if no existing references are found. Since refs are computed directly from disk, this should be treated as a corruption instead of a logic error. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
be1a5564 |
|
08-Aug-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Don't BUG_ON() errors in update_ref_for_cow() The only caller of update_ref_for_cow() is __btrfs_cow_block() which was originally ignoring any return values. update_ref_for_cow() however doesn't look like a candidate to become a void function - there are a few places where errors can occur. So instead I changed update_ref_for_cow() to bubble all errors up (instead of BUG_ON). __btrfs_cow_block() was then updated to catch and BUG_ON() any errors from update_ref_for_cow(). The end effect is that we have no change in behavior, but about 8 different places where a BUG_ON(ret) was removed. Obviously a future patch will have to address the BUG_ON() in __btrfs_cow_block(). Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
143bede5 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: return void in functions without error conditions Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
66d7e7f0 |
|
12-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: mark delayed refs as for cow Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value from every call site. The for_cow parameter will later on be used to determine if a ref will change anything with respect to qgroups. Delayed refs coming from relocation are always counted as for_cow, as they don't change subvol quota. Also pass in the fs_info for later use. btrfs_find_all_roots() will use this as an optimization, as changes that are for_cow will not change anything with respect to which root points to a certain leaf. Thus, we don't need to add the current sequence number to those delayed refs. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
f1ebcc74 |
|
14-Nov-2011 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush The btrfs snapshotting code requires that once a root has been snapshotted, we don't change it during a commit. But there are two cases to lead to tree corruptions: 1) multi-thread snapshots can commit serveral snapshots in a transaction, and this may change the src root when processing the following pending snapshots, which lead to the former snapshots corruptions; 2) the free inode cache was changing the roots when it root the cache, which lead to corruptions. This fixes things by making sure we force COW the block after we create a snapshot during commiting a transaction, then any changes to the roots will result in COW, and we get all the fs roots and snapshot roots to be consistent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a05a9bb1 |
|
06-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix array bound checking Otherwise we can execced the array bound of path->slots[]. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
31533fb2 |
|
26-Jul-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: remove lockdep magic from btrfs_next_leaf Before the reader/writer locks, btrfs_next_leaf needed to keep the path blocking to avoid making lockdep upset. Now that btrfs_next_leaf only takes read locks, this isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bd681513 |
|
16-Jul-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: switch the btrfs tree locks to reader/writer The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a6591715 |
|
18-Jul-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: stop using highmem for extent_buffers The extent_buffers have a very complex interface where we use HIGHMEM for metadata and try to cache a kmap mapping to access the memory. The next commit adds reader/writer locks, and concurrent use of this kmap cache would make it even more complex. This commit drops the ability to use HIGHMEM with extent buffers, and rips out all of the related code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ad3e34bb |
|
08-Jun-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: don't map extent buffer if path->skip_locking is set Arne's scrub stuff exposed a problem with mapping the extent buffer in reada_for_search. He searches the commit root with multiple threads and with skip_locking set, so we can race and overwrite node->map_token since node isn't locked. So fix this so that we only map the extent buffer if we don't already have a map_token and skip_locking isn't set. Without this patch scrub would panic almost immediately, with the patch it doesn't panic anymore. Thanks, Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
25b8b936 |
|
08-Jun-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: don't map extent buffer if path->skip_locking is set Arne's scrub stuff exposed a problem with mapping the extent buffer in reada_for_search. He searches the commit root with multiple threads and with skip_locking set, so we can race and overwrite node->map_token since node isn't locked. So fix this so that we only map the extent buffer if we don't already have a map_token and skip_locking isn't set. Without this patch scrub would panic almost immediately, with the patch it doesn't panic anymore. Thanks, Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
1cd30799 |
|
18-May-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item Currently, btrfs_truncate_item and btrfs_extend_item returns only 0. So, the check by BUG_ON in the caller is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
026fd317 |
|
13-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: don't always do readahead Our readahead is sort of sloppy, and really isn't always needed. For example if ls is doing a stating ls (which is the default) it's going to stat in non-disk order, so if say you have a directory with a stupid amount of files, readahead is going to do nothing but waste time in the case of doing the stat. Taking the unconditional readahead out made my test go from 57 minutes to 36 minutes. This means that everywhere we do loop through the tree we want to make sure we do set path->reada properly, so I went through and found all of the places where we loop through the path and set reada to 1. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
7e2355ba |
|
10-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: don't look at the extent buffer level 3 times in a row We have a bit of debugging in btrfs_search_slot to make sure the level of the cow block is the same as the original block we were cow'ing. I don't think I've ever seen this tripped, so kill it. This saves us 2 kmap's per level in our search. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
cb25c2ea |
|
10-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: map the node block when looking for readahead targets If we have particularly full nodes, we could call btrfs_node_blockptr up to 32 times, which is 32 pairs of kmap/kunmap, which _sucks_. So go ahead and map the extent buffer while we look for readahead targets. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
16cdcec7 |
|
22-Apr-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: implement delayed inode items operation Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dave@jikos.cz> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b3b4aa74 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: drop unused parameter from btrfs_release_path parameter tree root it's not used since commit 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer interface for large blocksizes") Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
62a45b60 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: make functions static when possible Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
edc95aec |
|
19-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: remove nested duplicate variable declarations Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
97d9a8a4 |
|
24-Mar-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: check return value of read_tree_block() This patch is checking return value of read_tree_block(), and if it is NULL, error processing. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
db5b493a |
|
23-Mar-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: cleanup some BUG_ON() This patch changes some BUG_ON() to the error return. (but, most callers still use BUG_ON()) Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1abe9b8a |
|
24-Mar-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: add initial tracepoint support for btrfs Tracepoints can provide insight into why btrfs hits bugs and be greatly helpful for debugging, e.g dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0 dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0 btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0) btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0) btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8 flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0) flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0) flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0) Here is what I have added: 1) ordere_extent: btrfs_ordered_extent_add btrfs_ordered_extent_remove btrfs_ordered_extent_start btrfs_ordered_extent_put These provide critical information to understand how ordered_extents are updated. 2) extent_map: btrfs_get_extent extent_map is used in both read and write cases, and it is useful for tracking how btrfs specific IO is running. 3) writepage: __extent_writepage btrfs_writepage_end_io_hook Pages are cirtical resourses and produce a lot of corner cases during writeback, so it is valuable to know how page is written to disk. 4) inode: btrfs_inode_new btrfs_inode_request btrfs_inode_evict These can show where and when a inode is created, when a inode is evicted. 5) sync: btrfs_sync_file btrfs_sync_fs These show sync arguments. 6) transaction: btrfs_transaction_commit In transaction based filesystem, it will be useful to know the generation and who does commit. 7) back reference and cow: btrfs_delayed_tree_ref btrfs_delayed_data_ref btrfs_delayed_ref_head btrfs_cow_block Btrfs natively supports back references, these tracepoints are helpful on understanding btrfs's COW mechanism. 8) chunk: btrfs_chunk_alloc btrfs_chunk_free Chunk is a link between physical offset and logical offset, and stands for space infomation in btrfs, and these are helpful on tracing space things. 9) reserved_extent: btrfs_reserved_extent_alloc btrfs_reserved_extent_free These can show how btrfs uses its space. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
240f62c8 |
|
23-Mar-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use RCU instead of a spinlock to protect the root node The pointer to the extent buffer for the root of each tree is protected by a spinlock so that we can safely read the pointer and take a reference on the extent buffer. But now that the extent buffers are freed via RCU, we can safely use rcu_read_lock instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a826d6dc |
|
16-Mar-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: check items for correctness as we search Currently if we have corrupted items things will blow up in spectacular ways. So as we read in blocks and they are leaves, check the entire leaf to make sure all of the items are correct and point to valid parts in the leaf for the item data the are responsible for. If the item is corrupt we will kick back EIO and not read any of the copies since they are likely to not be correct either. This will catch generic corruptions, it will be up to the individual callers of btrfs_search_slot to make sure their items are right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
91ca338d |
|
04-Jan-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
btrfs: check NULL or not Should check if functions returns NULL or not. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ff175d57 |
|
25-Dec-2010 |
Jesper Juhl <jj@chaosbits.net> |
btrfs: Don't pass NULL ptr to func that may deref it. Hi, In fs/btrfs/inode.c::fixup_tree_root_location() we have this code: ... if (!path) { err = -ENOMEM; goto out; } ... out: btrfs_free_path(path); return err; btrfs_free_path() passes its argument on to other functions and some of them end up dereferencing the pointer. In the code above that pointer is clearly NULL, so btrfs_free_path() will eventually cause a NULL dereference. There are many ways to cut this cake (fix the bug). The one I chose was to make btrfs_free_path() deal gracefully with NULL pointers. If you disagree, feel free to come up with an alternative patch. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
559af821 |
|
29-Oct-2010 |
Andi Kleen <andi@firstfloor.org> |
Btrfs: cleanup warnings from gcc 4.6 (nonbugs) These are all the cases where a variable is set, but not read which are not bugs as far as I can see, but simply leftovers. Still needs more review. Found by gcc 4.6's new warnings Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cb44921a |
|
24-Oct-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't loop forever on bad btree blocks When btrfs discovers the generation number in a btree block is incorrect, it can loop forever without forcing the RAID code to try a valid mirror, and without returning EIO. This changes things to properly kick out the EIO. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
99d8f83c |
|
07-Jul-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix split_leaf double split corner case split_leaf was not properly balancing leaves when it was forced to split a leaf twice. This commit adds an extra push left and right before forcing the double split in hopes of getting the slot where we want to insert at either the start or end of the leaf. If the extra pushes do work, then we are able to avoid splitting twice and we keep the tree properly balanced. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5bdd3536 |
|
26-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Fix block generation verification race After the path is released, the generation number got from block pointer is no long valid. The race may cause disk corruption, because verify_parent_transid() calls clear_extent_buffer_uptodate() when generation numbers mismatch. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3fd0a558 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Metadata ENOSPC handling for balance This patch adds metadata ENOSPC handling for the balance code. It is consisted by following major changes: 1. Avoid COW tree leave in the phrase of merging tree. 2. Handle interaction with snapshot creation. 3. make the backref cache can live across transactions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f0486c68 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Introduce contexts for metadata reservation Introducing metadata reseravtion contexts has two major advantages. First, it makes metadata reseravtion more traceable. Second, it can reclaim freed space and re-add them to the itself after transaction committed. Besides add btrfs_block_rsv structure and related helper functions, This patch contains following changes: Move code that decides if freed tree block should be pinned into btrfs_free_tree_block(). Make space accounting more accurate, mainly for handling read only block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
109f6aef |
|
02-Apr-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add check for changed leaves in setup_leaf_for_split setup_leaf_for_split needs to drop the path and search again, and has checks to see if the item we want to split changed size. But, it misses the case where the leaf changed and now has enough room for the item we want to insert. This adds an extra check to make sure the leaf really needs splitting before we call btrfs_split_leaf(), which keeps us from trying to split a leaf with a single item. btrfs_split_leaf() will blindly split the single item leaf, leaving us with one good leaf and one empty leaf and then a crash. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
86b9f2ec |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Fix per root used space accounting The bytes_used field in root item was originally planned to trace the amount of used data and tree blocks. But it never worked right since we can't trace freeing of data accurately. This patch changes it to only trace the amount of tree blocks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ad48fd75 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Add btrfs_duplicate_item btrfs_duplicate_item duplicates item with new key, guaranteeing the source item and the new items are in the same tree leaf and contiguous. It allows us to split file extent in place, without using lock_extent to prevent bookend extent race. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a5719521 |
|
24-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: check size of inode backref before adding hardlink For every hardlink in btrfs, there is a corresponding inode back reference. All inode back references for hardlinks in a given directory are stored in single b-tree item. The size of b-tree item is limited by the size of b-tree leaf, so we can only create limited number of hardlinks to a given file in a directory. The original code lacks of the check, it oops if the number of hardlinks goes over the limit. This patch fixes the issue by adding check to btrfs_link and btrfs_rename. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d717aa1d |
|
23-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Avoid delayed reference update looping btrfs_split_leaf and btrfs_del_items can end up in a loop where one is constantly spliting a given leaf and the other is constantly merging it back with the adjacent nodes. There is a better fix for this, but in the interest of something small, this patch just changes btrfs_del_items back to balancing less often. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0a4eefbb |
|
24-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Fix ordering of key field checks in btrfs_previous_item Check objectid of item before checking the item type, otherwise we may return zero for a key that is actually too low. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
20736aba |
|
24-Jul-2009 |
Diego Calleja <diegocg@gmail.com> |
Btrfs: Remove code duplication in comp_keys comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just call it. Signed-off-by: Diego Calleja <diegocg@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
33c66f43 |
|
22-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix locking issue in btrfs_find_next_key When walking up the tree, btrfs_find_next_key assumes the upper level tree block is properly locked. This isn't always true even path->keep_locks is 1. This is because btrfs_find_next_key may advance path->slots[] several times instead of only once. When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found, we can't guarantee the original value of 'path->slots[level]' is 'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at 'level + 1' isn't locked. This patch fixes the issue by explicitly checking the locking state, re-searching the tree if it's not locked. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e457afec |
|
22-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf if 1 is returned by btrfs_search_slot, the path already points to the first item with 'key > searching key'. So increasing path->slots[0] by one is superfluous in that case. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cfbb9308 |
|
18-May-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: balance btree more often With the new back reference code, the cost of a balance has gone down in terms of the number of back reference updates done. This commit makes us more aggressively balance leaves and nodes as they become less full. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b3612421 |
|
13-May-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: stop avoiding balancing at the end of the transaction. When the delayed reference code was added, some checks were added to avoid extra balancing while the delayed references were being flushed. This made for less efficient btrees, but it reduced the chances of loops where no forward progress was made because the balances made more delayed ref updates. With the new dead root removal code and the mixed back references, the extent allocation tree is no longer using precise back refs, and the delayed reference updates don't carry the risk of looping forever anymore. So, the balance avoidance is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5d4f98a2 |
|
10-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
76a05b35 |
|
14-May-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Don't loop forever on metadata IO failures When a btrfs metadata read fails, the first thing we try to do is find a good copy on another mirror of the block. If this fails, read_tree_block() ends up returning a buffer that isn't up to date. The btrfs btree reading code was reworked to drop locks and repeat the search when IO was done, but the changes didn't add a check for failed reads. The end result was looping forever on buffers that were never going to become up to date. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8c594ea8 |
|
20-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use the right node in reada_for_balance reada_for_balance was using the wrong index into the path node array, so it wasn't reading the right blocks. We never directly used the results of the read done by this function because the btree search is started over at the end. This fixes reada_for_balance to reada in the correct node and to avoid searching past the last slot in the node. It also makes sure to hold the parent lock while we are finding the nodes to read. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c293498b |
|
02-Apr-2009 |
Stoyan Gaydarov <stoyboyker@gmail.com> |
Btrfs: BUG to BUG_ON changes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8e73f275 |
|
03-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Optimize locking in btrfs_next_leaf() btrfs_next_leaf was using blocking locks when it could have been using faster spinning ones instead. This adds a few extra checks around the pieces that block and switches over to spinning locks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c8c42864 |
|
03-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: break up btrfs_search_slot into smaller pieces btrfs_search_slot was doing too many things at once. This breaks it up into more reasonable units. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a4b6e07d |
|
16-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: limit balancing work while flushing delayed refs The delayed reference mechanism is responsible for all updates to the extent allocation trees, including those updates created while processing the delayed references. This commit tries to limit the amount of work that gets created during the final run of delayed refs before a commit. It avoids cowing new blocks unless it is required to finish the commit, and so it avoids new allocations that were not really required. The goal is to avoid infinite loops where we are always making more work on the final run of delayed refs. Over the long term we'll make a special log for the last delayed ref updates as well. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b9473439 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: leave btree locks spinning more often btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
44871b1b |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: reduce stack usage in some crucial tree balancing functions Many of the tree balancing functions follow the same pattern. 1) cow a block 2) do something to the result This commit breaks them up into two functions so the variables and code required for part two don't suck down stack during part one. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
56bec294 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9fa8cfe7 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't preallocate metadata blocks during btrfs_search_slot In order to avoid doing expensive extent management with tree locks held, btrfs_search_slot will preallocate tree blocks for use by COW without any tree locks held. A later commit moves all of the extent allocation work for COW into a delayed update mechanism, and this preallocation will no longer be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b9447ef8 |
|
09-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix spinlock assertions on UP systems btrfs_tree_locked was being used to make sure a given extent_buffer was properly locked in a few places. But, it wasn't correct for UP compiled kernels. This switches it to using assert_spin_locked instead, and renames it to btrfs_assert_tree_locked to better reflect how it was really being used. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4008c04a |
|
12-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make a lockdep class for the extent buffer locks Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock orderin. The lockdep magic cam from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e00f7308 |
|
12-Feb-2009 |
Jeff Mahoney <jeffm@suse.com> |
Btrfs: remove btrfs_init_path btrfs_init_path was initially used when the path objects were on the stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path isn't required. This patch removes it, and just uses kmem_cache_zalloc to zero out the object. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7951f3ce |
|
12-Feb-2009 |
Jeff Mahoney <jeffm@suse.com> |
Btrfs: balance_level checks !child after access The BUG_ON() is in the wrong spot. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
284b066a |
|
09-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't use spin_is_contended Btrfs was using spin_is_contended to see if it should drop locks before doing extent allocations during btrfs_search_slot. The idea was to avoid expensive searches in the tree unless the lock was actually contended. But, spin_is_contended is specific to the ticket spinlocks on x86, so this is causing compile errors everywhere else. In practice, the contention could easily appear some time after we started doing the extent allocation, and it makes more sense to always drop the lock instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7b78c170 |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Only prep for btree deletion balances when nodes are mostly empty Whenever an item deletion is done, we need to balance all the nodes in the tree to make sure we don't end up with an empty node if a pointer is deleted. This balance prep happens from the root of the tree down so we can drop our locks as we go. reada_for_balance was triggering read-ahead on neighboring nodes even when no balancing was required. This adds an extra check to avoid calling balance_level() and avoid reada_for_balance() when a balance won't be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
12f4dacc |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix btrfs_unlock_up_safe to walk the entire path btrfs_unlock_up_safe would break out at the first NULL node entry or unlocked node it found in the path. Some of the callers have missing nodes at the lower levels of the path, so this commit fixes things to check all the nodes in the path before returning. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4d081c41 |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: change btrfs_del_leaf to drop locks earlier btrfs_del_leaf does two things. First it removes the pointer in the parent, and then it frees the block that has the leaf. It has the parent node locked for both operations. But, it only needs the parent locked while it is deleting the pointer. After that it can safely free the block without the parent locked. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b4ce94de |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change btree locking to use explicit blocking points Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c487685d |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: hash_lock is no longer needed Before metadata is written to disk, it is updated to reflect that writeout has begun. Once this update is done, the block must be cow'd before it can be modified again. This update was originally synchronized by using a per-fs spinlock. Today the buffers for the metadata blocks are locked before writeout begins, and everyone that tests the flag has the buffer locked as well. So, the per-fs spinlock (called hash_lock for no good reason) is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a7175319 |
|
22-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: do less aggressive btree readahead Just before reading a leaf, btrfs scans the node for blocks that are close by and reads them too. It tries to build up a large window of IO looking for blocks that are within a max distance from the top and bottom of the IO window. This patch changes things to just look for blocks within 64k of the target block. It will trigger less IO and make for lower latencies on the read size. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d397712b |
|
05-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
87b29b20 |
|
17-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: properly check free space for tree balancing btrfs_insert_empty_items takes the space needed by the btrfs_item structure into account when calculating the required free space. So the tree balancing code shouldn't add sizeof(struct btrfs_item) to the size when checking the free space. This patch removes these superfluous additions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
42dc7bab |
|
15-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix compressed writes on truncated pages The compression code was using isize to limit the amount of data it sent through zlib. But, it wasn't properly limiting the looping to just the pages inside i_size. The end result was trying to compress too many pages, including those that had not been setup and properly locked down. This made the compression code oops while trying find_get_page on a page that didn't exist. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
459931ec |
|
10-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Delete csum items when freeing extents This finishes off the new checksumming code by removing csum items for extents that are no longer in use. The trick is doing it without racing because a single csum item may hold csums for more than one extent. Extra checks are added to btrfs_csum_file_blocks to make sure that we are using the correct csum item after dropping locks. A new btrfs_split_item is added to split a single csum item so it can be split without dropping the leaf lock. This is used to remove csum bytes from the middle of an item. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
934d375b |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use map_private_extent_buffer during generic_bin_search It is possible that generic_bin_search will be called on a tree block that has not been locked. This happens because cache_block_block skips locking on the tree blocks. Since the tree block isn't locked, we aren't allowed to change the extent_buffer->map_token field. Using map_private_extent_buffer avoids any changes to the internal extent buffer fields. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b2950863 |
|
02-Dec-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: make things static and include the right headers Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
b4eec2ca |
|
18-Nov-2008 |
Liu Hui <onlyflyer@gmail.com> |
Btrfs: Some fixes for batching extent insert. In insert_extents(), when ret==1 and last is not zero, it should check if the current inserted item is the last item in this batching inserts. If so, it should just break from loop. If not, 'cur = insert_list->next' will make no sense because the list is empty now, and 'op' will point to an unexpectable place. There are also some trivial fixs in this patch including one comment typo error and deleting two redundant lines. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b82032c |
|
17-Nov-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Seed device support Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
f3465ca4 |
|
12-Nov-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: batch extent inserts/updates/deletions on the extent root While profiling the allocator I noticed a good amount of time was being spent in finish_current_insert and del_pending_extents, and as the filesystem filled up more and more time was being spent in those functions. This patch aims to try and reduce that problem. This happens two ways 1) track if we tried to delete an extent that we are going to update or insert. Once we get into finish_current_insert we discard any of the extents that were marked for deletion. This saves us from doing unnecessary work almost every time finish_current_insert runs. 2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for each individual extent and doing the needed operation, we instead keep the leaf around and see if there is anything else we can do on that leaf. On the insert case I introduced a btrfs_insert_some_items, which will take an array of keys with an array of data_sizes and try and squeeze in as many of those keys as possible, and then return how many keys it was able to insert. In the update case we search for an extent ref, update the ref and then loop through the leaf to see if any of the other refs we are looking to update are on that leaf, and then once we are done we release the path and search for the next ref we need to update. And finally for the deletion we try and delete the extent+ref in pairs, so we will try to find extent+ref pairs next to the extent we are trying to free and free them in bulk if possible. This along with the other cluster fix that Chris pushed out a bit ago helps make the allocator preform more uniformly as it fills up the disk. There is still a slight drop as we fill up the disk since we start having to stick new blocks in odd places which results in more COW's than on a empty fs, but the drop is not nearly as severe as it was before. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
6f3577bd |
|
13-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Improve metadata read latencies This fixes latency problems on metadata reads by making sure they don't go through the async submit queue, and by tuning down the amount of readahead done during btree searches. Also, the btrfs bdi congestion function is tuned to ignore the number of pending async bios and checksums pending. There is additional code that throttles new async bios now and the congestion function doesn't need to worry about it anymore. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
25179201 |
|
29-Oct-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: nuke fs wide allocation mutex V2 This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
f82d02d9 |
|
29-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Improve space balancing code This patch improves the space balancing code to keep more sharing of tree blocks. The only case that breaks sharing of tree blocks is data extents get fragmented during balancing. The main changes in this patch are: Add a 'drop sub-tree' function. This solves the problem in old code that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block. Remove relocation mapping tree. Relocation mappings are stored in struct btrfs_ref_path and updated dynamically during walking up/down the reference path. This reduces CPU usage and simplifies code. This patch also fixes a bug. Root items for reloc trees should be updated in btrfs_free_reloc_root. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
3bb1a1bc |
|
09-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Remove offset field from struct btrfs_extent_ref The offset field in struct btrfs_extent_ref records the position inside file that file extent is referenced by. In the new back reference system, tree leaves holding references to file extent are recorded explicitly. We can scan these tree leaves very quickly, so the offset field is not required. This patch also makes the back reference system check the objectid when extents are in deleting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
323ac95b |
|
01-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't read leaf blocks containing only checksums during truncate Checksum items take up a significant portion of the metadata for large files. It is possible to avoid reading them during truncates by checking the keys in the higher level nodes. If a given leaf is followed by another leaf where the lowest key is a checksum item from the same file, we know we can safely delete the leaf without reading it. For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete the file with a cold cache. It is read bound during the run. With this change, Btrfs is able to delete the file in 0.5s Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d352ac68 |
|
29-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add and improve comments This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1a40e23b |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: update space balancing code This patch updates the space balancing code to utilize the new backref format. Before, btrfs-vol -b would break any COW links on data blocks or metadata. This was slow and caused the amount of space used to explode if a large number of snapshots were present. The new code can keeps the sharing of all data extents and most of the tree blocks. To maintain the sharing of data extents, the space balance code uses a seperate inode hold data extent pointers, then updates the references to point to the new location. To maintain the sharing of tree blocks, the space balance code uses reloc trees to relocate tree blocks in reference counted roots. There is one reloc tree for each subvol, and all reloc trees share same root key objectid. Reloc trees are snapshots of the latest committed roots of subvols (root->commit_root). To relocate a tree block referenced by a subvol, there are two steps. COW the block through subvol's reloc tree, then update block pointer in the subvol to point to the new block. Since all reloc trees share same root key objectid, doing special handing for tree blocks owned by them is easy. Once a tree block has been COWed in one reloc tree, we can use the resulting new block directly when the same block is required to COW again through other reloc trees. In this way, relocated tree blocks are shared between reloc trees, so they are also shared between subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5b21f2ed |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: extent_map and data=ordered fixes for space balancing * Add an EXTENT_BOUNDARY state bit to keep the writepage code from merging data extents that are in the process of being relocated. This allows us to do accounting for them properly. * The balancing code relocates data extents indepdent of the underlying inode. The extent_map code was modified to properly account for things moving around (invalidating extent_map caches in the inode). * Don't take the drop_mutex in the create_subvol ioctl. It isn't required. * Fix walking of the ordered extent list to avoid races with sys_unlink * Change the lock ordering rules. Transaction start goes outside the drop_mutex. This allows btrfs_commit_transaction to directly drop the relocation trees. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31840ae1 |
|
23-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: Full back reference support This patch makes the back reference system to explicit record the location of parent node for all types of extents. The location of parent node is placed into the offset field of backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f9dd46c |
|
23-Sep-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: free space accounting redo 1) replace the per fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas via offset and size. The reason to do this is because most allocations come with a hint byte where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset we fall back on looking for a chunk of the given size as close to that given offset as possible. When we fall back on the size search we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas if possible. 2) remove the extent_io_tree that tracked the block group cache from fs_info and replaced it with an rb-tree thats tracks block group cache via offset. also added a per space_info list that tracks the block group cache for the particular space so we can lookup related block groups easily. 3) cleaned up the allocation code to make it a little easier to read and a little less complicated. Basically there are 3 steps, first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails try to allocate space if we can and start looking again. If not we're screwed and need to start over again. 4) small fixes. there were some issues in volumes.c where we wouldn't allocate the rest of the disk. fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so everytime we did a data allocation we'd search through every block group that we have looking for free space. Now searching a block group with no free space isn't terribly time consuming, it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degredation and made things semi-normal. There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and make a significant performance gain. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f25956cc |
|
12-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Fix leaf overflow check in btrfs_insert_empty_items It was incorrectly adding an extra sizeof(struct btrfs_item) and causing false positives (oops) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b214107e |
|
05-Sep-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: trivial sparse fixes Fix a bunch of trivial sparse complaints. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ad3d81ba |
|
05-Sep-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: missing endianess conversion in insert_new_root Add two missing endianess conversions in this function, found by sparse. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e02119d5 |
|
05-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a write ahead tree log to optimize synchronous operations File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
65b51a00 |
|
01-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
btrfs_search_slot: reduce lock contention by cowing in two stages A btree block cow has two parts, the first is to allocate a destination block and the second is to copy the old bock over. The first part needs locks in the extent allocation tree, and may need to do IO. This changeset splits that into a separate function that can be called without any tree locks held. btrfs_search_slot is changed to drop its path and start over if it has to COW a contended block. This often means that many writers will pre-alloc a new destination for a the same contended block, but they cache their prealloc for later use on lower levels in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bcc63abb |
|
30-Jul-2008 |
Yan <zheng.yan@oracle.com> |
Btrfs: implement memory reclaim for leaf reference cache The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31153d81 |
|
28-Jul-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add a leaf reference cache Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9652480b |
|
23-Jul-2008 |
Yan <yanzheng@21cn.com> |
Fix path slots selection in btrfs_search_forward We should decrease the found slot by one as btrfs_search_slot does when bin_search return 1 and node level > 0. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7b128766 |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Create orphan inode records to prevent lost files after a crash Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0bd40a71 |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
btrfs_next_leaf: do readahead when skip_locking is turned on Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7d9eb12c |
|
08-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add locking around volume management (device add/remove/balance) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f9efa9c7 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Reduce contention on the root node This calls unlock_up sooner in btrfs_search_slot in order to decrease the amount of work done with the higher level tree locks held. Also, it changes btrfs_tree_lock to spin for a big against the page lock before scheduling. This makes a big difference in context switch rate under highly contended workloads. Longer term, a better locking structure is needed than the page lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3f157a2f |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Online btree defragmentation fixes The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e7a84565 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add btree locking to the tree defragmentation code The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a74a4b97 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the transaction work queue with kthreads This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
333db94c |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix snapshot deletion to release the alloc_mutex much more often. This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5cd57b2c |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
168fd7d2 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Fix btrfs_next_leaf to check for new items after dropping locks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
594a24eb |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks This allows us to delete an unlinked inode with dirty pages from the list instead of forcing commit to write these out before deleting the inode. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
051e1b9f |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Drop locks in btrfs_search_slot when reading a tree block. One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a2135011 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the big fs_mutex with a collection of other locks Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
925baedd |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Start btree concurrency work. The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ef3e66b |
|
24-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allocator fix variety pack * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1259ab75 |
|
12-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Handle write errors on raid1 and raid10 When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes to are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ca7a79ad |
|
11-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Pass down the expected generation number when reading tree blocks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bce4eae9 |
|
24-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix balance_level to free the middle block if there is room in the left one balance level starts by trying to empty the middle block, and then pushes from the right to the middle. This might empty the right block and leave a small number of pointers in the middle. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
971a1f66 |
|
24-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Don't empty the middle buffer in push_nodes_for_insert Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c448acf0 |
|
24-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix split_node to require more empty slots in the node as well Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1514794e |
|
24-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Make sure nodes have enough room for a double split Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
699122f5 |
|
15-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Don't wait on tree block writeback before freeing them anymore This isn't required anymore because we don't reallocate blocks that have already been written in this transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e17cade2 |
|
15-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add chunk uuids and update multi-device back references Block headers now store the chunk tree uuid Chunk items records the device uuid for each stripes Device extent items record better back refs to the chunk tree Block groups record better back refs to the chunk tree The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
85d824c4 |
|
10-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Disable extra debugging checks on tree blocks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f188591e |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Retry metadata reads in the face of checksum failures Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ce9adaa5 |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Do metadata checksums for reads via a workqueue Before, metadata checksumming was done by the callers of read_tree_block, which would set EXTENT_CSUM bits in the extent tree to show that a given range of pages was already checksummed and didn't need to be verified again. But, those bits could go away via try_to_releasepage, and the end result was bogus checksum failures on pages that never left the cache. The new code validates checksums when the page is read. It is a little tricky because metadata blocks can span pages and a single read may end up going via multiple bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cea9e445 |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Change btrfs_map_block to return a structure with mappings for all stripes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ef8b242 |
|
03-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Properly dirty buffers in the split corner cases Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0999df54 |
|
01-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Verify checksums on tree blocks found without read_tree_block Checksums were only verified by btrfs_read_tree_block, which meant the functions to probe the page cache for blocks were not validating checksums. Normally this is fine because the buffers will only be in cache if they have already been validated. But, there is a window while the buffer is being read from disk where it could be up to date in the cache but not yet verified. This patch makes sure all buffers go through checksum verification before they are used. This is safer, and it prevents modification of buffers before they go through the csum code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
63b10fc4 |
|
01-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Reorder the flags field in struct btrfs_header and record a flag on writeout This allows detection of blocks that have already been written in the running transaction so they can be recowed instead of modified again. It is step one in trusting the transid field of the block pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0b86a832 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for multiple devices per filesystem Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2f375ab9 |
|
01-Feb-2008 |
Yan <yanzheng@21cn.com> |
Call btrfs_cow_block while lowering tree level. When freeing root block of a tree, btrfs_free_extent' parameter 'ref_generation' is from root block itseft. When freeing non-root block, 'ref_generation' is from its parent. so when converting a non-root block to root block, we must guarantee its generation is equal to its parent's generation. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a01a2e3 |
|
30-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Copy correct tree when inserting into slot 0 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9c58309d |
|
29-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add inode item and backref in one insert, reducing cpu usage Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
85e21bac |
|
29-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: During deletes and truncate, remove many items at once from the tree Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dc17ff8f |
|
08-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add data=ordered support This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
98ed5174 |
|
03-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Force inlining off in a few places to save stack usage Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8f662a76 |
|
02-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for testing Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
01f46658 |
|
21-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Less aggressive readahead on deletes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4aec2b52 |
|
18-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
kmalloc a few large stack objects in the btrfs_ioctl path Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
be20aa9d |
|
17-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount option to turn off data cow A number of workloads do not require copy on write data or checksumming. mount -o nodatasum to disable checksums and -o nodatacow to disable both copy on write and checksumming. In nodatacow mode, copy on write is still performed when a given extent is under snapshot. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7bb86316 |
|
11-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add back pointers from extents to the btree or file referencing them Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
74493f7a |
|
11-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Implement generation numbers in block pointers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
eef1c494 |
|
26-Nov-2007 |
Yan <yanzheng@21cn.com> |
Btrfs: Properly update right_nritems in push_leaf_left The codes that fixup the right leaf and the codes that dirty the extnet buffer use the variable 'right_nritems' , both of them expect 'right_nritems' is the number of items in right leaf after the push. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
34a38218 |
|
07-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change push_leaf_{leaf,right} to empty the src leave during item deletion Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
081e9573 |
|
06-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Make defrag check nodes against the progress key to prevent repeating work Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
179e29e4 |
|
01-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix a number of inline extent problems that Yan Zheng reported. The fixes do a number of things: 1) Most btrfs_drop_extent callers will try to leave the inline extents in place. It can truncate bytes off the beginning of the inline extent if required. 2) writepage can now update the inline extent, allowing mmap writes to go directly into the inline extent. 3) btrfs_truncate_in_transaction truncates inline extents 4) extent_map.c fixed to not merge inline extent mappings and hole mappings together Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5708b959 |
|
25-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Tune the automatic defrag code 1) Forced defrag wasn't working properly (btrfsctl -d) because some cache only checks were incorrect. 2) Defrag only the leaves unless in forced defrag mode. 3) Don't use complex logic to figure out if a leaf is needs defrag Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cc0c5538 |
|
25-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix split_leaf to detect when it is extending an item When making room for a new item, it is ok to create an empty leaf, but when making room to extend an item, split_leaf needs to make sure it keeps the item we're extending in the path and make sure we don't end up with an empty leaf. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5ee78ac7 |
|
19-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix split_leaf to avoid incorrect double splits Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3685f791 |
|
19-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: CPU usage optimizations in push and the extent_map code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ae2f5411 |
|
19-Oct-2007 |
Jens Axboe <jens.axboe@oracle.com> |
btrfs: 32-bit type problems An assorted set of casts to get rid of the warnings on 32-bit archs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7936ca38 |
|
19-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Default to 8k max packed tails Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a6b6e75e |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Defrag only leaves, and only when the parent node has a single objectid This allows us to defrag huge directories, but skip the expensive defrag case in more common usage, where it does not help as much. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cf786e79 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Defrag: only walk into nodes with the defrag bit set Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f1ebbd1 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Large block related defrag optimizations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f82731f |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
810191ff |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: extent_map optimizations to cut down on CPU usage Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3326d1b0 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow tails larger than one page Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4dc11904 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add an extent buffer LRU to reduce radix tree hits Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6b80053d |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add back the online defragging code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
db94535d |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow tree blocks larger than the page size Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f510cfec |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix extent_buffer and extent_state leaks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
479965d6 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Optimizations for the extent_buffer code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5f39d397 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Create extent_buffer interface for large blocksizes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
86479a04 |
|
10-Sep-2007 |
Chris Mason <chris.mason@oracle.com> |
Add support for defragging files via btrfsctl -d. Avoid OOM on extent tree defrag. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
252c38f0 |
|
29-Aug-2007 |
Yan <yanzheng@21cn.com> |
Btrfs: ctree.c cleanups Fixup a few buffer_head release errors, and fix an off by one in balance_node_right. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2cc58cf2 |
|
27-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Do more extensive readahead during tree searches Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
32020611 |
|
27-Aug-2007 |
Yan <yanzheng@21cn.com> |
fix block readahead in btrfs_next_leaf Send the correct slot down to reada_for_search Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f2183bde |
|
10-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add BH_Defrag to mark buffers that are in need of defragging This allows the tree walking code to defrag only the newly allocated buffers, it seems to be a good balance between perfect defragging and the performance hit of repeatedly reallocating blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e9d0b13b |
|
10-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Btree defrag on the extent-mapping tree as well Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6702ed49 |
|
07-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add run time btree defrag, and an ioctl to force btree defrag This adds two types of btree defrag, a run time form that tries to defrag recently allocated blocks in the btree when they are still in ram, and an ioctl that forces defrag of all btree blocks. File data blocks are not defragged yet, but this can make a huge difference in sequential btree reads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3c69faec |
|
07-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fold some btree readahead routines into something more generic. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9f3a7427 |
|
07-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Do snapshot deletion in smaller chunks. Before, snapshot deletion was a single atomic unit. This caused considerable lock contention and required an unbounded amount of space. Now, the drop_progress field in the root item is used to indicate how far along snapshot deletion is, and to resume where it left off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a1f39630 |
|
11-Jul-2007 |
Aneesh <aneesh.kumar@gmail.com> |
Btrfs: Some code cleanups Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ec6b910f |
|
11-Jul-2007 |
Zach Brown <zach.brown@oracle.com> |
Btrfs: trivial include fixups Almost none of the files including module.h need to do so, remove them. Include sched.h in extent-tree.c to silence a warning about cond_resched() being undeclared. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ccd467d6 |
|
28-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: crash recovery fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
54aa1f4d |
|
22-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f1ace244 |
|
13-Jun-2007 |
Aneesh <aneesh.kumar@linux.vnet.ibm.com> |
btrfs: Code cleanup Attaching below is some of the code cleanups that i came across while reading the code. a) alloc_path already calls init_path. b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as the in memory copy ext4_inode as the disk copy Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6cbd5570 |
|
12-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add GPLv2 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
098f59c2 |
|
11-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: patch queue: fix corruption when splitting large items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8d7be552 |
|
10-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix check_node and check_leaf to use less cpu Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31f3c99b |
|
30-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allocator improvements, inode block groups Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4d775673 |
|
20-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add owner and type fields to the extents aand block headers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
33ade1f8 |
|
20-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: node balance optimizations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e66f709b |
|
20-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: write barriers on commit, balance level before split Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a429e513 |
|
18-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: working file_write, reorganized key flags Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
70b2befd |
|
17-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: rework csums and extent item ordering Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b18c6685 |
|
17-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: progress on file_write Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6567e837 |
|
16-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early work to file_write in big extents Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7eccb903 |
|
11-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: create a logical->phsyical block number mapping scheme Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d0dbc624 |
|
09-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: drop owner and parentid Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3eb0314d |
|
05-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: uuids Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d4dbff95 |
|
04-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: support for items bigger than 1/2 the blocksize Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
df24a2b9 |
|
04-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early inline file data code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2c90e5d6 |
|
02-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: still corruption hunting Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d6025579 |
|
30-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: corruption hunt continues Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
22b0ebda |
|
30-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: hunting slab corruption Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f254e52c |
|
29-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: verify csums on read Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9a6f11ed |
|
27-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: split out level field in struct header Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7f5c1516 |
|
23-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Add generation number to btrfs_header, readdir fixes, hash collision fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d5719762 |
|
23-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
btrfs_create, btrfs_write_super, btrfs_sync_fs Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e20d96d6 |
|
21-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Mountable btrfs, with readdir Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2e635a27 |
|
21-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: initial move to kernel module land Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
40689478 |
|
17-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: minor comments Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e089f05c |
|
16-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: transaction handles everywhere Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
88fd146c |
|
16-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: pin freed blocks from the FS tree too Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a8a2ee0c |
|
16-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a name_len to dir items, reorder key Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
62e2749e |
|
14-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use a chunk of the key flags to record the item type. Add (untested and simple) directory item code Fix comp_keys to use the new key ordering Add btrfs_insert_empty_item Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
123abc88 |
|
14-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: variable block size support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
eaee50e8 |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: merge leaves before split Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9aca1d51 |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make some funcs static Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
234b63a0 |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
rename funcs and structs to btrfs Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1d4f8a0c |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: node->blockptrs endian fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0783fcfc |
|
12-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: struct item endian fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e2fa7227 |
|
12-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: struct key endian fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7518a238 |
|
11-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: get/set for struct header fields Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
037e6390 |
|
07-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: get rid of add recursion Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a28ec197 |
|
06-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fixup reference counting on cows Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
02217ed2 |
|
02-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early reference counting Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f0930a37 |
|
02-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix extent code to use merge during delete Remove implicit commit in del_item and insert_item Add implicit commit to close() Add commit op in random-test Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ed2ff2cb |
|
01-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: pretend page cache & commit code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
79f95c82 |
|
01-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fixup the code to merge during path walks Add a bulk insert/remove test to random-test Add the quick-test code back as another regression test Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bb803951 |
|
28-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: merge on the way down during deletes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f70abe2 |
|
28-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: more return code checking Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
aa5d6bed |
|
28-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: return code checking Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8e19f2cd |
|
28-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Take out the merge-during-search-on-delete code, it is buggy. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fec577fb |
|
26-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add fsx-style randomized tree tester Add debug-tree command to print the tree Add extent-tree.c to the repo Comment ctree.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
97571fd0 |
|
24-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: cleanup & comment Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
00ec4c51 |
|
23-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: push_leaf_right Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5de08d7d |
|
24-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Break up ctree.c a little Extent fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9a8dd150 |
|
23-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Block sized tree extents and extent deletion Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5c680ed6 |
|
22-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: switch to early splits Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cfaa7295 |
|
21-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: extent fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d97e63b6 |
|
20-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early extent mapping support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
74123bd7 |
|
02-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Commenting/cleanup Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
eb60ceac |
|
02-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add backing store, memory management Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4920c9ac |
|
26-Jan-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Faster deletes, add Makefile and kerncompat Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
be0e5c09 |
|
26-Jan-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Initial checkin, basic working tree code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|