#
5693a128 |
|
26-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: add forward declarations and headers, part 3 Do a cleanup in the rest of the headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
91701bdf |
|
12-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: make btrfs_error_unpin_extent_range() return void This helper is used in transaction abort or cleanup context and the callers cannot handle all errors, only do best effort. btrfs_cleanup_one_transaction btrfs_destroy_delayed_refs btrfs_error_unpin_extent_range btrfs_destroy_pinned_extent btrfs_error_unpin_extent_range Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6140ba8a |
|
06-Dec-2023 |
David Sterba <dsterba@suse.com> |
btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree The radix-tree has been superseded by the xarray (https://lwn.net/Articles/745073), this patch converts the btrfs_root::delayed_nodes, the APIs are used in a simple way. First idea is to do xa_insert() but this would require GFP_ATOMIC allocation which we want to avoid if possible. The preload mechanism of radix-tree can be emulated within the xarray API. - xa_reserve() with GFP_NOFS outside of the lock, the reserved entry is inserted atomically at most once - xa_store() under a lock, in case something races in we can detect that and xa_load() returns a valid pointer All uses of xa_load() must check for a valid pointer in case they manage to get between the xa_reserve() and xa_store(), this is handled in btrfs_get_delayed_node(). Otherwise the functionality is equivalent, xarray implements the radix-tree and there should be no performance difference. The patch continues the efforts started in 253bf57555e451 ("btrfs: turn delayed_nodes_tree into an XArray") and fixes the problems with locking and GFP flags 088aea3b97e0ae ("Revert "btrfs: turn delayed_nodes_tree into an XArray""). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
46524fab |
|
20-Nov-2023 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused btrfs_root::type Looks like the struct member was added in 2007 in 2.6.29 in commit 87ee04eb0f2f ("Btrfs: Add simple stripe size parameter") but hasn't been used at all since. So let's remove it. This was found by tool https://github.com/jirislaby/clang-struct, then build tested after removing the struct member. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6e5de50f |
|
19-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use bool for return type of btrfs_block_can_be_shared() Currently btrfs_block_can_be_shared() returns an int that is used as a boolean. Since it all it needs is to return true or false, and it can't return errors for example, change the return type from int to bool to make it a bit more readable and obvious. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6000d931 |
|
18-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove log_extents_lock and logged_list from struct btrfs_root The logged_list[2] and log_extents_lock[2] members of struct btrfs_root are no longer used, their last use was removed in commit 5636cf7d6dc8 ("btrfs: remove the logged extents infrastructure"). So remove these fields. This reduces the size of struct btrfs_root, on a release kernel, from 1392 bytes down to 1352 bytes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6008859b |
|
04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing log_transid Currently the log_transid field of a root is always modified while holding the root's log_mutex locked. Most readers of a root's log_transid are also holding the root's log_mutex locked, however there is one exception which is btrfs_set_inode_last_trans() where we don't take the lock to avoid blocking several operations if log syncing is happening in parallel. Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f9850787 |
|
04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing last_log_commit Currently, the last_log_commit of a root can be accessed concurrently without any lock protection. Readers can be calling btrfs_inode_in_log() early in a fsync call, which reads a root's last_log_commit, while a writer can change the last_log_commit while a log tree if being synced, at btrfs_sync_log(). Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_log_commit field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6422b4cd |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move btrfs_realloc_node() from ctree.c into defrag.c btrfs_realloc_node() is only used by the defrag code. Nowadays we have a defrag.c file, so move it, and its helper close_blocks(), into defrag.c. During the move also do a few minor cosmetic changes: 1) Change the return value of close_blocks() from int to bool; 2) Use SZ_32K instead of 32768 at close_blocks(); 3) Make some variables const in btrfs_realloc_node(), 'blocksize' and 'end_slot'; 4) Get rid of 'parent_nritems' variable, in both places where it was used it could be replaced by calling btrfs_header_nritems(parent); 5) Change the type of a couple variables from int to bool; 6) Rename variable 'err' to 'ret', as that's the most common name we use to track the return value of a function; 7) Move some variables from the top scope to the scope of the for loop where they are used. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
79d25df0 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: export comp_keys() from ctree.c as btrfs_comp_keys() Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a later patch we can move out defrag specific code from ctree.c into defrag.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95f93bc4 |
|
26-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename and export __btrfs_cow_block() Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is to allow to move defrag specific code out of ctree.c and into defrag.c in one of the next patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2672a051 |
|
28-Jun-2023 |
Boris Burkov <boris@bur.io> |
btrfs: track data relocation with simple quota Relocation data allocations are quite tricky for simple quotas. The basic data relocation sequence is (ignoring details that aren't relevant to this fix): - create a fake relocation data fs root - create a fake relocation inode in that root - for each data extent: - preallocate a data extent on behalf of the fake inode - copy over the data - for each extent - swap the refs so that the original file extent now refers to the new extent item - drop the fake root, dropping its refs on the old extents, which lets us delete them. Done naively, this results in storing an extent item in the extent tree whose owner_ref points at the relocation data root and a no-op squota recording, since the reloc root is not a legit fstree. So far, that's OK. The problem comes when you do the swap, and leave an extent item owned by this bogus root as the real permanent extents of the file. If the file then drops that ref, we free it and no-op account that against the fake relocation root. Essentially, this means that relocation is simple quota "extent laundering", since we re-own the extents into a fake root. Simple quotas very intentionally doesn't have a mechanism for transferring ownership of extents, as that is exactly the complicated thing we are trying to avoid with the new design. Further, it cannot be correctly done in this case, since at the time you create the new "real" refs, there is no way to know which was the original owner before relocation unless we track it. Therefore, it makes more sense to trick the preallocation to handle relocation as a special case and note the proper owner ref from the beginning. That way, we never write out an extent item without the correct owner ref that it will eventually have. This could be done by wiring a special root parameter all the way through the allocation code path, but to avoid that special case touching all the code, take advantage of the serial nature of relocation to store the src root on the relocation root object. Then when we finish the prealloc, if it happens to be this case, prepare the delayed ref appropriately. We must also add logic to handle relocating adjacent extents with different owning roots. Those cannot be preallocated together in a cluster as it would lose the separate ownership information. This is obviously a smelly bit of code, but I think it is the best solution to the problem, given the relocation implementation. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50564b65 |
|
12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2a3a1dd9 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove extraneous includes from ctree.h We don't need any of these includes in the ctree.h header file for the header file itself, remove them to clean up ctree.h a little bit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b9e6a15 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_name_hash to dir-item.h This is related to the name hashing for dir items, move it into dir-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
98e4f060 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_extref_hash into inode-item.h Ideally this would be un-inlined, but that is a cleanup for later. For now move this into inode-item.h, which is where the extref code lives. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
03e86348 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove btrfs_crc32c wrapper This simply sends the same arguments into crc32c(), and is just used in a few places. Remove this wrapper and directly call crc32c() in these instances. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
102f2640 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_crc32c_final into free-space-cache.c This is the only place this helper is used, take it out of ctree.h and move it into free-space-cache.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eb96e221 |
|
19-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix unwritten extent buffer after snapshotting a new subvolume When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent buffer that is referenced by the snapshot, resulting in IO errors due to checksum failures when trying to read the extent buffer later from disk. A sequence of steps that leads to this is the following: 1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical address 36007936, for the leaf/root of a new subvolume that has an ID of 291. We mark the extent buffer as dirty, and at this point the subvolume tree has a single node/leaf which is also its root (level 0); 2) We no longer commit the transaction used to create the subvolume at create_subvol(). We used to, but that was recently removed in commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create"); 3) The transaction used to create the subvolume has an ID of 33, so the extent buffer 36007936 has a generation of 33; 4) Several updates happen to subvolume 291 during transaction 33, several files created and its tree height changes from 0 to 1, so we end up with a new root at level 1 and the extent buffer 36007936 is now a leaf of that new root node, which is extent buffer 36048896. The commit root remains as 36007936, since we are still at transaction 33; 5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and we end up at transaction.c:create_pending_snapshot(), in the critical section of a transaction commit. There we COW the root of subvolume 291, which is extent buffer 36048896. The COW operation returns extent buffer 36048896, since there's no need to COW because the extent buffer was created in this transaction and it was not written yet. The we call btrfs_copy_root() against the root node 36048896. During this operation we allocate a new extent buffer to turn into the root node of the snapshot, copy the contents of the root node 36048896 into this snapshot root extent buffer, set the owner to 292 (the ID of the snapshot), etc, and then we call btrfs_inc_ref(). This will create a delayed reference for each leaf pointed by the root node with a reference root of 292 - this includes a reference for the leaf 36007936. After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state. Then we call btrfs_insert_dir_item(), to create the directory entry in in the tree of subvolume 291 that points to the snapshot. This ends up needing to modify leaf 36007936 to insert the respective directory items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state, we need to COW the leaf. We end up at btrfs_force_cow_block() and then at update_ref_for_cow(). At update_ref_for_cow() we call btrfs_block_can_be_shared() which returns false, despite the fact the leaf 36007936 is shared - the subvolume's root and the snapshot's root point to that leaf. The reason that it incorrectly returns false is because the commit root of the subvolume is extent buffer 36007936 - it was the initial root of the subvolume when we created it. So btrfs_block_can_be_shared() which has the following logic: int btrfs_block_can_be_shared(struct btrfs_root *root, struct extent_buffer *buf) { if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && buf != root->node && buf != root->commit_root && (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1; return 0; } Returns false (0) since 'buf' (extent buffer 36007936) matches the root's commit root. As a result, at update_ref_for_cow(), we don't check for the number of references for extent buffer 36007936, we just assume it's not shared and therefore that it has only 1 reference, so we set the local variable 'refs' to 1. Later on, in the final if-else statement at update_ref_for_cow(): static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, struct extent_buffer *cow, int *last_ref) { (...) if (refs > 1) { (...) } else { (...) btrfs_clear_buffer_dirty(trans, buf); *last_ref = 1; } } So we mark the extent buffer 36007936 as not dirty, and as a result we don't write it to disk later in the transaction commit, despite the fact that the snapshot's root points to it. Attempting to access the leaf or dumping the tree for example shows that the extent buffer was not written: $ btrfs inspect-internal dump-tree -t 292 /dev/sdb btrfs-progs v6.2.2 file tree key (292 ROOT_ITEM 33) node 36110336 level 1 items 2 free space 119 generation 33 owner 292 node 36110336 flags 0x1(WRITTEN) backref revision 1 checksum stored a8103e3e checksum calced a8103e3e fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07 key (256 INODE_ITEM 0) block 36007936 gen 33 key (257 EXTENT_DATA 0) block 36052992 gen 33 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 total bytes 107374182400 bytes used 38572032 uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 The respective on disk region is full of zeroes as the device was trimmed at mkfs time. Obviously 'btrfs check' also detects and complains about this: $ btrfs check /dev/sdb Opening filesystem to check... Checking filesystem on /dev/sdb UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 generation: 33 (33) [1/7] checking root items [2/7] checking extents checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 owner ref check failed [36007936 4096] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 The following tree block(s) is corrupted in tree 292: tree block bytenr: 36110336, level: 1, node key: (256, 1, 0) root 292 root dir 256 not found ERROR: errors found in fs roots found 38572032 bytes used, error(s) found total csum bytes: 16048 total tree bytes: 1265664 total fs tree bytes: 1118208 total extent tree bytes: 65536 btree space waste bytes: 562598 file data blocks allocated: 65978368 referenced 36569088 Fix this by updating btrfs_block_can_be_shared() to consider that an extent buffer may be shared if it matches the commit root and if its generation matches the current transaction's generation. This can be reproduced with the following script: $ cat test.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi # Use a filesystem with a 64K node size so that we have the same node # size on every machine regardless of its page size (on x86_64 default # node size is 16K due to the 4K page size, while on PPC it's 64K by # default). This way we can make sure we are able to create a btree for # the subvolume with a height of 2. mkfs.btrfs -f -n 64K $DEV mount $DEV $MNT btrfs subvolume create $MNT/subvol # Create a few empty files on the subvolume, this bumps its btree # height to 2 (root node at level 1 and 2 leaves). for ((i = 1; i <= 300; i++)); do echo -n > $MNT/subvol/file_$i done btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap umount $DEV btrfs check $DEV Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment): $ ./test.sh Create subvolume '/mnt/sdi/subvol' Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap' Opening filesystem to check... Checking filesystem on /dev/sdi UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1 [1/7] checking root items [2/7] checking extents parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure owner ref check failed [30539776 65536] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0) Wrong generation of child node/leaf, wanted: 5, have: 7 root 257 root dir 256 not found ERROR: errors found in fs roots found 917504 bytes used, error(s) found total csum bytes: 0 total tree bytes: 851968 total fs tree bytes: 393216 total extent tree bytes: 65536 btree space waste bytes: 736550 file data blocks allocated: 0 referenced 0 A test case for fstests will follow soon. Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create") CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b378f6a |
|
12-Aug-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix infinite directory reads The readdir implementation currently processes always up to the last index it finds. This however can result in an infinite loop if the directory has a large number of entries such that they won't all fit in the given buffer passed to the readdir callback, that is, dir_emit() returns a non-zero value. Because in that case readdir() will be called again and if in the meanwhile new directory entries were added and we still can't put all the remaining entries in the buffer, we keep repeating this over and over. The following C program and test script reproduce the problem: $ cat /mnt/readdir_prog.c #include <sys/types.h> #include <dirent.h> #include <stdio.h> int main(int argc, char *argv[]) { DIR *dir = opendir("."); struct dirent *dd; while ((dd = readdir(dir))) { printf("%s\n", dd->d_name); rename(dd->d_name, "TEMPFILE"); rename("TEMPFILE", dd->d_name); } closedir(dir); } $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV &> /dev/null #mkfs.xfs -f $DEV &> /dev/null #mkfs.ext4 -F $DEV &> /dev/null mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= 2000; i++)); do echo -n > $MNT/testdir/file_$i done cd $MNT/testdir /mnt/readdir_prog cd /mnt umount $MNT This behaviour is surprising to applications and it's unlike ext4, xfs, tmpfs, vfat and other filesystems, which always finish. In this case where new entries were added due to renames, some file names may be reported more than once, but this varies according to each filesystem - for example ext4 never reported the same file more than once while xfs reports the first 13 file names twice. So change our readdir implementation to track the last index number when opendir() is called and then make readdir() never process beyond that index number. This gives the same behaviour as ext4. Reported-by: Rob Landley <rob@landley.net> Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681 CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
751a2761 |
|
08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr() At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. This implies btrfs_del_ptr() return an int instead of void, and making all callers check for returned errors. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
016f9d0b |
|
29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename del_ptr to btrfs_del_ptr and export it This exists internal to ctree.c, however btrfs check needs to use it for some of its operations. I'd rather not duplicate that code inside of btrfs check as this is low level and I want to keep this code in one place, so rename the function to btrfs_del_ptr and export it so that it can be used inside of btrfs-progs safely. Add a comment to make sure this doesn't get removed by a future cleanup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b3cbfb0d |
|
29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a btrfs_csum_type_size helper This is needed in btrfs-progs for the tools that convert the checksum types for file systems and a few other things. We don't have it in the kernel as we just want to get the size for the super blocks type. However I don't want to have to manually add this every time we sync ctree.c into btrfs-progs, so add the helper in the kernel with a note so it doesn't get removed by a later cleanup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6c75a589 |
|
27-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: print-tree: pass const extent buffer pointer Since print-tree infrastructure only prints the content of a tree block, we can make them to accept const extent buffer pointer. This removes a forced type convert in extent-tree, where we convert a const extent buffer pointer to regular one, just to avoid compiler warning. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f469c8bd |
|
12-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: unexport btrfs_prev_leaf() btrfs_prev_leaf() is not used outside ctree.c, so there's no need to export it at ctree.h - just make it static at ctree.c and move its definition above btrfs_search_slot_for_read(), since that function calls it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fdf8d595 |
|
23-Feb-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: open code btrfs_bin_search() btrfs_bin_search() is a simple wrapper that searches for the whole slots by calling btrfs_generic_bin_search() with the starting slot/first_slot preset to 0. This simple wrapper can be open coded as btrfs_bin_search(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a724f313 |
|
08-Feb-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do unsigned integer division in the extent buffer binary search loop In the search loop of the binary search function, we are doing a division by 2 of the sum of the high and low slots. Because the slots are integers, the generated assembly code for it is the following on x86_64: 0x00000000000141f1 <+145>: mov %eax,%ebx 0x00000000000141f3 <+147>: shr $0x1f,%ebx 0x00000000000141f6 <+150>: add %eax,%ebx 0x00000000000141f8 <+152>: sar %ebx It's a few more instructions than a simple right shift, because signed integer division needs to round towards zero. However we know that slots can never be negative (btrfs_header_nritems() returns an u32), so we can instead use unsigned types for the low and high slots and therefore use unsigned integer division, which results in a single instruction on x86_64: 0x00000000000141f0 <+144>: shr %ebx So use unsigned types for the slots and therefore unsigned division. This is part of a small patchset comprised of the following two patches: btrfs: eliminate extra call when doing binary search on extent buffer btrfs: do unsigned integer division in the extent buffer binary search loop The following fs_mark test was run on a non-debug kernel (Debian's default kernel config) before and after applying the patchset: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 6 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Results before applying patchset: FSUse% Count Size Files/sec App Overhead 2 1200000 0 174472.0 11549868 4 2400000 0 253503.0 11694618 4 3600000 0 257833.1 11611508 6 4800000 0 247089.5 11665983 6 6000000 0 211296.1 12121244 10 7200000 0 187330.6 12548565 Results after applying patchset: FSUse% Count Size Files/sec App Overhead 2 1200000 0 207556.0 11393252 4 2400000 0 266751.1 11347909 4 3600000 0 274397.5 11270058 6 4800000 0 259608.4 11442250 6 6000000 0 238895.8 11635921 8 7200000 0 211942.2 11873825 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b00dfff |
|
08-Feb-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: eliminate extra call when doing binary search on extent buffer The function btrfs_bin_search() is just a wrapper around the function generic_bin_search(), which passes the same arguments plus a default low slot with a value of 0. This adds an unnecessary extra function call, since btrfs_bin_search() is not static. So improve on this by making btrfs_bin_search() an inline function that calls generic_bin_search(), renaming the later to btrfs_generic_bin_search() and exporting it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e6c40eb |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the csum helpers into ctree.h These got moved because of copy+paste, but this code exists in ctree.c, so move the declarations back into ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b48adda |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move eb offset helpers into extent_io.h These are very specific to how the extent buffer is defined, so this differs between btrfs-progs and the kernel. Make things easier by moving these helpers into extent_io.h so we don't have to worry about this when syncing ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6bfd0ffa |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move file_extent_item helpers into file-item.h These helpers use functions that are in multiple places, which makes it tricky to sync them into btrfs-progs. Move them to file-item.h and then include file-item.h in places that use these helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1fe5ebc4 |
|
15-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move root helpers back into ctree.h These accidentally got brought into accessors.h, but belong with the btrfs_root definitions which are currently in ctree.h. Move these to make it easier to sync accessors.[ch] into btrfs-progs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3c32c721 |
|
11-Nov-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use cached state when looking for delalloc ranges with lseek During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent, we will look for delalloc in that range, and one of the things we do for that is to find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using calls to count_range_bits(). Typically there's a single, or few, searches in the io_tree for delalloc per lseek call. However it's common for applications to keep calling lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in a file, read the extents and skip holes in order to avoid unnecessary IO and save disk space by preserving holes. One popular user is the cp utility from coreutils. Starting with coreutils 9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a file. Before 9.0, it used fiemap to figure out where holes and extents are in the source file. Another popular user is the tar utility when used with the --sparse / -S option to detect and preserve holes. Given that the pattern is to keep calling lseek with a start offset that matches the returned offset from the previous lseek call, we can benefit from caching the last extent state visited in count_range_bits() and use it for the next count_range_bits() from the next lseek call. Example, the following strace excerpt from running tar: $ strace tar cJSvf foo.tar.xz qemu_disk_file.raw (...) lseek(5, 125019574272, SEEK_HOLE) = 125024989184 lseek(5, 125024989184, SEEK_DATA) = 125024993280 lseek(5, 125024993280, SEEK_HOLE) = 125025239040 lseek(5, 125025239040, SEEK_DATA) = 125025255424 lseek(5, 125025255424, SEEK_HOLE) = 125025353728 lseek(5, 125025353728, SEEK_DATA) = 125025357824 lseek(5, 125025357824, SEEK_HOLE) = 125026766848 lseek(5, 125026766848, SEEK_DATA) = 125026770944 lseek(5, 125026770944, SEEK_HOLE) = 125027053568 (...) Shows that pattern, which is the same as with cp from coreutils 9.0+. So start using a cached state for the delalloc searches in lseek, and store it in struct file's private data so that it can be reused across lseek calls. This change is part of a patchset that is comprised of the following patches: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek The following test was run before and after applying the whole patchset: $ cat test-cp.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh # coreutils 8.32, cp uses fiemap to detect holes and extents #CP_PROG=/usr/bin/cp # coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT FILE_SIZE=$((1024 * 1024 * 1024)) echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M" # Create a very sparse file, where each extent has a length of 4K and # is preceded by a 4K hole and followed by another 4K hole. start=$(date +%s%N) echo -n > $MNT/foobar for ((off = 0; off < $FILE_SIZE; off += 8192)); do xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null echo -ne "\r$off / $FILE_SIZE ..." done end=$(date +%s%N) echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)" start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds with data/metadata cached and delalloc" # Flush all delalloc. sync start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds with data/metadata cached and no delalloc" # Unmount and mount again to test the case without any metadata # loaded in memory. umount $MNT mount $DEV $MNT start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds without data/metadata cached and no delalloc" umount $MNT The results, running on a box with a non-debug kernel (Debian's default kernel config), were the following: 128M file, before patchset: cp took 16574 milliseconds with data/metadata cached and delalloc cp took 122 milliseconds with data/metadata cached and no delalloc cp took 20144 milliseconds without data/metadata cached and no delalloc 128M file, after patchset: cp took 6277 milliseconds with data/metadata cached and delalloc cp took 109 milliseconds with data/metadata cached and no delalloc cp took 210 milliseconds without data/metadata cached and no delalloc 512M file, before patchset: cp took 14369 milliseconds with data/metadata cached and delalloc cp took 429 milliseconds with data/metadata cached and no delalloc cp took 88034 milliseconds without data/metadata cached and no delalloc 512M file, after patchset: cp took 12106 milliseconds with data/metadata cached and delalloc cp took 427 milliseconds with data/metadata cached and no delalloc cp took 824 milliseconds without data/metadata cached and no delalloc 1G file, before patchset: cp took 10074 milliseconds with data/metadata cached and delalloc cp took 886 milliseconds with data/metadata cached and no delalloc cp took 181261 milliseconds without data/metadata cached and no delalloc 1G file, after patchset: cp took 3320 milliseconds with data/metadata cached and delalloc cp took 880 milliseconds with data/metadata cached and no delalloc cp took 1801 milliseconds without data/metadata cached and no delalloc Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aa5d3003 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move orphan prototypes into orphan.h Move these out of ctree.h into orphan.h to cut down on code in ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c03b2207 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move super prototypes into super.h Move these out of ctree.h into super.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a6b4daf |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move CONFIG_BTRFS_FS_RUN_SANITY_TESTS checks to fs.h We already have a few of these in fs.h, move the remaining checks out of ctree.h into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c11adcc |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move verity prototypes into verity.h Move these out of ctree.h into verity.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
77407dc0 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move dev-replace prototypes into dev-replace.h We already have a dev-replace.h, simply move these prototypes and helpers into dev-replace.h where they belong. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fc6822c |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move scrub prototypes into scrub.h Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67707479 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move relocation prototypes into relocation.h Move these out of ctree.h into relocation.h to cut down on code in ctree.h Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
33cf97a7 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move acl prototypes into acl.h Move these out of ctree.h into acl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc68414c |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the snapshot drop related prototypes to extent-tree.h These belong in extent-tree.h, they were missed because they were not grouped with the other extent-tree.c prototypes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b538a271 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the 32bit warn defines into messages.h The code for these functions are in messages.c, move the defines and prototypes to messages.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
af142b6f |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move file prototypes to file.h Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7572dec8 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move ioctl prototypes into ioctl.h Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7a03b52 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move uuid tree prototypes to uuid-tree.h Move these out of ctree.h into uuid-tree.h to cut down on the code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c8ede16 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move file-item prototypes into their own header Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2b39277 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move dir-item prototypes into dir-item.h Move these prototypes out of ctree.h and into their own header file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
59b818e0 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move defrag related prototypes to their own header Now that the defrag code is all in one file, create a defrag.h and move all the defrag related prototypes and helper out of ctree.h and into defrag.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2885fd63 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move inode prototypes to btrfs_inode.h I initially wanted to make a new header file for this, but these prototypes do naturally fit into btrfs_inode.h. If we want to extract vfs from pure btrfs code in the future we may need to split this up, but btrfs_inode embeds the vfs_inode, so it makes sense to put the prototypes in this header for now. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b31bed17 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_chunk_item_size out of ctree.h This is used by the volumes code and the tree checker code. We want to maintain inline however, so simply move it to volumes.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
911bd75a |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove unused function prototypes I wrote the following coccinelle script to find function declarations that didn't have the corresponding code for them @funcproto@ identifier func; type T; position p0; @@ T func@p0(...); @funccode@ identifier funcproto.func; position p1; @@ func@p1(...) { ... } @script:python depends on !funccode@ p0 << funcproto.p0; @@ print("Proto with no function at %s:%s" % (p0[0].file, p0[0].line)) and ran it against btrfs, which identified the 4 function prototypes I've removed in this patch. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
45c40c8f |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move root tree prototypes to their own header Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d2049a2 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: delete unused function prototypes in ctree.h This batch of prototypes no longer have code associated with them, so remove them. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2839c2c1 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move delalloc space related prototypes to delalloc-space.h These exist in delalloc-space.c, move them from ctree.h into delalloc-space.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0231804 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move extent-tree helpers into their own header file Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e2f13b34 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_account_ro_block_groups_free_space into space-info.c This was prototyped in ctree.h and the code existed in extent-tree.c, but it's space-info related so move it into space-info.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8483d402 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove extra space info prototypes in ctree.h These are defined already in space-info.h, remove them from ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
13d925c1 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: minor whitespace in ctree.h We've accumulated some whitespace problems in ctree.h, clean these up. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eb33a4d6 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the lockdep helpers into locking.h These more naturally fit in with the locking related code, and they're all defines so they can easily go anywhere, move them out of ctree.h into locking.h Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a56159d4 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_fs_info declarations into fs.h Now that we have a lot of the fs_info related helpers and stuff isolated, copy these over to fs.h out of ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6db75318 |
|
19-Oct-2022 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: use struct fscrypt_str instead of struct qstr While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab3c5c18 |
|
19-Oct-2022 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: setup qstr from dentrys using fscrypt helper Most places where we get a struct qstr, we are doing so from a dentry. With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt provides a helper to convert a dentry name to the appropriate disk name if necessary. Convert each of the dentry name accesses to use fscrypt_setup_filename(), then convert the resulting fscrypt_name back to an unencrypted qstr. This does not work for nokey names, but the specific locations that could spawn nokey names are noted. At present, since there are no encrypted directories, nothing goes down the filename encryption paths. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e43eec81 |
|
19-Oct-2022 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: use struct qstr instead of name and namelen pairs Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant. Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls to these calls without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted. This change has positive effect on stack consumption, frame of many functions is reduced but this will be used in the future for fscrypt related structures. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9c83077 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove temporary btrfs_map_token declaration in ctree.h This was added while I was moving this code to its new home, it can be removed now. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07e81dc9 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad1ac501 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_map_token to accessors This is specific to the item-accessor code, move it out of ctree.h into accessor.h/.c and then update the users to include the new header file. This un-inlines btrfs_init_map_token, however this is only called once per function so it's not critical to be inlined. This also saves 904 bytes of code on a release build. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d83eb482 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the compat/incompat flag masks to fs.h This is fs wide information, move it out of ctree.h into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
55e5cfd3 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove fs_info::pending_changes and related code Now that we're not using this code anywhere we can remove it as well as the member from fs_info. We don't have any mount options or on/off features that would utilize the pending infrastructure, the last one was inode_cache. There was a patchset [1] to enable some features from sysfs that would break things if it would be set immediately. In case we'll need that kind of logic again the patch can be reverted, but for the current use it can be replaced by the single state bit to do the commit. [1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7966a6b5 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move fs_info::flags enum to fs.h These definitions are fs wide, take them out of ctree.h and put them in fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc97a410 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move mount option definitions to fs.h These are fs wide definitions and helpers, move them out of ctree.h and into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ec8eb376 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h We're going to use fs.h to hold fs wide related helpers and definitions, move the FS_STATE enum and related helpers to fs.h, and then update all files that need these definitions to include fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b569ea0 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the printk helpers out of ctree.h We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e118578a |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move assert helpers out of ctree.h These call functions that aren't defined in, or will be moved out of, ctree.h Move them to super.c where the other assert/error message code is defined. Drop the __noreturn attribute for btrfs_assertfail as objtool does not like it and fails with warnings like fs/btrfs/dir-item.o: warning: objtool: .text.unlikely: unexpected end of section fs/btrfs/xattr.o: warning: objtool: btrfs_setxattr() falls through to next function btrfs_setxattr_trans.cold() fs/btrfs/xattr.o: warning: objtool: .text.unlikely: unexpected end of section Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7f13d42 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
63a7cb13 |
|
26-Jul-2022 |
David Sterba <dsterba@suse.com> |
btrfs: auto enable discard=async when possible There's a request to automatically enable async discard for capable devices. We can do that, the async mode is designed to wait for larger freed extents and is not intrusive, with limits to iops, kbps or latency. The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . The automatic selection is done if there's at least one discard capable device in the filesystem (not capable devices are skipped). Mounting with any other discard option will honor that option, notably mounting with nodiscard will keep it disabled. Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7321b76 |
|
09-Sep-2022 |
David Sterba <dsterba@suse.com> |
btrfs: convert BTRFS_ILOCK-* defines to enum bit Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a66eda3 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h These are wrapped in CONFIG_FS_VERITY, but we can have the definitions without verity enabled. Move these definitions up with the other accessor helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
890d2b1a |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_next_old_item into ctree.c This uses btrfs_header_nritems, which I will be moving out of ctree.h. In order to avoid needing to include the relevant header in ctree.h, simply move this helper function into ctree.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename parameters ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eda517fd |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move free space cachep's out of ctree.h This is local to the free-space-cache.c code, remove it from ctree.h and inode.c, create new init/exit functions for the cachep, and move it locally to free-space-cache.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
226463d7 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_path_cachep out of ctree.h This is local to the ctree code, remove it from ctree.h and inode.c, create new init/exit functions for the cachep, and move it locally to ctree.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
956504a3 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move trans_handle_cachep out of ctree.h This is local to the transaction code, remove it from ctree.h and inode.c, create new helpers in the transaction to handle the init work and move the cachep locally to transaction.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f1e5c618 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move flush related definitions to space-info.h This code is used in space-info.c, move the definitions to space-info.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
390d89cc |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move discard stat defs to free-space-cache.h These definitions are used for discard statistics, move them out of ctree.h and put them in free-space-cache.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed4c491a |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move BTRFS_MAX_MIRRORS into scrub.c This is only used locally in scrub.c, move it out of ctree.h into scrub.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad4b63ca |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move maximum limits to btrfs_tree.h We have maximum link and name length limits, move these to btrfs_tree.h as they're on disk limitations. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4300c58f |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs on-disk definitions out of ctree.h The bulk of our on-disk definitions exist in btrfs_tree.h, which user space can use. Keep things consistent and move the rest of the on disk definitions out of ctree.h into btrfs_tree.h. Note I did have to update all u8's to __u8, but otherwise this is a strict copy and paste. Most of the definitions are mainly for internal use and are not guaranteed stable public API and may change as we need. Compilation failures by user applications can happen. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4ce76e8e |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove unused BTRFS_IOPRIO_READA The last user of this definition was removed in patch f26c92386028 ("btrfs: remove reada infrastructure") so we can remove this definition. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ea206640 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove unused BTRFS_TOTAL_BYTES_PINNED_BATCH This hasn't been used since 138a12d86574 ("btrfs: rip out btrfs_space_info::total_bytes_pinned") so it is safe to remove. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d60d956e |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove unused set/clear_pending_info helpers The last users of these helpers were removed in 5297199a8bca ("btrfs: remove inode number cache feature") so delete these helpers. The point was for mount options that were applicable after transaction commit so they could not be applied immediately. We don't have such options anymore and if we do the patch can be reverted. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
138060ba |
|
23-Sep-2022 |
Christian Brauner <brauner@kernel.org> |
fs: pass dentry to set acl method The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. Since some filesystem rely on the dentry being available to them when setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode operation. But since ->set_acl() is required in order to use the generic posix acl xattr handlers filesystems that do not implement this inode operation cannot use the handler and need to implement their own dedicated posix acl handlers. Update the ->set_acl() inode method to take a dentry argument. This allows all filesystems to rely on ->set_acl(). As far as I can tell all codepaths can be switched to rely on the dentry instead of just the inode. Note that the original motivation for passing the dentry separate from the inode instead of just the dentry in the xattr handlers was because of security modules that call security_d_instantiate(). This hook is called during d_instantiate_new(), d_add(), __d_instantiate_anon(), and d_splice_alias() to initialize the inode's security context and possibly to set security.* xattrs. Since this only affects security.* xattrs this is completely irrelevant for posix acls. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
8bb808c6 |
|
03-Nov-2022 |
David Sterba <dsterba@suse.com> |
btrfs: don't print stack trace when transaction is aborted due to ENOMEM Add ENOMEM among the error codes that don't print stack trace on transaction abort. We've got several reports from syzbot that detects stacks as errors but caused by limiting memory. As this is an artificial condition we don't need to know where exactly the error happens, the abort and error cleanup will continue like e.g. for EIO. As the transaction aborts code needs to be inline in a lot of code, the implementation cases about minimal bloat. The error codes are in a separate function and the WARN uses the condition directly. This increases the code size by 571 bytes on release build. Alternatives considered: add -ENOMEM among the errors, this increases size by 2340 bytes, various attempts to combine the WARN and helper calls, increase by 700 or more bytes. Example syzbot reports (error -12): - https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e - https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417 Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8184620a |
|
28-Oct-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix lost file sync on direct IO write with nowait and dsync iocb When doing a direct IO write using a iocb with nowait and dsync set, we end up not syncing the file once the write completes. This is because we tell iomap to not call generic_write_sync(), which would result in calling btrfs_sync_file(), in order to avoid a deadlock since iomap can call it while we are holding the inode's lock and btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens only if the write happens synchronously, when iomap_dio_rw() calls iomap_dio_complete() before it returns. Instead we do the sync ourselves at btrfs_do_write_iter(). For a nowait write however we can end up not doing the sync ourselves at at btrfs_do_write_iter() because the write could have been queued, and therefore we get -EIOCBQUEUED returned from iomap in such case. That makes us skip the sync call at btrfs_do_write_iter(), as we don't do it for any error returned from btrfs_direct_write(). We can't simply do the call even if -EIOCBQUEUED is returned, since that would block the task waiting for IO, both for the data since there are bios still in progress as well as potentially blocking when joining a log transaction and when syncing the log (writing log trees, super blocks, etc). So let iomap do the sync call itself and in order to avoid deadlocks for the case of synchronous writes (without nowait), use __iomap_dio_rw() and have ourselves call iomap_dio_complete() after unlocking the inode. A test case will later be sent for fstests, after this is fixed in Linus' tree. Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes") Reported-by: Марк Коренберг <socketpair@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/ CC: stable@vger.kernel.org # 6.0+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c0c8cfc |
|
19-Sep-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move btrfs_drop_extent_cache() to extent_map.c The function btrfs_drop_extent_cache() doesn't really belong at file.c because what it does is drop a range of extent maps for a file range. It directly allocates and manipulates extent maps, by dropping, splitting and replacing them in an extent map tree, so it should be located at extent_map.c, where all manipulations of an extent map tree and its extent maps are supposed to be done. So move it out of file.c and into extent_map.c. Additionally do the following changes: 1) Rename it into btrfs_drop_extent_map_range(), as this makes it more clear about what it does. The term "cache" is a bit confusing as it's not widely used, "extent maps" or "extent mapping" is much more common; 2) Change its 'skip_pinned' argument from int to bool; 3) Turn several of its local variables from int to bool, since they are used as booleans; 4) Move the declaration of some variables out of the function's main scope and into the scopes where they are used; 5) Remove pointless assignment of false to 'modified' early in the while loop, as later that variable is set and it's not used before that second assignment; 6) Remove checks for NULL before calling free_extent_map(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3050dfa6 |
|
22-Sep-2022 |
Jeff Layton <jlayton@kernel.org> |
btrfs: remove stale prototype of btrfs_write_inode This function no longer exists, was removed in 3c4276936f6f ("Btrfs: fix btrfs_write_inode vs delayed iput deadlock"). Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
80f9d241 |
|
12-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make btrfs_check_nocow_lock nowait compatible Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add a nowait flag to btrfs_check_nocow_lock so it can be used by the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
26ce9114 |
|
12-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make can_nocow_extent nowait compatible If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
857bc13f |
|
12-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: implement a nowait option for tree searches For NOWAIT IOCBs we'll need a way to tell search to not wait on locks or anything. Accomplish this by adding a path->nowait flag that will use trylocks and skip reading of metadata, returning -EAGAIN in either of these cases. For now we only need this for reads, so only the read side is handled. Add an ASSERT() to catch anybody trying to use this for writes so they know they'll have to implement the write side. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d9d88fde |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the fs_info related helpers closer to fs_info in ctree.h This is purely cosmetic, to make it straightforward to copy and paste the definition and helpers from ctree.h into fs.h. These are helpers that act directly on the fs_info, and were scattered throughout ctree.h. Move them directly below the fs_info definition to make it easier to move them later. This includes the exclop prototypes, which shares an enum that's used in struct btrfs_fs_info as well. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f119553f |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_csum_ptr to inode.c This helper is only used in inode.c, move it locally to that file instead of defining it in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e75f005 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move fs_info forward declarations to the top of ctree.h In order to make it more straightforward to move the fs_info struct and it's related structures, move the struct declarations to the top of ctree.h. This will make it easier to clean up after the fact. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2103da3b |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_swapfile_pin into volumes.h This isn't a great spot for this, but one of the swapfile helper functions is in volumes.c, so move the struct to volumes.h. In the future when we have better separation of code there will be a more natural spot for this. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2e79e86 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_pinned_by_swapfile prototype into volumes.h This is defined in volumes.c, move the prototype into volumes.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43712116 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_init_async_reclaim_work prototype to space-info.h The code for this helper is in space-info.c, move the prototype to space-info.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c29abab4 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_full_stripe_locks_tree into block-group.h This is actually embedded in struct btrfs_block_group, so move this definition to block-group.h, and then open-code the init of the tree where we init the rest of the block group instead of using a helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16708a88 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_caching_type to block-group.h This is a block group related definition, move it into block-group.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ea1a526 |
|
09-Sep-2022 |
Gaosheng Cui <cuigaosheng1@huawei.com> |
btrfs: remove btrfs_bit_radix_cachep declaration btrfs_bit_radix_cachep has been removed since commit 45c06543afe2 ("Btrfs: remove unused btrfs_bit_radix slab"), so remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
011b46c3 |
|
23-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction() Btrfs qgroup has a long history of bringing performance penalty in btrfs_commit_transaction(). Although we tried our best to migrate such impact, there is still an unsolved call site, btrfs_drop_snapshot(). This function will find the highest shared tree block and modify its extent ownership to do a subvolume/snapshot dropping. Such change will affect the whole subtree, and cause tons of qgroup dirty extents and stall btrfs_commit_transaction(). To avoid such problem, here we introduce a new sysfs interface, /sys/fs/btrfs/<uuid>/qgroups/drop_subptree_threshold, to determine at whether and at which level we should skip qgroup accounting for subtree dropping. The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go through qgroup accounting, to ensure qgroup numbers are kept as consistent as possible. While for performance sensitive cases, add a way to change the values to more reasonable values like 3, to make any subtree, which is at or higher than level 3, to mark qgroup inconsistent and skip the accounting. The cost is obvious, the qgroup number is no longer consistent, but at least performance is more reasonable, and users have the control. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac3c0d36 |
|
01-Sep-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make fiemap more efficient and accurate reporting extent sharedness The current fiemap implementation does not scale very well with the number of extents a file has. This is both because the main algorithm to find out the extents has a high algorithmic complexity and because for each extent we have to check if it's shared. This second part, checking if an extent is shared, is significantly improved by the two previous patches in this patchset, while the first part is improved by this specific patch. Every now and then we get reports from users mentioning fiemap is too slow or even unusable for files with a very large number of extents, such as the two recent reports referred to by the Link tags at the bottom of this change log. To understand why the part of finding which extents a file has is very inefficient, consider the example of doing a full ranged fiemap against a file that has over 100K extents (normal for example for a file with more than 10G of data and using compression, which limits the extent size to 128K). When we enter fiemap at extent_fiemap(), the following happens: 1) Before entering the main loop, we call get_extent_skip_holes() to get the first extent map. This leads us to btrfs_get_extent_fiemap(), which in turn calls btrfs_get_extent(), to find the first extent map that covers the file range [0, LLONG_MAX). btrfs_get_extent() will first search the inode's extent map tree, to see if we have an extent map there that covers the range. If it does not find one, then it will search the inode's subvolume b+tree for a fitting file extent item. After finding the file extent item, it will allocate an extent map, fill it in with information extracted from the file extent item, and add it to the inode's extent map tree (which requires a search for insertion in the tree). 2) Then we enter the main loop at extent_fiemap(), emit the details of the extent, and call again get_extent_skip_holes(), with a start offset matching the end of the extent map we previously processed. We end up at btrfs_get_extent() again, will search the extent map tree and then search the subvolume b+tree for a file extent item if we could not find an extent map in the extent tree. We allocate an extent map, fill it in with the details in the file extent item, and then insert it into the extent map tree (yet another search in this tree). 3) The second step is repeated over and over, until we have processed the whole file range. Each iteration ends at btrfs_get_extent(), which does a red black tree search on the extent map tree, then searches the subvolume b+tree, allocates an extent map and then does another search in the extent map tree in order to insert the extent map. In the best scenario we have all the extent maps already in the extent tree, and so for each extent we do a single search on a red black tree, so we have a complexity of O(n log n). In the worst scenario we don't have any extent map already loaded in the extent map tree, or have very few already there. In this case the complexity is much higher since we do: - A red black tree search on the extent map tree, which has O(log n) complexity, initially very fast since the tree is empty or very small, but as we end up allocating extent maps and adding them to the tree when we don't find them there, each subsequent search on the tree gets slower, since it's getting bigger and bigger after each iteration. - A search on the subvolume b+tree, also O(log n) complexity, but it has items for all inodes in the subvolume, not just items for our inode. Plus on a filesystem with concurrent operations on other inodes, we can block doing the search due to lock contention on b+tree nodes/leaves. - Allocate an extent map - this can block, and can also fail if we are under serious memory pressure. - Do another search on the extent maps red black tree, with the goal of inserting the extent map we just allocated. Again, after every iteration this tree is getting bigger by 1 element, so after many iterations the searches are slower and slower. - We will not need the allocated extent map anymore, so it's pointless to add it to the extent map tree. It's just wasting time and memory. In short we end up searching the extent map tree multiple times, on a tree that is growing bigger and bigger after each iteration. And besides that we visit the same leaf of the subvolume b+tree many times, since a leaf with the default size of 16K can easily have more than 200 file extent items. This is very inefficient overall. This patch changes the algorithm to instead iterate over the subvolume b+tree, visiting each leaf only once, and only searching in the extent map tree for file ranges that have holes or prealloc extents, in order to figure out if we have delalloc there. It will never allocate an extent map and add it to the extent map tree. This is very similar to what was previously done for the lseek's hole and data seeking features. Also, the current implementation relying on extent maps for figuring out which extents we have is not correct. This is because extent maps can be merged even if they represent different extents - we do this to minimize memory utilization and keep extent map trees smaller. For example if we have two extents that are contiguous on disk, once we load the two extent maps, they get merged into a single one - however if only one of the extents is shared, we end up reporting both as shared or both as not shared, which is incorrect. This reproducer triggers that bug: $ cat fiemap-bug.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Create a file with two 256K extents. # Since there is no other write activity, they will be contiguous, # and their extent maps merged, despite having two distinct extents. xfs_io -f -c "pwrite -S 0xab 0 256K" \ -c "fsync" \ -c "pwrite -S 0xcd 256K 256K" \ -c "fsync" \ $MNT/foo # Now clone only the second extent into another file. xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar # Filefrag will report a single 512K extent, and say it's not shared. echo filefrag -v $MNT/foo umount $MNT Running the reproducer: $ ./fiemap-bug.sh wrote 262144/262144 bytes at offset 0 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec) wrote 262144/262144 bytes at offset 262144 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec) linked 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec) Filesystem type is: 9123683e File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 3328.. 3455: 128: last,eof /mnt/sdj/foo: 1 extent found We end up reporting that we have a single 512K that is not shared, however we have two 256K extents, and the second one is shared. Changing the reproducer to clone instead the first extent into file 'bar', makes us report a single 512K extent that is shared, which is algo incorrect since we have two 256K extents and only the first one is shared. This patch is part of a larger patchset that is comprised of the following patches: btrfs: allow hole and data seeking to be interruptible btrfs: make hole and data seeking a lot more efficient btrfs: remove check for impossible block start for an extent map at fiemap btrfs: remove zero length check when entering fiemap btrfs: properly flush delalloc when entering fiemap btrfs: allow fiemap to be interruptible btrfs: rename btrfs_check_shared() to a more descriptive name btrfs: speedup checking for extent sharedness during fiemap btrfs: skip unnecessary extent buffer sharedness checks during fiemap btrfs: make fiemap more efficient and accurate reporting extent sharedness The patchset was tested on a machine running a non-debug kernel (Debian's default config) and compared the tests below on a branch without the patchset versus the same branch with the whole patchset applied. The following test for a large compressed file without holes: $ cat fiemap-perf-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 40G gives 327680 128K file extents (due to compression). xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before patchset: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 3597 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 2107 milliseconds (metadata cached) After patchset: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1214 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 684 milliseconds (metadata cached) That's a speedup of about 3x for both cases (no metadata cached and all metadata cached). The test provided by Pavel (first Link tag at the bottom), which uses files with a large number of holes, was also used to measure the gains, and it consists on a small C program and a shell script to invoke it. The C program is the following: $ cat pavels-test.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <linux/fiemap.h> #define FILE_INTERVAL (1<<13) /* 8Kb */ long long interval(struct timeval t1, struct timeval t2) { long long val = 0; val += (t2.tv_usec - t1.tv_usec); val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000; return val; } int main(int argc, char **argv) { struct fiemap fiemap = {}; struct timeval t1, t2; char data = 'a'; struct stat st; int fd, off, file_size = FILE_INTERVAL; if (argc != 3 && argc != 2) { printf("usage: %s <path> [size]\n", argv[0]); return 1; } if (argc == 3) file_size = atoi(argv[2]); if (file_size < FILE_INTERVAL) file_size = FILE_INTERVAL; file_size -= file_size % FILE_INTERVAL; fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644); if (fd < 0) { perror("open"); return 1; } for (off = 0; off < file_size; off += FILE_INTERVAL) { if (pwrite(fd, &data, 1, off) != 1) { perror("pwrite"); close(fd); return 1; } } if (ftruncate(fd, file_size)) { perror("ftruncate"); close(fd); return 1; } if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; } printf("size: %ld\n", st.st_size); printf("actual size: %ld\n", st.st_blocks * 512); fiemap.fm_length = FIEMAP_MAX_OFFSET; gettimeofday(&t1, NULL); if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) { perror("fiemap"); close(fd); return 1; } gettimeofday(&t2, NULL); printf("fiemap: fm_mapped_extents = %d\n", fiemap.fm_mapped_extents); printf("time = %lld us\n", interval(t1, t2)); close(fd); return 0; } $ gcc -o pavels_test pavels_test.c And the wrapper shell script: $ cat fiemap-pavels-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f -O no-holes $DEV mount $DEV $MNT echo echo "*********** 256M ***********" echo ./pavels-test $MNT/testfile $((1 << 28)) echo ./pavels-test $MNT/testfile $((1 << 28)) echo echo "*********** 512M ***********" echo ./pavels-test $MNT/testfile $((1 << 29)) echo ./pavels-test $MNT/testfile $((1 << 29)) echo echo "*********** 1G ***********" echo ./pavels-test $MNT/testfile $((1 << 30)) echo ./pavels-test $MNT/testfile $((1 << 30)) umount $MNT Running his reproducer before applying the patchset: *********** 256M *********** size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 4003133 us size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 4895330 us *********** 512M *********** size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 30123675 us size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 33450934 us *********** 1G *********** size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 224924074 us size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 217239242 us Running it after applying the patchset: *********** 256M *********** size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 29475 us size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 29307 us *********** 512M *********** size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 58996 us size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 59115 us *********** 1G *********** size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 116251 time = 124141 us size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 119387 us The speedup is massive, both on the first fiemap call and on the second one as well, as his test creates files with many holes and small extents (every extent follows a hole and precedes another hole). For the 256M file we go from 4 seconds down to 29 milliseconds in the first run, and then from 4.9 seconds down to 29 milliseconds again in the second run, a speedup of 138x and 169x, respectively. For the 512M file we go from 30.1 seconds down to 59 milliseconds in the first run, and then from 33.5 seconds down to 59 milliseconds again in the second run, a speedup of 510x and 568x, respectively. For the 1G file, we go from 225 seconds down to 124 milliseconds in the first run, and then from 217 seconds down to 119 milliseconds in the second run, a speedup of 1815x and 1824x, respectively. Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
12a824dc |
|
01-Sep-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: speedup checking for extent sharedness during fiemap One of the most expensive tasks performed during fiemap is to check if an extent is shared. This task has two major steps: 1) Check if the data extent is shared. This implies checking the extent item in the extent tree, checking delayed references, etc. If we find the data extent is directly shared, we terminate immediately; 2) If the data extent is not directly shared (its extent item has a refcount of 1), then it may be shared if we have snapshots that share subtrees of the inode's subvolume b+tree. So we check if the leaf containing the file extent item is shared, then its parent node, then the parent node of the parent node, etc, until we reach the root node or we find one of them is shared - in which case we stop immediately. During fiemap we process the extents of a file from left to right, from file offset 0 to EOF. This means that we iterate b+tree leaves from left to right, and has the implication that we keep repeating that second step above several times for the same b+tree path of the inode's subvolume b+tree. For example, if we have two file extent items in leaf X, and the path to leaf X is A -> B -> C -> X, then when we try to determine if the data extent referenced by the first extent item is shared, we check if the data extent is shared - if it's not, then we check if leaf X is shared, if not, then we check if node C is shared, if not, then check if node B is shared, if not than check if node A is shared. When we move to the next file extent item, after determining the data extent is not shared, we repeat the checks for X, C, B and A - doing all the expensive searches in the extent tree, delayed refs, etc. If we have thousands of tile extents, then we keep repeating the sharedness checks for the same paths over and over. On a file that has no shared extents or only a small portion, it's easy to see that this scales terribly with the number of extents in the file and the sizes of the extent and subvolume b+trees. This change eliminates the repeated sharedness check on extent buffers by caching the results of the last path used. The results can be used as long as no snapshots were created since they were cached (for not shared extent buffers) or no roots were dropped since they were cached (for shared extent buffers). This greatly reduces the time spent by fiemap for files with thousands of extents and/or large extent and subvolume b+trees. Example performance test: $ cat fiemap-perf-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 40G gives 327680 128K file extents (due to compression). xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before this patch: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 3597 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 2107 milliseconds (metadata cached) After this patch: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1646 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 698 milliseconds (metadata cached) That's about 2.2x faster when no metadata is cached, and about 3x faster when all metadata is cached. On a real filesystem with many other files, data, directories, etc, the b+trees will be 2 or 3 levels higher, therefore this optimization will have a higher impact. Several reports of a slow fiemap show up often, the two Link tags below refer to two recent reports of such slowness. This patch, together with the next ones in the series, is meant to address that. Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c56ab99 |
|
08-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2 The problem of long mount time caused by block group item search is already known for some time, and the solution of block group tree has been proposed. There is really no need to bound this feature into extent tree v2, just introduce compat RO flag, BLOCK_GROUP_TREE, to correctly solve the problem. All the code handling block group root is already in the upstream kernel, thus this patch really only needs to introduce the new compat RO flag. This patch introduces one extra artificial limitation on block group tree feature, that free space cache v2 and no-holes feature must be enabled to use this new compat RO feature. This artificial requirement is mostly to reduce the test combinations, and can be a guideline for future features, to mostly rely on the latest default features. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14033b08 |
|
08-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: don't save block group root into super block The extent tree v2 needs a new root for storing all block group items, the whole feature hasn't been finished yet so we can afford to do some changes. My initial proposal years ago just added a new tree rootid, and load it from tree root, just like what we did for quota/free space tree/uuid/extent roots. But the extent tree v2 patches introduced a completely new way to store block group tree root into super block which is arguably wasteful. Currently there are only 3 trees stored in super blocks, and they all have their valid reasons: - Chunk root Needed for bootstrap. - Tree root Really the entry point for all trees. - Log root This is special as log root has to be updated out of existing transaction mechanism. There is not even any reason to put block group root into super blocks, the block group tree is updated at the same time as the old extent tree, no need for extra bootstrap/out-of-transaction update. So just move block group root from super block into tree root. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e327b9c |
|
25-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: dump all space infos if we abort transaction due to ENOSPC We have hit some transaction abort due to -ENOSPC internally. Normally we should always reserve enough space for metadata for every transaction, thus hitting -ENOSPC should really indicate some cases we didn't expect. But unfortunately current error reporting will only give a kernel warning and stack trace, not really helpful to debug what's causing the problem. And mount option debug_enospc can only help when user can reproduce the problem, but under most cases, such transaction abort by -ENOSPC is really hard to reproduce. So this patch will dump all space infos (data, metadata, system) when we abort the first transaction with -ENOSPC. This should at least provide some clue to us. The example of a dump would look like this: BTRFS: Transaction aborted (error -28) WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs] <call trace skipped> ---[ end trace 0000000000000000 ]--- BTRFS info (device dm-1: state A): dumping space info: BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0 BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016 BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0 BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0 BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232 BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728 BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left BTRFS info (device dm-1: state EA): forced readonly Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2bbc72f1 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't take a bio_counter reference for cloned bios Stop grabbing an extra bio_counter reference for each clone bio in a mirrored write and instead just release the one original reference in btrfs_end_bioc once all the bios for a single btrfs_bio have completed instead of at the end of btrfs_submit_bio once all bios have been submitted. This means the reference is now carried by the "upper" btrfs_bio only instead of each lower bio. Also remove the now unused btrfs_bio_counter_inc_noblocked helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb731430 |
|
25-Jul-2022 |
David Sterba <dsterba@suse.com> |
btrfs: sysfs: show discard stats and tunables in non-debug build When discard=async was introduced there were also sysfs knobs and stats for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults have been set and so far seem to satisfy all users on a range of workloads. As there are not only tunables (like iops or kbps) but also stats tracking amount of discardable bytes, that should be available when the async discard is on (otherwise it's not). The stats are moved from the per-fs debug directory, so it's under /sys/fs/btrfs/FSID/discard - discard_bitmap_bytes - amount of discarded bytes from data tracked as bitmaps - discard_extent_bytes - dtto but as extents - discard_bytes_saved - - discardable_bytes - amount of bytes that can be discarded - discardable_extents - number of extents to be discarded - iops_limit - tunable limit of number of discard IOs to be issued - kbps_limit - tunable limit of kilobytes per second issued as discard IO - max_discard_size - tunable limit for size of one IO discard request Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38622010 |
|
15-Aug-2022 |
Boris Burkov <boris@bur.io> |
btrfs: send: add support for fs-verity Preserve the fs-verity status of a btrfs file across send/recv. There is no facility for installing the Merkle tree contents directly on the receiving filesystem, so we package up the parameters used to enable verity found in the verity descriptor. This gives the receive side enough information to properly enable verity again. Note that this means that receive will have to re-compute the whole Merkle tree, similar to how compression worked before encoded_write. Since the file becomes read-only after verity is enabled, it is important that verity is added to the send stream after any file writes. Therefore, when we process a verity item, merely note that it happened, then actually create the command in the send stream during 'finish_inode_if_needed'. This also creates V3 of the send stream format, without any format changes besides adding the new commands and attributes. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d1f68ba0 |
|
23-Jul-2022 |
Omar Sandoval <osandov@osandov.com> |
btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f4403e1 |
|
25-Jul-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: add lockdep annotations for the ordered extents wait event This wait event is very similar to the pending ordered wait event in the sense that it occurs in a different context than the condition signaling for the event. The signaling occurs in btrfs_remove_ordered_extent() while the wait event is implemented in btrfs_start_ordered_extent() in fs/btrfs/ordered-data.c However, in this case a thread must not acquire the lockdep map for the ordered extents wait event when the ordered extent is related to a free space inode. That is because lockdep creates dependencies between locks acquired both in execution paths related to normal inodes and paths related to free space inodes, thus leading to false positives. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b53779e |
|
25-Jul-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: add lockdep annotations for pending_ordered wait event In contrast to the num_writers and num_extwriters wait events, the condition for the pending ordered wait event is signaled in a different context from the wait event itself. The condition signaling occurs in btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait event is implemented in btrfs_commit_transaction() in fs/btrfs/transaction.c Thus the thread signaling the condition has to acquire the lockdep map as a reader at the start of btrfs_remove_ordered_extent() and release it after it has signaled the condition. In this case some dependencies might be left out due to the placement of the annotation, but it is better than no annotation at all. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3e738c53 |
|
25-Jul-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: add lockdep annotations for transaction states wait events Add lockdep annotations for the transaction states that have wait events; 1) TRANS_STATE_COMMIT_START 2) TRANS_STATE_UNBLOCKED 3) TRANS_STATE_SUPER_COMMITTED 4) TRANS_STATE_COMPLETED The new macros introduced here to annotate the transaction states wait events have the same effect as the generic lockdep annotation macros. With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START the transaction thread has to acquire the lockdep maps for the transaction states as reader after the lockdep map for num_writers is released so that lockdep does not complain. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a9ba670 |
|
25-Jul-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: add lockdep annotations for num_extwriters wait event Similarly to the num_writers wait event in fs/btrfs/transaction.c add a lockdep annotation for the num_extwriters wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e1489b4f |
|
25-Jul-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: add lockdep annotations for num_writers wait event Annotate the num_writers wait event in fs/btrfs/transaction.c with lockdep in order to catch deadlocks involving this wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab9a323f |
|
25-Jul-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: add macros for annotating wait events with lockdep Introduce four macros that are used to annotate wait events in btrfs code with lockdep; 1) the btrfs_lockdep_init_map 2) the btrfs_lockdep_acquire, 3) the btrfs_lockdep_release 4) the btrfs_might_wait_for_event macros. The btrfs_lockdep_init_map macro is used to initialize a lockdep map. The btrfs_lockdep_<acquire,release> macros are used by threads to take the lockdep map as readers (shared lock) and release it, respectively. The btrfs_might_wait_for_event macro is used by threads to take the lockdep map as writers (exclusive lock) and release it. In general, the lockdep annotation for wait events work as follows: The condition for a wait event can be modified and signaled at the same time by multiple threads. These threads hold the lockdep map as readers when they enter a context in which blocking would prevent signaling the condition. Frequently, this occurs when a thread violates a condition (lockdep map acquire), before restoring it and signaling it at a later point (lockdep map release). The threads that block on the wait event take the lockdep map as writers (exclusive lock). These threads have to block until all the threads that hold the lockdep map as readers signal the condition for the wait event and release the lockdep map. The lockdep annotation is used to warn about potential deadlock scenarios that involve the threads that modify and signal the wait event condition and threads that block on the wait event. A simple example is illustrated below: Without lockdep: TA TB cond = false lock(A) wait_event(w, cond) unlock(A) lock(A) cond = true signal(w) unlock(A) With lockdep: TA TB rwsem_acquire_read(lockdep_map) cond = false lock(A) rwsem_acquire(lockdep_map) rwsem_release(lockdep_map) wait_event(w, cond) unlock(A) lock(A) cond = true signal(w) unlock(A) rwsem_release(lockdep_map) In the second case, with the lockdep annotation, lockdep would warn about an ABBA deadlock, while the first case would just deadlock at some point. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d5b81ced |
|
30-Aug-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: fix API misuse of zone finish waiting The commit 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress") implemented a zone finish waiting mechanism to the write path of zoned mode. However, using wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and wait_var_event() just hangs because no one ever wakes it up once it goes into sleep. Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit() on fs_info->flags with a proper barrier installed. Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ced8ecf0 |
|
23-Aug-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: fix space cache corruption and potential double allocations When testing space_cache v2 on a large set of machines, we encountered a few symptoms: 1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree. All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time. Specifically, the race is as follows: 1. An extent is deleted from an uncached block group in transaction A. 2. btrfs_commit_transaction() is called for transaction A. 3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent. 4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree. 5. do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start. 6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet. 7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX. 8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED. 9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances. 10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX. 11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again! This explains all of our symptoms above: * If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST. * If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated *again*, the free space tree will be corrupted (namely, the wrong item will be deleted). * If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated. The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug. The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions. A few extra changes fall out of this change. Namely, this fix does the following, with step 2 being the crucial fix: 1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to. 2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option. 3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done(). 4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning. 5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead. 6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore. Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b40130b2 |
|
26-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix lockdep splat with reloc root extent buffers We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); *** DEADLOCK *** 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has a owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merged. This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
81bd9328 |
|
06-Jul-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: fix repair of compressed extents Currently the checksum of compressed extents is verified based on the compressed data and the lower btrfs_bio, but the actual repair process is driven by end_bio_extent_readpage on the upper btrfs_bio for the decompressed data. This has a bunch of issues, including not being able to properly communicate the failed mirror up in case that the I/O submission got preempted, a general loss of if an error was an I/O error or a checksum verification failure, but most importantly that this design causes btrfs_clean_io_failure to eventually write back the uncompressed good data onto the disk sectors that are supposed to contain compressed data. Fix this by moving the repair to the lower btrfs_bio. To do so, a fair amount of code has to be reshuffled: a) the lower btrfs_bio now needs a valid csum pointer. The easiest way to achieve that is to pass NULL btrfs_lookup_bio_sums and just use the btrfs_bio management of csums. For a compressed_bio that is split into multiple btrfs_bios this means additional memory allocations, but the code becomes a lot more regular. b) checksum verification now runs directly on the lower btrfs_bio instead of the compressed_bio. This actually nicely simplifies the end I/O processing. c) btrfs_repair_one_sector can't just look up the logical address for the file offset any more, as there is no corresponding relative offsets that apply to the file offset and the logic address for compressed extents. Instead require that the saved bvec_iter in the btrfs_bio is filled out for all read bios and use that, which again removes a fair amount of code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7959bd44 |
|
06-Jul-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove the start argument to check_data_csum and export Derive the value of start from the btrfs_bio now that ->file_offset is always valid. Also export and rename the function so it's available outside of inode.c as we'll need that soon. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2ce543f4 |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: wait until zone is finished when allocation didn't progress When the allocated position doesn't progress, we cannot submit IOs to finish a block group, but there should be ongoing IOs that will finish a block group. So, in that case, we wait for a zone to be finished and retry the allocation after that. Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to indicate we need a zone finish to have proceeded. The flag is set when the allocator detected it cannot activate a new block group. And, it is cleared once a zone is finished. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7d7672bc |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: convert count_max_extents() to use fs_info->max_extent_size If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number of extents needed, btrfs release the metadata reservation too much on its way to write out the data. Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size, convert count_max_extents() to use it instead, and fix the calculation of the metadata reservation. CC: stable@vger.kernel.org # 5.12+ Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7b12a62 |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size On zoned filesystem, data write out is limited by max_zone_append_size, and a large ordered extent is split according the size of a bio. OTOH, the number of extents to be written is calculated using BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the metadata bytes to update and/or create the metadata items. The metadata reservation is done at e.g, btrfs_buffered_write() and then released according to the estimation changes. Thus, if the number of extent increases massively, the reserved metadata can run out. The increase of the number of extents easily occurs on zoned filesystem if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the following warning on a small RAM environment with disabling metadata over-commit (in the following patch). [75721.498492] ------------[ cut here ]------------ [75721.505624] BTRFS: block rsv 1 returned -28 [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs] [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109 [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021 [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs] [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286 [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000 [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7 [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28 [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a [75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000 [75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0 [75721.730499] Call Trace: [75721.735166] <TASK> [75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs] [75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs] [75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs] [75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs] [75721.769520] ? push_leaf_left+0x420/0x620 [btrfs] [75721.776431] ? memcpy+0x4e/0x60 [75721.781931] split_leaf+0x433/0x12d0 [btrfs] [75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs] [75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs] [75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs] [75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs] [75721.818300] ? lock_downgrade+0x7c0/0x7c0 [75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs] [75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs] [75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs] [75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs] [75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs] [75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs] [75721.869313] ? rcu_read_lock_sched_held+0x16/0x80 [75721.876085] ? lock_release+0x552/0xf80 [75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs] [75721.888886] ? __kasan_check_write+0x14/0x20 [75721.895152] ? do_raw_read_unlock+0x44/0x80 [75721.901323] ? _raw_write_lock_irq+0x60/0x80 [75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs] [75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs] [75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs] [75721.929166] ? _raw_write_unlock+0x23/0x40 [75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs] [75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs] [75721.949906] ? try_to_wake_up+0x30/0x14a0 [75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs] [75721.962661] ? rcu_read_lock_sched_held+0x16/0x80 [75721.969111] ? lock_acquire+0x41b/0x4c0 [75721.974982] finish_ordered_fn+0x15/0x20 [btrfs] [75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs] [75721.988184] ? _raw_spin_unlock_irq+0x28/0x50 [75721.994643] process_one_work+0x815/0x1460 [75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250 [75722.006643] ? do_raw_spin_trylock+0xbb/0x190 [75722.013086] worker_thread+0x59a/0xeb0 [75722.018511] kthread+0x2ac/0x360 [75722.023428] ? process_one_work+0x1460/0x1460 [75722.029431] ? kthread_complete_and_exit+0x30/0x30 [75722.036044] ret_from_fork+0x22/0x30 [75722.041255] </TASK> [75722.045047] irq event stamp: 0 [75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0 [75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0 [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0 [75722.085335] ---[ end trace 0000000000000000 ]--- To fix the estimation, we need to introduce fs_info->max_extent_size to replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for regular vs zoned filesystem. Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned filesystem, it is set to fs_info->max_zone_append_size. CC: stable@vger.kernel.org # 5.12+ Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2ae7b77 |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: revive max_zone_append_bytes This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned: remove max_zone_append_size logic"), but without unnecessary ASSERT and check. The max_zone_append_size will be used as a hint to estimate the number of extents to cover delalloc/writeback region in the later commits. The size of a ZONE APPEND bio is also limited by queue_max_segments(), so this commit considers it to calculate max_zone_append_size. Technically, a bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE" as an upper limit of an extent size to calculate the number of extents needed to write data. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e55958c8 |
|
14-Jun-2022 |
Ioannis Angelakopoulos <iangelak@fb.com> |
btrfs: collect commit stats, count, duration Track several stats about transaction commit, to be later exported via sysfs: - number of commits so far - duration of the last commit in ns - maximum commit duration seen so far in ns - total duration for all commits so far in ns The update of the commit stats occurs after the commit thread has gone through all the logic that checks if there is another thread committing at the same time. This means that we only account for actual commit work in the commit stats we report and not the time the thread spends waiting until it is ready to do the commit work. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
37f85ec3 |
|
13-Jun-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use named constant for reserved device space There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d92b304 |
|
25-Jun-2020 |
David Sterba <dsterba@suse.com> |
btrfs: pass bits by value not by pointer for extent_state helpers The bits are passed to all extent state helpers for no apparent reason, the value only read and never updated so remove the indirection and pass it directly. Also unify the type to u32 where needed. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97f09d55 |
|
07-Jun-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: make btrfs_super_block::log_root_transid deprecated When using "btrfs inspect-internal dump-super" to inspect an fs with dirty log, it always shows the log_root_transid as 0: log_root 30474240 log_root_transid 0 <<< log_root_level 0 It turns out that, btrfs_super_block::log_root_transid is never really utilized (even no read for it). This can date back to the introduction of btrfs into upstream kernel. In fact, when reading log tree root, we always use btrfs_super_block::generation + 1 as the expected generation. So here we're completely safe to mark this member deprecated. In theory we can easily reuse this member for other purposes, but to be extra safe, here we follow the leafsize way, by adding "__unused_" for log_root_transid. And we can safely remove the accessors, since there is no such callers from the very beginning. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7b9416f |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove btrfs_end_io_wq All reads bio that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workloads. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based a on typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fed8a72d |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't use btrfs_bio_wq_end_io for compressed writes Compressed write bio completion is the only user of btrfs_bio_wq_end_io for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal here as we only real need user context for the final completion of a compressed_bio structure, and not every single bio completion. Add a work_struct to struct compressed_bio instead and use that to call finish_compressed_bio_write. This allows to remove all handling of write bios in the btrfs_bio_wq_end_io infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d34e123d |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: defer I/O completion based on the btrfs_raid_bio Instead of attaching an extra allocation an indirect call to each low-level bio issued by the RAID code, add a work_struct to struct btrfs_raid_bio and only defer the per-rbio completion action. The per-bio action for all the I/Os are trivial and can be safely done from interrupt context. As a nice side effect this also allows sharing the boilerplate code for the per-bio completions Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c93104e7 |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: split btrfs_submit_data_bio to read and write parts Split btrfs_submit_data_bio into one helper for reads and one for writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3ea4dc5b |
|
17-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: send: send compressed extents with encoded writes Now that all of the pieces are in place, we can use the ENCODED_WRITE command to send compressed extents when appropriate. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a89ce08c |
|
22-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: factor out a btrfs_csum_ptr helper Add a helper to find the csum for a byte offset into the csum buffer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ae643a74 |
|
22-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: introduce a data checksum checking helper Although we have several data csum verification code, we never have a function really just to verify checksum for one sector. Function check_data_csum() do extra work for error reporting, thus it requires a lot of extra things like file offset, bio_offset etc. Function btrfs_verify_data_csum() is even worse, it will utilize page checked flag, which means it can not be utilized for direct IO pages. Here we introduce a new helper, btrfs_check_sector_csum(), which really only accept a sector in page, and expected checksum pointer. We use this function to implement check_data_csum(), and export it for incoming patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: keep passing the csum array as an arguments, as the callers want to print it, rename per request] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
143823cf |
|
25-May-2022 |
David Sterba <dsterba@suse.com> |
btrfs: fix typos in comments Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
088aea3b |
|
15-Jul-2022 |
David Sterba <dsterba@suse.com> |
Revert "btrfs: turn delayed_nodes_tree into an XArray" This reverts commit 253bf57555e451dec5a7f09dc95d380ce8b10e5b. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
01cd3909 |
|
15-Jul-2022 |
David Sterba <dsterba@suse.com> |
Revert "btrfs: turn fs_info member buffer_radix into XArray" This reverts commit 8ee922689d67b7cfa6acbe2aa1ee76ac72e6fc8a. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc7cbcd4 |
|
15-Jul-2022 |
David Sterba <dsterba@suse.com> |
Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray" This reverts commit 48b36a602a335c184505346b5b37077840660634. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
983d8209 |
|
06-Jun-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add missing inode updates on each iteration when replacing extents When replacing file extents, called during fallocate, hole punching, clone and deduplication, we may not be able to replace/drop all the target file extent items with a single transaction handle. We may get -ENOSPC while doing it, in which case we release the transaction handle, balance the dirty pages of the btree inode, flush delayed items and get a new transaction handle to operate on what's left of the target range. By dropping and replacing file extent items we have effectively modified the inode, so we should bump its iversion and update its mtime/ctime before we update the inode item. This is because if the transaction we used for partially modifying the inode gets committed by someone after we release it and before we finish the rest of the range, a power failure happens, then after mounting the filesystem our inode has an outdated iversion and mtime/ctime, corresponding to the values it had before we changed it. So add the missing iversion and mtime/ctime updates. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a3e171a0 |
|
05-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: move struct btrfs_dio_private to inode.c The btrfs_dio_private structure is only used in inode.c, so move the definition there. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
36e8c622 |
|
05-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: add a btrfs_dio_rw wrapper Add a wrapper around iomap_dio_rw that keeps the direct I/O internals isolated in inode.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb3a12d9 |
|
27-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: rename bio_flags in parameters and switch type Several functions take parameter bio_flags that was simplified to just compress type, unify it and change the type accordingly. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fe6a5a1 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: sink parameter is_data to btrfs_set_disk_extent_flags The parameter has been added in 2009 in the infamous monster commit 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") but not used ever since. We can sink it and allow further simplifications. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48b36a60 |
|
02-May-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray … rename it to simply fs_roots and adjust all usages of this object to use the XArray API, because it is notionally easier to use and understand, as it provides array semantics, and also takes care of locking for us, further simplifying the code. Also do some refactoring, esp. where the API change requires largely rewriting some functions, anyway. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ee92268 |
|
21-Apr-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: turn fs_info member buffer_radix into XArray … named 'extent_buffers'. Also adjust all usages of this object to use the XArray API, which greatly simplifies the code as it takes care of locking and is generally easier to use and understand, providing notionally simpler array semantics. Also perform some light refactoring. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
253bf575 |
|
26-Apr-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: turn delayed_nodes_tree into an XArray … in the btrfs_root struct and adjust all usages of this object to use the XArray API, because it is notionally easier to use and understand, as it provides array semantics, and also takes care of locking for us, further simplifying the code. Also use the opportunity to do some light refactoring. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
385de0ef |
|
17-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: use a normal workqueue for rmw_workers rmw_workers doesn't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
be539518 |
|
17-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: use normal workqueues for scrub All three scrub workqueues don't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a31b4a43 |
|
17-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue Just let the one caller that wants optional WQ_HIGHPRI handling allocate a separate btrfs_workqueue for that. This allows to rename struct __btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and separate allocation for all btrfs_workqueue users and generally simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad357938 |
|
15-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: do not return errors from submit_bio_hook_t instances Both btrfs_repair_one_sector and submit_bio_one as the direct caller of one of the instances ignore errors as they expect the methods themselves to call ->bi_end_io on error. Remove the unused and dangerous return value. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7aab8b32 |
|
15-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: move btrfs_readpage to extent_io.c Keep btrfs_readpage next to btrfs_do_readpage and the other address space operations. This allows to keep submit_one_bio and struct btrfs_bio_ctrl file local in extent_io.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16b0c258 |
|
13-Apr-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use a read/write lock for protecting the block groups tree Currently we use a spin lock to protect the red black tree that we use to track block groups. Most accesses to that tree are actually read only and for large filesystems, with thousands of block groups, it actually has a bad impact on performance, as concurrent read only searches on the tree are serialized. Read only searches on the tree are very frequent and done when: 1) Pinning and unpinning extents, as we need to lookup the respective block group from the tree; 2) Freeing the last reference of a tree block, regardless if we pin the underlying extent or add it back to free space cache/tree; 3) During NOCOW writes, both buffered IO and direct IO, we need to check if the block group that contains an extent is read only or not and to increment the number of NOCOW writers in the block group. For those operations we need to search for the block group in the tree. Similarly, after creating the ordered extent for the NOCOW write, we need to decrement the number of NOCOW writers from the same block group, which requires searching for it in the tree; 4) Decreasing the number of extent reservations in a block group; 5) When allocating extents and freeing reserved extents; 6) Adding and removing free space to the free space tree; 7) When releasing delalloc bytes during ordered extent completion; 8) When relocating a block group; 9) During fitrim, to iterate over the block groups; 10) etc; Write accesses to the tree, to add or remove block groups, are much less frequent as they happen only when allocating a new block group or when deleting a block group. We also use the same spin lock to protect the list of currently caching block groups. Additions to this list are made when we need to cache a block group, because we don't have a free space cache for it (or we have but it's invalid), and removals from this list are done when caching of the block group's free space finishes. These cases are also not very common, but when they happen, they happen only once when the filesystem is mounted. So switch the lock that protects the tree of block groups from a spinning lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
08dddb29 |
|
13-Apr-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use rbtree with leftmost node cached for tracking lowest block group We keep track of the start offset of the block group with the lowest start offset at fs_info->first_logical_byte. This requires explicitly updating that field every time we add, delete or lookup a block group to/from the red black tree at fs_info->block_group_cache_tree. Since the block group with the lowest start address happens to always be the one that is the leftmost node of the tree, we can use a red black tree that caches the left most node. Then when we need the start address of that block group, we can just quickly get the leftmost node in the tree and extract the start offset of that node's block group. This avoids the need to explicitly keep track of that address in the dedicated member fs_info->first_logical_byte, and it also allows the next patch in the series to switch the lock that protects the red black tree from a spin lock to a read/write lock - without this change it would be tricky because block group searches also update fs_info->first_logical_byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e010b3d |
|
24-Mar-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove the zoned/zone_size union in struct btrfs_fs_info Reading a value from a different member of a union is not just a great way to obfuscate code, but also creates an aliasing violation. Switch btrfs_is_zoned to look at ->zone_size and remove the union. Note: union was to simplify the detection of zoned filesystem but now this is wrapped behind btrfs_is_zoned so we can drop the union. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d4135134 |
|
23-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid blocking on space revervation when doing nowait dio writes When doing a NOWAIT direct IO write, if we can NOCOW then it means we can proceed with the non-blocking, NOWAIT path. However reserving the metadata space and qgroup meta space can often result in blocking - flushing delalloc, wait for ordered extents to complete, trigger transaction commits, etc, going against the semantics of a NOWAIT write. So make the NOWAIT write path to try to reserve all the metadata it needs without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT then return -EAGAIN to make the caller fallback to a blocking direct IO write. This is part of a patchset comprised of the following patches: btrfs: avoid blocking on page locks with nowait dio on compressed range btrfs: avoid blocking nowait dio when locking file range btrfs: avoid double nocow check when doing nowait dio writes btrfs: stop allocating a path when checking if cross reference exists btrfs: free path at can_nocow_extent() before checking for checksum items btrfs: release path earlier at can_nocow_extent() btrfs: avoid blocking when allocating context for nowait dio read/write btrfs: avoid blocking on space revervation when doing nowait dio writes The following test was run before and after applying this patchset: $ cat io-uring-nodatacow-test.sh #!/bin/bash DEV=/dev/sdc MNT=/mnt/sdc MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=4 FILE_SIZE=8G RUN_TIME=300 cat <<EOF > /tmp/fio-job.ini [io_uring_rw] rw=randrw fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 filesize=$FILE_SIZE runtime=$RUN_TIME time_based filename=foobar directory=$MNT numjobs=$NUM_JOBS thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run a 12 cores box with 64G of ram, using a non-debug kernel config (Debian's default config) and a spinning disk. Result before the patchset: READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec Result after the patchset: READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec That's about +7.2% throughput for reads and +6.9% for writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a89f173 |
|
23-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop allocating a path when checking if cross reference exists At btrfs_cross_ref_exist() we always allocate a path, but we really don't need to because all its callers (only 2) already have an allocated path that is not being used when they call btrfs_cross_ref_exist(). So change btrfs_cross_ref_exist() to take a path as an argument and update both its callers to pass in the unused path they have when they call btrfs_cross_ref_exist(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b0a66a31 |
|
17-Mar-2022 |
Jonathan Lassoff <jof@thejof.com> |
btrfs: add messages to printk index In order for end users to quickly react to new issues that come up in production, it is proving useful to leverage this printk indexing system. This printk index enables kernel developers to use calls to printk() with changeable ad-hoc format strings, while still enabling end users to detect changes and develop a semi-stable interface for detecting and parsing these messages. So that detailed Btrfs messages are captured by this printk index, this patch wraps btrfs_printk and btrfs_handle_fs_error with macros. Example of the generated list: https://lore.kernel.org/lkml/12588e13d51a9c3bf59467d3fc1ac2162f1275c1.1647539056.git.jof@thejof.com Signed-off-by: Jonathan Lassoff <jof@thejof.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
63c34cb4 |
|
15-Mar-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helper to assert an inode range is clean We have four different scenarios where we don't expect to find ordered extents after locking a file range: 1) During plain fallocate; 2) During hole punching; 3) During zero range; 4) During reflinks (both cloning and deduplication). This is because in all these cases we follow the pattern: 1) Lock the inode's VFS lock in exclusive mode; 2) Lock the inode's i_mmap_lock in exclusive node, to serialize with mmap writes; 3) Flush delalloc in a file range and wait for all ordered extents to complete - both done through btrfs_wait_ordered_range(); 4) Lock the file range in the inode's io_tree. So add a helper that asserts that we don't have ordered extents for a given range. Make the four scenarios listed above use this helper after locking the respective file range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
caae78e0 |
|
14-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: move common inode creation code into btrfs_create_new_inode() All of our inode creation code paths duplicate the calls to btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation additionally duplicates property inheritance and the call to btrfs_set_inode_index(). Fix this by moving the common code into btrfs_create_new_inode(). This accomplishes a few things at once: 1. It reduces code duplication. 2. It allows us to set up the inode completely before inserting the inode item, removing calls to btrfs_update_inode(). 3. It fixes a leak of an inode on disk in some error cases. For example, in btrfs_create(), if btrfs_new_inode() succeeds, then we have inserted an inode item and its inode ref. However, if something after that fails (e.g., btrfs_init_inode_security()), then we end the transaction and then decrement the link count on the inode. If the transaction is committed and the system crashes before the failed inode is deleted, then we leak that inode on disk. Instead, this refactoring aborts the transaction when we can't recover more gracefully. 4. It exposes various ways that subvolume creation diverges from mkdir in terms of inheriting flags, properties, permissions, and POSIX ACLs, a lot of which appears to be accidental. This patch explicitly does _not_ change the existing non-standard behavior, but it makes those differences more clear in the code and documents them so that we can discuss whether they should be changed. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3538d68dbd |
|
14-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: reserve correct number of items for inode creation The various inode creation code paths do not account for the compression property, POSIX ACLs, or the parent inode item when starting a transaction. Fix it by refactoring all of these code paths to use a new function, btrfs_new_inode_prepare(), which computes the correct number of items. To do so, it needs to know whether POSIX ACLs will be created, so move the ACL creation into that function. To reduce the number of arguments that need to be passed around for inode creation, define struct btrfs_new_inode_args containing all of the relevant information. btrfs_new_inode_prepare() will also be a good place to set up the fscrypt context and encrypted filename in the future. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a1fd0c35 |
|
14-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: allocate inode outside of btrfs_new_inode() Instead of calling new_inode() and inode_init_owner() inside of btrfs_new_inode(), do it in the callers. This allows us to pass in just the inode instead of the mnt_userns and mode and removes the need for memalloc_nofs_{save,restores}() since we do it before starting a transaction. In create_subvol(), it also means we no longer have to look up the inode again to instantiate it. This also paves the way for some more cleanups in later patches. This also removes the comments about Smack checking i_op, which are no longer true since commit 5d6c31910bc0 ("xattr: Add __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags & IOP_XATTR, which is set based on sb->s_xattr. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62142be3 |
|
09-Mar-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: introduce btrfs_for_each_slot iterator macro There is a common pattern when searching for a key in btrfs: * Call btrfs_search_slot to find the slot for the key * Enter an endless loop: * If the found slot is larger than the no. of items in the current leaf, check the next leaf * If it's still not found in the next leaf, terminate the loop * Otherwise do something with the found key * Increment the current slot and continue To reduce code duplication, we can replace this code pattern with an iterator macro, similar to the existing for_each_X macros found elsewhere in the kernel. This also makes the code easier to understand for newcomers by putting a name to the encapsulated functionality. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb12489b |
|
29-Apr-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
btrfs: Convert btrfs to read_folio This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the filesystem. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
#
5f0addf7 |
|
18-Apr-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: use dedicated lock for data relocation Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive writeback of the relocation data inode in btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock in the following path. Thread A takes btrfs_inode_lock() and waits for metadata reservation by e.g, waiting for writeback: prealloc_file_extent_cluster() - btrfs_inode_lock(&inode->vfs_inode, 0); - btrfs_prealloc_file_range() ... - btrfs_replace_file_extents() - btrfs_start_transaction ... - btrfs_reserve_metadata_bytes() Thread B (e.g, doing a writeback work) needs to wait for the inode lock to continue writeback process: do_writepages - btrfs_writepages - extent_writpages - btrfs_zoned_data_reloc_lock(BTRFS_I(inode)); - btrfs_inode_lock() The deadlock is caused by relying on the vfs_inode's lock. By using it, we introduced unnecessary exclusion of writeback and btrfs_prealloc_file_range(). Also, the lock at this point is useless as we don't have any dirty pages in the inode yet. Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive writeback. Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16.x: 869f4cdc73f9: btrfs: zoned: encapsulate inode locking for zoned relocation CC: stable@vger.kernel.org # 5.16.x CC: stable@vger.kernel.org # 5.17 Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
895586eb |
|
09-Feb-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
btrfs: Convert from invalidatepage to invalidate_folio A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
|
#
7eefae6b |
|
18-Feb-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass btrfs_fs_info to btrfs_recover_relocation We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c067da87 |
|
23-Feb-2022 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: add filesystems state details to error messages When a filesystem goes read-only due to an error, multiple errors tend to be reported, some of which are knock-on failures. Logging fs_states, in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the first error from subsequent messages which may only exist due to an error state. Under the new format, most initial errors will look like: `BTRFS: error (device loop0) in ...` while subsequent errors will begin with: `error (device loop0: state E) in ...` An initial transaction abort error will look like `error (device loop0: state A) in ...` and subsequent messages will contain `(device loop0: state EA) in ...` In addition to the error states we can also print other states that are temporary, like remounting, device replace, or indicate a global state that may affect functionality. Now implemented: E - filesystem error detected A - transaction aborted L - log tree errors M - remounting in progress R - device replace in progress C - data checksums not verified (mounted with ignoredatacsums) Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c0c7269 |
|
13-Aug-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: add BTRFS_IOC_ENCODED_WRITE The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1881fba8 |
|
09-Oct-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: add BTRFS_IOC_ENCODED_READ ioctl There are 4 main cases: 1. Inline extents: we copy the data straight out of the extent buffer. 2. Hole/preallocated extents: we fill in zeroes. 3. Regular, uncompressed extents: we read the sectors we need directly from disk. 4. Regular, compressed extents: we read the entire compressed extent from disk and indicate what subset of the decompressed extent is in the file. This initial implementation simplifies a few things that can be improved in the future: - Cases 1, 3, and 4 allocate temporary memory to read into before copying out to userspace. - We don't do read repair, because it turns out that read repair is currently broken for compressed data. - We hold the inode lock during the operation. Note that we don't need to hold the mmap lock. We may race with btrfs_page_mkwrite() and read the old data from before the page was dirtied: btrfs_page_mkwrite btrfs_encoded_read --------------------------------------------------- (enter) (enter) btrfs_wait_ordered_range lock_extent_bits btrfs_page_set_dirty unlock_extent_cached (exit) lock_extent_bits read extent (dirty page hasn't been flushed, so this is the old data) unlock_extent_cached (exit) we read the old data from before the page was dirtied. But, that's true even if we were to hold the mmap lock: btrfs_page_mkwrite btrfs_encoded_read ------------------------------------------------------------------- (enter) (enter) btrfs_inode_lock(BTRFS_ILOCK_MMAP) down_read(i_mmap_lock) (blocked) btrfs_wait_ordered_range lock_extent_bits read extent (page hasn't been dirtied, so this is the old data) unlock_extent_cached btrfs_inode_unlock(BTRFS_ILOCK_MMAP) down_read(i_mmap_lock) returns lock_extent_bits btrfs_page_set_dirty unlock_extent_cached In other words, this is inherently racy, so it's fine that we return the old data in this tiny window. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
28c9b1e7 |
|
18-Nov-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: support different disk extent size for delalloc Currently, we always reserve the same extent size in the file and extent size on disk for delalloc because the former is the worst case for the latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of the extent on disk, which may be less than or greater than (for bookends) the size in the file. Add a disk_num_bytes parameter to btrfs_delalloc_reserve_metadata() so that we can reserve the correct amount of csum bytes. No functional change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e331f6b1 |
|
06-Nov-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() btrfs_csum_one_bio() loops over each filesystem block in the bio while keeping a cursor of its current logical position in the file in order to look up the ordered extent to add the checksums to. However, this doesn't make much sense for compressed extents, as a sector on disk does not correspond to a sector of decompressed file data. It happens to work because: 1) the compressed bio always covers one ordered extent 2) the size of the bio is always less than the size of the ordered extent However, the second point will not always be true for encoded writes. Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that it can assume that the bio only covers one ordered extent. Since we're already changing the signature, let's get rid of the contig parameter and make it implied by the offset parameter, similar to the change we recently made to btrfs_lookup_bio_sums(). Additionally, let's rename nr_sectors to blockcount to make it clear that it's the number of filesystem blocks, not the number of 512-byte sectors. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a55e65b8 |
|
01-Feb-2022 |
David Sterba <dsterba@suse.com> |
btrfs: replace BUILD_BUG_ON by static_assert The static_assert introduced in 6bab69c65013 ("build_bug.h: add wrapper for _Static_assert") has been supported by compilers for a long time (gcc 4.6, clang 3.0) and can be used in header files. We don't need to put BUILD_BUG_ON to random functions but rather keep it next to the definition. The exception here is the UAPI header btrfs_tree.h that could be potentially included by userspace code and the static assert is not defined (nor used in any other header). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7238e50 |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add support for multiple global roots With extent tree v2 you will be able to create multiple csum, extent, and free space trees. They will be used based on the block group, which will now use the block_group_item->chunk_objectid to point to the set of global roots that it will use. When allocating new block groups we'll simply mod the gigabyte offset of the block group against the number of global roots we have and that will be the block groups global id. >From there we can take the bytenr that we're modifying in the respective tree, look up the block group and get that block groups corresponding global root id. From there we can get to the appropriate global root for that bytenr. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c54e80d |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add code to support the block group root This code adds the on disk structures for the block group root, which will hold the block group items for extent tree v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2c7d2a23 |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add definition for EXTENT_TREE_V2 This adds the initial definition of the EXTENT_TREE_V2 incompat feature flag. This also hides the support behind CONFIG_BTRFS_DEBUG. THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A DEVELOPER OR A TESTER. The format is in flux and will be added in stages, any fs will need to be re-made between updates to the format. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4be6aef |
|
18-Feb-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not start relocation until in progress drops are done We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
558732df |
|
13-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: reduce extent threshold for autodefrag There is a big gap between inode_should_defrag() and autodefrag extent size threshold. For inode_should_defrag() it has a flexible @small_write value. For compressed extent is 16K, and for non-compressed extent it's 64K. However for autodefrag extent size threshold, it's always fixed to the default value (256K). This means, the following write sequence will trigger autodefrag to defrag ranges which didn't trigger autodefrag: pwrite 0 8k sync pwrite 8k 128K sync The latter 128K write will also be considered as a defrag target (if other conditions are met). While only that 8K write is really triggering autodefrag. Such behavior can cause extra IO for autodefrag. Close the gap, by copying the @small_write value into inode_defrag, so that later autodefrag can use the same @small_write value which triggered autodefrag. With the existing transid value, this allows autodefrag really to scan the ranges which triggered autodefrag. Although this behavior change is mostly reducing the extent_thresh value for autodefrag, I believe in the future we should allow users to specify the autodefrag extent threshold through mount options, but that's an other problem to consider in the future. CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
40cdc509 |
|
18-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: skip reserved bytes warning on unmount after log cleanup failure After the recent changes made by commit c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it") and its followup fix, commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an extent buffer"), we can now end up not cleaning up space reservations of log tree extent buffers after a transaction abort happens, as well as not cleaning up still dirty extent buffers. This happens because if writeback for a log tree extent buffer failed, then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on, when trying to free the log tree with free_log_tree(), which iterates over the tree, we can end up getting an -EIO error when trying to read a node or a leaf, since read_extent_buffer_pages() returns -EIO if an extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return immediately as we can not iterate over the entire tree. In that case we never update the reserved space for an extent buffer in the respective block group and space_info object. When this happens we get the following traces when unmounting the fs: [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure [174957.399379] ------------[ cut here ]------------ [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs] [174957.407523] Modules linked in: btrfs overlay dm_zero (...) [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1 [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs] [174957.429717] Code: 21 48 8b bd (...) [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206 [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8 [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000 [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000 [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148 [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100 [174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000 [174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0 [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [174957.443948] Call Trace: [174957.444264] <TASK> [174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs] [174957.445238] close_ctree+0x301/0x357 [btrfs] [174957.445803] ? call_rcu+0x16c/0x290 [174957.446250] generic_shutdown_super+0x74/0x120 [174957.446832] kill_anon_super+0x14/0x30 [174957.447305] btrfs_kill_super+0x12/0x20 [btrfs] [174957.447890] deactivate_locked_super+0x31/0xa0 [174957.448440] cleanup_mnt+0x147/0x1c0 [174957.448888] task_work_run+0x5c/0xa0 [174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0 [174957.449934] syscall_exit_to_user_mode+0x16/0x40 [174957.450512] do_syscall_64+0x48/0xc0 [174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae [174957.451605] RIP: 0033:0x7f328fdc4a97 [174957.452059] Code: 03 0c 00 f7 (...) [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97 [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0 [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40 [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000 [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000 [174957.460193] </TASK> [174957.460534] irq event stamp: 0 [174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0 [174957.466323] ---[ end trace bc7ee0c490bce3af ]--- [174957.467282] ------------[ cut here ]------------ [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs] [174957.470066] Modules linked in: btrfs overlay dm_zero (...) [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1 [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs] [174957.488050] Code: 00 00 00 ad de (...) [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206 [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000 [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846 [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000 [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000 [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100 [174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000 [174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0 [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [174957.506167] Call Trace: [174957.506654] <TASK> [174957.507047] close_ctree+0x301/0x357 [btrfs] [174957.507867] ? call_rcu+0x16c/0x290 [174957.508567] generic_shutdown_super+0x74/0x120 [174957.509447] kill_anon_super+0x14/0x30 [174957.510194] btrfs_kill_super+0x12/0x20 [btrfs] [174957.511123] deactivate_locked_super+0x31/0xa0 [174957.511976] cleanup_mnt+0x147/0x1c0 [174957.512610] task_work_run+0x5c/0xa0 [174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0 [174957.514231] syscall_exit_to_user_mode+0x16/0x40 [174957.515069] do_syscall_64+0x48/0xc0 [174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae [174957.516688] RIP: 0033:0x7f328fdc4a97 [174957.517413] Code: 03 0c 00 f7 d8 (...) [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97 [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0 [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40 [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000 [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000 [174957.529404] </TASK> [174957.529843] irq event stamp: 0 [174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040 [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0 [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]--- [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0 [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0 [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0 [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0 [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0 [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0 This also means that in case we have log tree extent buffers that are still dirty, we can end up not cleaning them up in case we find an extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case we have no way for iterating over the rest of the tree. This issue is very often triggered with test cases generic/475 and generic/648 from fstests. The issue could almost be fixed by iterating over the io tree attached to each log root which keeps tracks of the range of allocated extent buffers, log_root->dirty_log_pages, however that does not work and has some inconveniences: 1) After we sync the log, we clear the range of the extent buffers from the io tree, so we can't find them after writeback. We could keep the ranges in the io tree, with a separate bit to signal they represent extent buffers already written, but that means we need to hold into more memory until the transaction commits. How much more memory is used depends a lot on whether we are able to allocate contiguous extent buffers on disk (and how often) for a log tree - if we are able to, then a single extent state record can represent multiple extent buffers, otherwise we need multiple extent state record structures to track each extent buffer. In fact, my earlier approach did that: https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/ However that can cause a very significant negative impact on performance, not only due to the extra memory usage but also because we get a larger and deeper dirty_log_pages io tree. We got a report that, on beefy machines at least, we can get such performance drop with fsmark for example: https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/ 2) We would be doing it only to deal with an unexpected and exceptional case, which is basically failure to read an extent buffer from disk due to IO failures. On a healthy system we don't expect transaction aborts to happen after all; 3) Instead of relying on iterating the log tree or tracking the ranges of extent buffers in the dirty_log_pages io tree, using the radix tree that tracks extent buffers (fs_info->buffer_radix) to find all log tree extent buffers is not reliable either, because after writeback of an extent buffer it can be evicted from memory by the release page callback of the btree inode (btree_releasepage()). Since there's no way to be able to properly cleanup a log tree without being able to read its extent buffers from disk and without using more memory to track the logical ranges of the allocated extent buffers do the following: 1) When we fail to cleanup a log tree, setup a flag that indicates that failure; 2) Trigger writeback of all log tree extent buffers that are still dirty, and wait for the writeback to complete. This is just to cleanup their state, page states, page leaks, etc; 3) When unmounting the fs, ignore if the number of bytes reserved in a block group and in a space_info is not 0 if, and only if, we failed to cleanup a log tree. Also ignore only for metadata block groups and the metadata space_info object. This is far from a perfect solution, but it serves to silence test failures such as those from generic/475 and generic/648. However having a non-zero value for the reserved bytes counters on unmount after a transaction abort, is not such a terrible thing and it's completely harmless, it does not affect the filesystem integrity in any way. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f26c9238 |
|
14-Dec-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove reada infrastructure Currently there is only one user for btrfs metadata readahead, and that's scrub. But even for the single user, it's not providing the correct functionality it needs, as scrub needs reada for commit root, which current readahead can't provide. (Although it's pretty easy to add such feature). Despite this, there are some extra problems related to metadata readahead: - Duplicated feature with btrfs_path::reada - Partly duplicated feature of btrfs_fs_info::buffer_radix Btrfs already caches its metadata in buffer_radix, while readahead tries to read the tree block no matter if it's already cached. - Poor layer separation Metadata readahead works kinda at device level. This is definitely not the correct layer it should be, since metadata is at btrfs logical address space, it should not bother device at all. This brings extra chance for bugs to sneak in, while brings unnecessary complexity. - Dead code In the very beginning of scrub.c we have #undef DEBUG, rendering all the debug related code useless and unable to test. Thus here I purpose to remove the metadata readahead mechanism completely. [BENCHMARK] There is a full benchmark for the scrub performance difference using the old btrfs_reada_add() and btrfs_path::reada. For the worst case (no dirty metadata, slow HDD), there could be a 5% performance drop for scrub. For other cases (even SATA SSD), there is no distinguishable performance difference. The number is reported scrub speed, in MiB/s. The resolution is limited by the reported duration, which only has a resolution of 1 second. Old New Diff SSD 455.3 466.332 +2.42% HDD 103.927 98.012 -5.69% Comprehensive test methodology is in the cover letter of the patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
54f03ab1 |
|
03-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_truncate_inode_items to inode-item.c This is an inode item related manipulation with a few vfs related adjustments. I'm going to remove the vfs related code from this helper and simplify it a lot, but I want those changes to be easily seen via git blame, so move this function now and then the simplification work can be done. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
26c2c454 |
|
03-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add an inode-item.h We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efc0e69c |
|
25-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: introduce exclusive operation BALANCE_PAUSED state Current set of exclusive operation states is not sufficient to handle all practical use cases. In particular there is a need to be able to add a device to a filesystem that have paused balance. Currently there is no way to distinguish between a running and a paused balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2 occasions: 1. When a filesystem is mounted with skip_balance and there is an unfinished balance it will now be into BALANCE_PAUSED instead of simply BALANCE state. 2. When a running balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d96b3424 |
|
21-Nov-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make send work with concurrent block group relocation We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (specially metadata) or about to read a metadata extent and expecting it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with a different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495e0e0 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations. Relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystem we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it or send can end up delaying the block group relocation for too long. In the future we might also get the automatic block group relocation for non zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way: 1) For all tree searches, send acquires a read lock on the commit root semaphore; 2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path); 3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock; 4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers. The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove of update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send; 5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits. This leaves some room for optimizations for send to have less path releases and re searching the trees when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
abed4aaa |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: track the csum, extent, and free space trees in a rb tree In the future we are going to have multiple copies of these trees. To facilitate this we need a way to lookup the different roots we are looking for. Handle this by adding a global root rb tree that is indexed on the root->root_key. Then instead of loading the roots at mount time with individually targeted keys, simply search the tree_root for anything with the specific objectid we want. This will make it straightforward to support both old style and new style file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7939dd9f |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: stop accessing ->free_space_root directly We're going to have multiple free space roots in the future, so adjust all the users of the free space root to use a helper to access the root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc28b25e |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: stop accessing ->csum_root directly We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
056c8311 |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum root We have a few places where we skip doing csums if we mounted with one of the rescue options that ignores bad csum roots. In the future when there are multiple csum roots it'll be costly to check and see if there are any missing csum roots, so simply add a flag to indicate the fs should skip loading csums in case of errors. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
29cbcf40 |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: stop accessing ->extent_root directly When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fdfbf020 |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rework async transaction committing Currently we do this awful thing where we get another ref on a trans handle, async off that handle and commit the transaction from that work. Because we do this we have to mess with current->journal_info and the freeze counting stuff. We already have an async thing to kick for the transaction commit, the transaction kthread. Replace this work struct with a flag on the fs_info to tell the kthread to go ahead and commit even if it's before our timeout. Then we can drastically simplify the async transaction commit path. Note: this can be simplified and functionality based on the pending operation COMMIT. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add note ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0af4769d |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove unused BTRFS_FS_BARRIER flag This is no longer used, the -o nobarrier is handled by BTRFS_MOUNT_NOBARRIER. Remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
54230013 |
|
09-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: get rid of root->orphan_cleanup_state Now that we don't care about the stage of the orphan_cleanup_state, simply replace it with a bit on ->state to make sure we don't call the orphan cleanup every time we wander into this root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dc2e724e |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename btrfs_item_end_nr to btrfs_item_data_end The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually the offset of the end of the data the item points to. In fact all of the helpers that we use btrfs_item_end_nr() use data in their name, like BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end() to make it clear what this helper is giving us. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a08663d |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove the btrfs_item_end() helper We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can be collapsed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3212fa14 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop the _nr from the item helpers Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
74794207 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce item_nr token variant helpers The last remaining place where we have the pattern of item = btrfs_item_nr(slot) <do something with the item> are the token helpers. Handle this by introducing token helpers that will do the btrfs_item_nr() work inside of the helper itself, and then convert all users of the btrfs_item token helpers to the new _nr() variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
437bd07e |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make btrfs_file_extent_inline_item_len take a slot Instead of getting the btrfs_item for this, simply pass in the slot of the item and then use the btrfs_item_size_nr() helper inside of btrfs_file_extent_inline_item_len(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c91666b1 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add btrfs_set_item_*_nr() helpers We have the pattern of item = btrfs_item_nr(slot); btrfs_set_item_*(leaf, item); in a bunch of places in our code. Fix this by adding btrfs_set_item_*_nr() helpers which will do the appropriate work, and replace those calls with btrfs_set_item_*_nr(leaf, slot); Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a163608 |
|
13-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix invalid delayed ref after subvolume creation failure When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4467af88 |
|
25-Oct-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove root argument from btrfs_unlink_inode() The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38732474 |
|
20-Oct-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE It's a common practice to avoid use sizeof(struct btrfs_super_block) (3531), but to use BTRFS_SUPER_INFO_SIZE (4096). The problem is that, sizeof(struct btrfs_super_block) doesn't match BTRFS_SUPER_INFO_SIZE from the very beginning. Furthermore, for all call sites except selftests, we always allocate BTRFS_SUPER_INFO_SIZE space for super block, there isn't any real reason to use the smaller value, and it doesn't really save any space. So let's get rid of such confusing behavior, and unify those two values. This modification also adds a new static_assert() to verify the size, and moves the BTRFS_SUPER_INFO_* macros to the definition of btrfs_super_block for the static_assert(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
84961539 |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a BTRFS_FS_ERROR helper We have a few flags that are inconsistently used to describe the fs in different states of failure. As of 5963ffcaf383 ("btrfs: always abort the transaction if we abort a trans handle") we will always set BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED and ERROR to see if things have gone wrong. Add a helper to check BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to use the helper. The TRANS_ABORTED bit check was added in af7227338135 ("Btrfs: clean up resources during umount after trans is aborted") but is not actually specific. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6aabd858 |
|
27-Sep-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove unused function btrfs_bio_fits_in_stripe() As the last caller in compression.c has been removed, we don't need that function anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f0641656 |
|
23-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: unexport setup_items_for_insert() Since setup_items_for_insert() is not used anymore outside of ctree.c, make it static and remove its prototype from ctree.h. This also requires to move the definition of setup_item_for_insert() from ctree.h to ctree.c and move down btrfs_duplicate_item() so that it's defined after setup_items_for_insert(). Further, since setup_item_for_insert() is used outside ctree.c, rename it to btrfs_setup_item_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 2/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b7ef5f3a |
|
23-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: loop only once over data sizes array when inserting an item batch When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3a3b19b |
|
15-Sep-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename struct btrfs_io_bio to btrfs_bio Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2707a25 |
|
08-Sep-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: add a dedicated data relocation block group Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocated the data to have the same size the source extents had and ensures this by preallocating the extents. But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages. This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location(): 1034 if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) { 1035 ret = -EINVAL; 1036 goto out; 1037 } Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes. This is needed that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation. Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
37f00a6d |
|
08-Sep-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: introduce btrfs_is_data_reloc_root There are several places in our codebase where we check if a root is the root of the data reloc tree and subsequent patches will introduce more. Factor out the check into a small helper function instead of open coding it multiple times. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
afba2bc0 |
|
19-Aug-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: implement active zone tracking Add zone_is_active flag to btrfs_block_group. This flag indicates the underlying zones are all active. Such zone active block groups are tracked by fs_info->active_bg_list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part. They set/clear the bitmap to indicate zone activeness and count the number of zones we can activate left. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() wait for any writes on it and send REQ_OP_ZONE_FINISH to the zone. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1ccc2e8a |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file() Currently btrfs_defrag_file() accepts both "struct inode" and "struct file" as parameter. We can easily grab "struct inode" from "struct file" using file_inode() helper. The reason why we need "struct file" is just to re-use its f_ra. Change this to pass "struct file_ra_state" parameter, so that it's more clear what we really want. Since we're here, also add some comments on the function btrfs_defrag_file(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8481dd80 |
|
17-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: subpage: introduce btrfs_subpage_bitmap_info Currently we use fixed size u16 bitmap for subpage bitmap. This is fine for 4K sectorsize with 64K page size. But for 4K sectorsize and larger page size, the bitmap is too small, while for smaller page size like 16K, u16 bitmaps waste too much space. Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to record the proper bitmap size, and where each bitmap should start at. By this, we can later compact all subpage bitmaps into one u32 bitmap. This patch is the first step. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8dcbc261 |
|
01-Oct-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: unify lookup return value when dir entry is missing btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir entries and both are used during log replay or when updating a log tree during an unlink. However when the dir item does not exists, btrfs_lookup_dir_item() returns NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if the dir item exists but there is no matching entry for a given name or index, both return NULL. This makes the call sites during log replay to be more verbose than necessary and it makes it easy to miss this slight difference. Since we don't need to distinguish between those two cases, make btrfs_lookup_dir_index_item() always return NULL when there is no matching directory entry - either because there isn't any dir entry or because there is one but it does not match the given name and index. Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to 'index' since it is supposed to match an index number, and the name 'objectid' is not very good because it can easily be confused with an inode number (like the inode number a dir entry points to). CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0cad6246 |
|
18-Aug-2021 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: add rcu argument to ->get_acl() callback Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
4d4340c9 |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ff40a91 |
|
29-Jul-2021 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: introduce btrfs_search_backwards function It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14605409 |
|
30-Jun-2021 |
Boris Burkov <boris@bur.io> |
btrfs: initial fsverity support Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
77eea05e |
|
30-Jun-2021 |
Boris Burkov <boris@bur.io> |
btrfs: add ro compat flags to inodes Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
03fe78cc |
|
14-Jul-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc We have been hitting some early ENOSPC issues in production with more recent kernels, and I tracked it down to us simply not flushing delalloc as aggressively as we should be. With tracing I was seeing us failing all tickets with all of the block rsvs at or around 0, with very little pinned space, but still around 120MiB of outstanding bytes_may_used. Upon further investigation I saw that we were flushing around 14 pages per shrink call for delalloc, despite having around 2GiB of delalloc outstanding. Consider the example of a 8 way machine, all CPUs trying to create a file in parallel, which at the time of this commit requires 5 items to do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim size waiting on reservations. Now assume we have 128MiB of delalloc outstanding. With our current math we would set items to 20, and then set to_reclaim to 20 * 256k, or 5MiB. Assuming that we went through this loop all 3 times, for both FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop twice, we'd only flush 60MiB of the 128MiB delalloc space. This could leave a fair bit of delalloc reservations still hanging around by the time we go to ENOSPC out all the remaining tickets. Fix this two ways. First, change the calculations to be a fraction of the total delalloc bytes on the system. Prior to this change we were calculating based on dirty inodes so our math made more sense, now it's just completely unrelated to what we're actually doing. Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've gone through the flush states at least once. This will empty the system of all delalloc so we're sure to be truly out of space when we start failing tickets. I'm tagging stable 5.10 and forward, because this is where we started using the page stuff heavily again. This affects earlier kernel versions as well, but would be a pain to backport to them as the flushing mechanisms aren't the same. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
809d6902 |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: make btrfs_next_leaf static inline btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it to header to avoid the function call overhead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
25c1252a |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered The uptodate parameter should be bool, change the type. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a129ffb8 |
|
26-Jul-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() Since commit d75855b4518b ("btrfs: Remove extent_io_ops::writepage_start_hook") removes the writepage_start_hook() and adds btrfs_writepage_cow_fixup() function, there is no need to follow the old hook parameters. Remove the @start and @end hook, since currently the fixup check is full page check, it doesn't need @start and @end hook. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a80d1c6 |
|
28-Jun-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: remove max_zone_append_size logic There used to be a patch in the original series for zoned support which limited the extent size to max_zone_append_size, but this patch has been dropped somewhere around v9. We've decided to go the opposite direction, instead of limiting extents in the first place we split them before submission to comply with the device's limits. Remove the related code, btrfs_fs_info::max_zone_append_size and btrfs_zoned_device_info::max_zone_append_size. This also removes the workaround for dm-crypt introduced in 1d68128c107a ("btrfs: zoned: fail mount if the device does not support zone append") because the fix has been merged as f34ee1dce642 ("dm crypt: Fix zoned block device support"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
629e33a1 |
|
22-Jun-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove unused btrfs_fs_info::total_pinned This got added 14 years ago in 324ae4df00fd ("Btrfs: Add block group pinned accounting back") but it was not ever used. Subsequently its usage got gradually removed in 8790d502e440 ("Btrfs: Add support for mirroring across drives") and 11833d66be94 ("Btrfs: improve async block group caching"). Let's remove it for good! Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c416a30c |
|
22-Jun-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rip out may_commit_transaction may_commit_transaction was introduced before the ticketing infrastructure existed. There was a problem where we'd legitimately be out of space, but every reservation would trigger a transaction commit and then fail. Thus if you had 1000 things trying to make a reservation, they'd all do the flushing loop and thus commit the transaction 1000 times before they'd get their ENOSPC. This helper was introduced to short circuit this, if there wasn't space that could be reclaimed by committing the transaction then simply ENOSPC out. This made true ENOSPC tests much faster as we didn't waste a bunch of time. However many of our bugs over the years have been from cases where we didn't account for some space that would be reclaimed by committing a transaction. The delayed refs rsv space, delayed rsv, many pinned bytes miscalculations, etc. And in the meantime the original problem has been solved with ticketing. We no longer will commit the transaction 1000 times. Instead we'll get 1000 waiters, we will go through the flushing mechanisms, and if there's no progress after 2 loops we ENOSPC everybody out. The ticketing infrastructure gives us a deterministic way to see if we're making progress or not, thus we avoid a lot of extra work. So simplify this step by simply unconditionally committing the transaction. This removes what is arguably our most common source of early ENOSPC bugs and will allow us to drastically simplify many of the things we track because we simply won't need them with this stuff gone. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cea5cf0 |
|
21-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: ensure relocation never runs while we have send operations running Relocation and send do not play well together because while send is running a block group can be relocated, a transaction committed and the respective disk extents get re-allocated and written to or discarded while send is about to do something with the extents. This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"), which prevented balance and send from running in parallel but it did not address one remaining case where chunk relocation can happen: shrinking a device (and device deletion which shrinks a device's size to 0 before deleting the device). We also have now one more case where relocation is triggered: on zoned filesystems partially used block groups get relocated by a background thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically reclaim zones"). So make sure that instead of preventing balance from running when there are ongoing send operations, we prevent relocation from happening. This uses the infrastructure recently added by a patch that has the subject: "btrfs: add cancellable chunk relocation support". Also it adds a spinlock used exclusively for the exclusivity between send and relocation, as before fs_info->balance_mutex was used, which would make an attempt to run send to block waiting for balance to finish, which can take a lot of time on large filesystems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cbeaae4f |
|
18-Jun-2021 |
David Sterba <dsterba@suse.com> |
btrfs: shorten integrity checker extent data mount option Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the mount option name. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ccd9395b |
|
18-Jun-2021 |
David Sterba <dsterba@suse.com> |
btrfs: switch mount option bits to enums and use wider type Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are recorded in the debugging information for convenience). There are two more things done but separating them would not make much sense as it's touching the same lines: - Renumber shifts 18..31 to 17..30 to get rid of the hole in the sequence. - Use 1UL as the value that gets shifted because we're approaching the 32bit limit and due to integer promotions the value of (1 << 31) becomes 0xffffffff80000000 when cast to unsigned long (eg. the option manipulating helpers). This is not causing any problems yet as the operations are in-memory and masking the 31st bit works, we don't have more than 31 bits so the ill effects of not masking higher bits don't happen. But once we have more, the problems will emerge. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a9fd417 |
|
21-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: fix typos in comments Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d2a91064 |
|
31-May-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: make btrfs_set_range_writeback() subpage compatible Function btrfs_set_range_writeback() currently just sets the page writeback unconditionally. Change it to call the subpage helper so that we can handle both cases well. Since the subpage helpers needs btrfs_fs_info, also change the parameter to accept btrfs_inode. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f57ad937 |
|
07-Apr-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename PagePrivate2 to PageOrdered inside btrfs Inside btrfs we use Private2 page status to indicate we have an ordered extent with pending IO for the sector. But the page status name, Private2, tells us nothing about the bit itself, so this patch will rename it to Ordered. And with extra comment about the bit added, so reader who is still uncertain about the page Ordered status, will find the comment pretty easily. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38a39ac7 |
|
08-Apr-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
390ed29b |
|
14-Apr-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: refactor submit_extent_page() to make bio and its flag tracing easier There is a lot of code inside extent_io.c needs both "struct bio **bio_ret" and "unsigned long prev_bio_flags", along with some parameters like "unsigned long bio_flags". Such strange parameters are here for bio assembly. For example, we have such inode page layout: 0 4K 8K 12K |<-- Extent A-->|<- EB->| Then what we do is: - Page [0, 4K) *bio_ret = NULL So we allocate a new bio to bio_ret, Add page [0, 4K) to *bio_ret. - Page [4K, 8K) *bio_ret != NULL We found this page is continuous to *bio_ret, and if we're not at stripe boundary, we add page [4K, 8K) to *bio_ret. - Page [8K, 12K) *bio_ret != NULL But we found this page is not continuous, so we submit *bio_ret, then allocate a new bio, and add page [8K, 12K) to the new bio. This means we need to record both the bio and its bio_flag, but we record them manually using those strange parameter list, other than encapsulating them into their own structure. So this patch will introduce a new structure, btrfs_bio_ctrl, to record both the bio, and its bio_flags. Also, in above case, for all pages added to the bio, we need to check if the new page crosses stripe boundary. This check itself can be time consuming, and we don't really need to do that for each page. This patch also integrates the stripe boundary check into btrfs_bio_ctrl. When a new bio is allocated, the stripe and ordered extent boundary is also calculated, so no matter how large the bio will be, we only calculate the boundaries once, to save some CPU time. The following functions/structures are affected: - struct extent_page_data Replace its bio pointer with structure btrfs_bio_ctrl (embedded structure, not pointer) - end_write_bio() - flush_write_bio() Just change how bio is fetched - btrfs_bio_add_page() Use pre-calculated boundaries instead of re-calculating them. And use @bio_ctrl to replace @bio and @prev_bio_flags. - calc_bio_boundaries() New function - submit_extent_page() callers - btrfs_do_readpage() callers - contiguous_readpages() callers To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab bio. - btrfs_bio_fits_in_ordered_extent() Removed, as now the ordered extent size limit is done at bio allocation time, no need to check for each page range. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
578bda9e |
|
18-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: introduce try-lock semantics for exclusive op start Add try-lock for exclusive operation start to allow callers to do more checks. The same operation must already be running. The try-lock and unlock must pair and are a substitute for btrfs_exclop_start, thus it must also pair with btrfs_exclop_finish to release the exclop context. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
907d2710 |
|
17-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: add cancellable chunk relocation support Add support code that will allow canceling relocation on the chunk granularity. This is different and independent of balance, that also uses relocation but is a higher level operation and manages it's own state and pause/cancellation requests. Relocation is used for resize (shrink) and device deletion so this will be a common point to implement cancellation for both. The context is entirely in btrfs_relocate_block_group and btrfs_recover_relocation, enclosing one chunk relocation. The status bit is set and unset between the chunks. As relocation can take long, the effects may not be immediate and the request and actual action can slightly race. The fs_info::reloc_cancel_req is only supposed to be increased and does not pair with decrement like fs_info::balance_cancel_req. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d7ed32c |
|
14-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: protect exclusive_operation by super_lock The exclusive operation is now atomically checked and set using bit operations. Switch it to protection by spinlock. The super block lock is not frequently used and adding a new lock seems like an overkill so it should be safe to reuse it. The reason to use spinlock is to enhance the locking context so more checks can be done, eg. allowing the same exclusive operation enter the exclop section and cancel the running one. This will be used for resize and device delete. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49547068 |
|
15-Sep-2020 |
David Sterba <dsterba@suse.com> |
btrfs: document byte swap optimization of root_item::flags accessors Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d7d3165 |
|
24-May-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: don't set the full sync flag when truncation does not touch extents At btrfs_truncate() where we truncate the inode either to the same size or to a smaller size, we always set the full sync flag on the inode. This is needed in case the truncation drops or trims any file extent items that start beyond or cross the new inode size, so that the next fsync drops all inode items from the log and scans again the fs/subvolume tree to find all items that must be logged. However if the truncation does not drop or trims any file extent items, we do not need to set the full sync flag and force the next fsync to use the slow code path. So do not set the full sync flag in such cases. One use case where it is frequent to do truncations that do not change the inode size and do not drop any extents (no prealloc extents beyond i_size) is when running Microsoft's SQL Server inside a Docker container. One example workload is the one Philipp Fent reported recently, in the thread with a link below. In this workload a large number of fsyncs are preceded by such truncate operations. After this change I constantly get the runtime for that workload from Philipp to be reduced by about -12%, for example from 184 seconds down to 162 seconds. Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
08508fea |
|
02-May-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: make btrfs_verify_data_csum() to return a bitmap This will provide the basis for later per-sector repair for subpage, while still keeping the existing code happy. As if all csums match, the return value will be 0, same as now. Only when csum mismatches, the return value is different. The new return value will be a bitmap, for 4K sectorsize and 4K page size, it will be either 1, instead of the -EIO (which is not used directly by the callers, no effective change). But for 4K sectorsize and 64K page size, aka subpage case, since the bvec can contain multiple sectors, knowing which sectors are corrupted will allow us to submit repair only for corrupted sectors. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f9baa501 |
|
21-Apr-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when cloning inline extents and using qgroups There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97fc2977 |
|
07-Apr-2021 |
Miklos Szeredi <mszeredi@redhat.com> |
btrfs: convert to fileattr Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: David Sterba <dsterba@suse.com>
|
#
18bb8bbf |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: automatically reclaim zones When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then on a transaction commit, the reclaim process is triggered but after we deleted unused block groups, which will free space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f3372065 |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock As a preparation for extending the block group deletion use case, rename the unused_bgs_mutex to reclaim_bgs_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9306ad4 |
|
24-Feb-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: more graceful errors/warnings on 32bit systems when reaching limits Btrfs uses internally mapped u64 address space for all its metadata. Due to the page cache limit on 32bit systems, btrfs can't access metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K page size while 256T for 64K page size. Users can have a filesystem which doesn't have metadata beyond the boundary at mount time, but later balance can cause it to create metadata beyond the boundary. And modification to MM layer is unrealistic just for such minor use case. We can't do more than to prevent mounting such filesystem or warn early when the numbers are still within the limits. To address such problem, this patch will introduce the following checks: - Mount time rejection This will reject any fs which has metadata chunk at or beyond the boundary. - Mount time early warning If there is any metadata chunk beyond 5/8th of the boundary, we do an early warning and hope the end user will see it. - Runtime extent buffer rejection If we're going to allocate an extent buffer at or beyond the boundary, reject such request with EOVERFLOW. This is definitely going to cause problems like transaction abort, but we have no better ways. - Runtime extent buffer early warning If an extent buffer beyond 5/8th of the max file size is allocated, do an early warning. Above error/warning message will only be printed once for each fs to reduce dmesg flood. If the mount is rejected, the filesystem will be mountable only on a 64bit host. Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/ Reported-by: Erik Jensen <erikjensen@rkjnsn.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ace75066 |
|
31-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: improve btree readahead for full send operations Currently a full send operation uses the standard btree readahead when iterating over the subvolume/snapshot btree, which despite bringing good performance benefits, it could be improved in a few aspects for use cases such as full send operations, which are guaranteed to visit every node and leaf of a btree, in ascending and sequential order. The limitations of that standard btree readahead implementation are the following: 1) It only triggers readahead for leaves that are physically close to the leaf being read, within a 64K range; 2) It only triggers readahead for the next or previous leaves if the leaf being read is not currently in memory; 3) It never triggers readahead for nodes. So add a new readahead mode that addresses all these points and use it for full send operations. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The durations of the full send operation in seconds were the following: Before this change: 217 seconds After this change: 205 seconds (-5.7%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc03f39e |
|
11-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use a bit to track the existence of tree mod log users The tree modification log functions are called very frequently, basically they are called every time a btree is modified (a pointer added or removed to a node, a new root for a btree is set, etc). Because of that, to avoid heavy lock contention on the lock that protects the list of tree mod log users, we have checks that test the emptiness of the list with a full memory barrier before the checks, so that when there are no tree mod log users we avoid taking the lock. Replace the memory barrier and list emptiness check with a test for a new bit set at fs_info->flags. This bit is used to indicate when there are tree mod log users, set whenever a user is added to the list and cleared when the last user is removed from the list. This makes the intention a bit more obvious and possibly more efficient (assuming test_bit() may be cheaper than a full memory barrier on some architectures). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f3a84ccd |
|
11-Mar-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move the tree mod log code into its own file The tree modification log, which records modifications done to btrees, is quite large and currently spread all over ctree.c, which is a huge file already. To make things better organized, move all that code into its own separate source and header files. Functions and definitions that are used outside of the module (mostly by ctree.c) are renamed so that they start with a "btrfs_" prefix. Everything else remains unchanged. This makes it easier to go over the tree modification log code every time I need to go read it to fix a bug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cea62800 |
|
16-Mar-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: remove duplicated in_range() macro The in_range() macro is defined twice in btrfs' source, once in ctree.h and once in misc.h. Remove the definition in ctree.h and include misc.h in the files depending on it. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8318ba79 |
|
10-Feb-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a i_mmap_lock to our inode We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add a i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode, the sizes are as follows no lockdep: before: 1120 (3 per 4k page) after: 1160 (3 per 4k page) lockdep: before: 2072 (1 per 4k page) after: 2224 (1 per 4k page) We're slightly larger but it doesn't change how many objects we can fit per page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5e295768 |
|
03-Mar-2021 |
Goldwyn Rodrigues <rgoldwyn@suse.de> |
btrfs: remove mirror argument from btrfs_csum_verify_data() The parameter mirror is not used and does not make sense for checksum verification of the given bio. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05947ae1 |
|
10-Feb-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: unexport btrfs_extent_readonly() and make it static btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So move it from extent-tree.c to inode.c and declare it as static. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bfc78479 |
|
17-Feb-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_replace_file_extents take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
195a49ea |
|
04-Feb-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between writes to swap files and scrub When we active a swap file, at btrfs_swap_activate(), we acquire the exclusive operation lock to prevent the physical location of the swap file extents to be changed by operations such as balance and device replace/resize/remove. We also call there can_nocow_extent() which, among other things, checks if the block group of a swap file extent is currently RO, and if it is we can not use the extent, since a write into it would result in COWing the extent. However we have no protection against a scrub operation running after we activate the swap file, which can result in the swap file extents to be COWed while the scrub is running and operating on the respective block group, because scrub turns a block group into RO before it processes it and then back again to RW mode after processing it. That means an attempt to write into a swap file extent while scrub is processing the respective block group, will result in COWing the extent, changing its physical location on disk. Fix this by making sure that block groups that have extents that are used by active swap files can not be turned into RO mode, therefore making it not possible for a scrub to turn them into RO mode. When a scrub finds a block group that can not be turned to RO due to the existence of extents used by swap files, it proceeds to the next block group and logs a warning message that mentions the block group was skipped due to active swap files - this is the same approach we currently use for balance. Fixes: ed46ff3d42378 ("Btrfs: support swap files") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
549c7297 |
|
21-Jan-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
fs: make helpers idmap mount aware Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
#
9d294a68 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: enable to mount ZONED incompat flag This final patch adds the ZONED incompat flag to the supported flags and enables to mount ZONED flagged file system. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
40ab3be1 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group This is the 1/3 patch to enable tree log on zoned filesystems. The tree-log feature does not work on a zoned filesystem as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks and btrfs writes and syncs the tree-log blocks to devices at the time of fsync(), which has a different timing than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes that zoned filesystems must avoid. Introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks can be separate write streams. As a result, each write stream can now be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group and assigns "treelog_bg" on-demand on tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0bc09ca1 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: serialize metadata IO We cannot use zone append for writing metadata, because the B-tree nodes have references to each other using logical address. Without knowing the address in advance, we cannot construct the tree in the first place. So we need to serialize write IOs for metadata. We cannot add a mutex around allocation and submission because metadata blocks are allocated in an earlier stage to build up B-trees. Add a zoned_meta_io_lock and hold it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this adds a per-block group metadata IO submission pointer "meta_write_pointer" to ensure sequential writing, which can break when attempting to write back blocks in an unfinished transaction. If the writing out failed because of a hole and the write out is for data integrity (WB_SYNC_ALL), it returns EAGAIN. A caller like fsync() code should handle this properly e.g. by falling back to a full transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cacb2cea |
|
04-Feb-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: check if bio spans across an ordered extent To ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Ensure that constructing bio does not span more than an ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
576fa348 |
|
09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: improve preemptive background space flushing Currently if we ever have to flush space because we do not have enough we allocate a ticket and attach it to the space_info, and then systematically flush things in the filesystem that hold space reservations until our space is reclaimed. However this has a latency cost, we must go to sleep and wait for the flushing to make progress before we are woken up and allowed to continue doing our work. In order to address that we used to kick off the async worker to flush space preemptively, so that we could be reclaiming space hopefully before any tasks needed to stop and wait for space to reclaim. When I introduced the ticketed ENOSPC stuff this broke slightly in the fact that we were using tickets to indicate if we were done flushing. No tickets, no more flushing. However this meant that we essentially never preemptively flushed. This caused a write performance regression that Nikolay noticed in an unrelated patch that removed the committing of the transaction during btrfs_end_transaction. The behavior that happened pre that patch was btrfs_end_transaction() would see that we were low on space, and it would commit the transaction. This was bad because in this particular case you could end up with thousands and thousands of transactions being committed during the 5 minute reproducer. With the patch to remove this behavior we got much more sane transaction commits, but we ended up slower because we would write for a while, flush, write for a while, flush again. To address this we need to reinstate a preemptive flushing mechanism. However it is distinctly different from our ticketing flushing in that it doesn't have tickets to base it's decisions on. Instead of bolting this logic into our existing flushing work, add another worker to handle this preemptive flushing. Here we will attempt to be slightly intelligent about the things that we flushing, attempting to balance between whichever pool is taking up the most space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f00c42dd |
|
09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce a FORCE_COMMIT_TRANS flush operation Solely for preemptive flushing, we want to be able to force the transaction commit without any of the ambiguity of may_commit_transaction(). This is because may_commit_transaction() checks tickets and such, and in preemptive flushing we already know it'll be helpful, so use this to keep the code nice and clean and straightforward. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5deb17e1 |
|
09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: track ordered bytes instead of just dio ordered bytes We track dio_bytes because the shrink delalloc code needs to know if we have more DIO in flight than we have normal buffered IO. The reason for this is because we can't "flush" DIO, we have to just wait on the ordered extents to finish. However this is true of all ordered extents. If we have more ordered space outstanding than dirty pages we should be waiting on ordered extents. We already are ok on this front technically, because we always do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in the preemptive flushing code as well, so change this to count all ordered bytes instead of just DIO ordered bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9db4dc24 |
|
10-Jan-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_start_delalloc_root's nr argument a long It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type). Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69948022 |
|
07-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove new_dirid argument from btrfs_create_subvol_root It's no longer used. While at it also remove new_dirid in create_subvol as it's used in a single place and open code it. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b8fad57 |
|
07-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_root::highest_objectid to free_objectid This reflects the true purpose of the member as it's being used solely in context where a new objectid is being allocated. Future changes will also change the way it's being used to closely follow this semantics. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f96e402 |
|
15-Jan-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix possible free space tree corruption with online conversion While running btrfs/011 in a loop I would often ASSERT() while trying to add a new free space entry that already existed, or get an EEXIST while adding a new block to the extent tree, which is another indication of double allocation. This occurs because when we do the free space tree population, we create the new root and then populate the tree and commit the transaction. The problem is when you create a new root, the root node and commit root node are the same. During this initial transaction commit we will run all of the delayed refs that were paused during the free space tree generation, and thus begin to cache block groups. While caching block groups the caching thread will be reading from the main root for the free space tree, so as we make allocations we'll be changing the free space tree, which can cause us to add the same range twice which results in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a variety of different errors when running delayed refs because of a double allocation. Fix this by marking the fs_info as unsafe to load the free space tree, and fall back on the old slow method. We could be smarter than this, for example caching the block group while we're populating the free space tree, but since this is a serious problem I've opted for the simplest solution. CC: stable@vger.kernel.org # 4.9+ Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0a1db70 |
|
14-Dec-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between RO remount and the cleaner task When we are remounting a filesystem in RO mode we can race with the cleaner task and result in leaking a transaction if the filesystem is unmounted shortly after, before the transaction kthread had a chance to commit that transaction. That also results in a crash during unmount, due to a use-after-free, if hardware acceleration is not available for crc32c. The following sequence of steps explains how the race happens. 1) The filesystem is mounted in RW mode and the cleaner task is running. This means that currently BTRFS_FS_CLEANER_RUNNING is set at fs_info->flags; 2) The cleaner task is currently running delayed iputs for example; 3) A filesystem RO remount operation starts; 4) The RO remount task calls btrfs_commit_super(), which commits any currently open transaction, and it finishes; 5) At this point the cleaner task is still running and it creates a new transaction by doing one of the following things: * When running the delayed iput() for an inode with a 0 link count, in which case at btrfs_evict_inode() we start a transaction through the call to evict_refill_and_join(), use it and then release its handle through btrfs_end_transaction(); * When deleting a dead root through btrfs_clean_one_deleted_snapshot(), a transaction is started at btrfs_drop_snapshot() and then its handle is released through a call to btrfs_end_transaction_throttle(); * When the remount task was still running, and before the remount task called btrfs_delete_unused_bgs(), the cleaner task also called btrfs_delete_unused_bgs() and it picked and removed one block group from the list of unused block groups. Before the cleaner task started a transaction, through btrfs_start_trans_remove_block_group() at btrfs_delete_unused_bgs(), the remount task had already called btrfs_commit_super(); 6) So at this point the filesystem is in RO mode and we have an open transaction that was started by the cleaner task; 7) Shortly after a filesystem unmount operation starts. At close_ctree() we stop the transaction kthread before it had a chance to commit the transaction, since less than 30 seconds (the default commit interval) have elapsed since the last transaction was committed; 8) We end up calling iput() against the btree inode at close_ctree() while there is an open transaction, and since that transaction was used to update btrees by the cleaner, we have dirty pages in the btree inode due to COW operations on metadata extents, and therefore writeback is triggered for the btree inode. So btree_write_cache_pages() is invoked to flush those dirty pages during the final iput() on the btree inode. This results in creating a bio and submitting it, which makes us end up at btrfs_submit_metadata_bio(); 9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch that calls btrfs_wq_submit_bio(), because check_async_write() returned a value of 1. This value of 1 is because we did not have hardware acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not set in fs_info->flags; 10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the workqueue at fs_info->workers, which was already freed before by the call to btrfs_stop_all_workers() at close_ctree(). This results in an invalid memory access due to a use-after-free, leading to a crash. When this happens, before the crash there are several warnings triggered, since we have reserved metadata space in a block group, the delayed refs reservation, etc: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs] Code: f0 01 00 00 48 39 c2 75 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x17f/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 48 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c6 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Code: 48 83 bb b0 03 00 00 00 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x24c/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c7 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Code: ad de 49 be 22 01 00 (...) RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c8 ]--- BTRFS info (device sdc): space_info 4 has 268238848 free, is not full BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0 And the crash, which only happens when we do not have crc32c hardware acceleration, produces the following trace immediately after those warnings: stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs] Code: 54 55 53 48 89 f3 (...) RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000 FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_wq_submit_bio+0xb3/0xd0 [btrfs] btrfs_submit_metadata_bio+0x44/0xc0 [btrfs] submit_one_bio+0x61/0x70 [btrfs] btree_write_cache_pages+0x414/0x450 [btrfs] ? kobject_put+0x9a/0x1d0 ? trace_hardirqs_on+0x1b/0xf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? free_debug_processing+0x1e1/0x2b0 do_writepages+0x43/0xe0 ? lock_acquired+0x199/0x490 __writeback_single_inode+0x59/0x650 writeback_single_inode+0xaf/0x120 write_inode_now+0x94/0xd0 iput+0x187/0x2b0 close_ctree+0x2c6/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f3cfebabee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60 Modules linked in: btrfs dm_snapshot dm_thin_pool (...) ---[ end trace dd74718fef1ed5cc ]--- Finally when we remove the btrfs module (rmmod btrfs), there are several warnings about objects that were allocated from our slabs but were never freed, consequence of the transaction that was never committed and got leaked: ============================================================================= BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x0000000050cbdd61 @offset=12104 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x0000000086e9b0ff @offset=12776 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs] commit_cowonly_roots+0x248/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 0b (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000001a340018 @offset=4408 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_commit_transaction+0x60/0xc40 [btrfs] create_subvol+0x56a/0x990 [btrfs] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] btrfs_ioctl+0x1a92/0x36f0 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x000000002b46292a @offset=13648 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? __mutex_unlock_slowpath+0x45/0x2a0 kmem_cache_destroy+0x55/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000004cf95ea8 @offset=6264 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1 So fix this by making the remount path to wait for the cleaner task before calling btrfs_commit_super(). The remount path now waits for the bit BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling btrfs_commit_super() and this ensures the cleaner can not start a transaction after that, because it sleeps when the filesystem is in RO mode and we have already flagged the filesystem as RO before waiting for BTRFS_FS_CLEANER_RUNNING to be cleared. This also introduces a new flag BTRFS_FS_STATE_RO to be used for fs_info->fs_state when the filesystem is in RO mode. This is because we were doing the RO check using the flags of the superblock and setting the RO mode simply by ORing into the superblock's flags - those operations are not atomic and could result in the cleaner not seeing the update from the remount task after it clears BTRFS_FS_CLEANER_RUNNING. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a664971 |
|
01-Dec-2020 |
ethanwu <ethanwu@synology.com> |
btrfs: correctly calculate item size used when item key collision happens Item key collision is allowed for some item types, like dir item and inode refs, but the overall item size is limited by the nodesize. item size(ins_len) passed from btrfs_insert_empty_items to btrfs_search_slot already contains size of btrfs_item. When btrfs_search_slot reaches leaf, we'll see if we need to split leaf. The check incorrectly reports that split leaf is required, because it treats the space required by the newly inserted item as btrfs_item + item data. But in item key collision case, only item data is actually needed, the newly inserted item could merge into the existing one. No new btrfs_item will be inserted. And split_leaf return EOVERFLOW from following code: if (extend && data_size + btrfs_item_size_nr(l, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info)) return -EOVERFLOW; In most cases, when callers receive EOVERFLOW, they either return this error or handle in different ways. For example, in normal dir item creation the userspace will get errno EOVERFLOW; in inode ref case INODE_EXTREF is used instead. However, this is not the case for rename. To avoid the unrecoverable situation in rename, btrfs_check_dir_item_collision is called in early phase of rename. In this function, when item key collision is detected leaf space is checked: data_size = sizeof(*di) + name_len; if (data_size + btrfs_item_size_nr(leaf, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info)) the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here refers to existing item size, the condition here correctly calculates the needed size for collision case rather than the wrong case above. The consequence of inconsistent condition check between btrfs_check_dir_item_collision and btrfs_search_slot when item key collision happens is that we might pass check here but fail later at btrfs_search_slot. Rename fails and volume is forced readonly [436149.586170] ------------[ cut here ]------------ [436149.586173] BTRFS: Transaction aborted (error -75) [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1 [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286 [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006 [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0 [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472 [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08 [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe [436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000 [436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0 [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [436149.586296] Call Trace: [436149.586311] vfs_rename+0x383/0x920 [436149.586313] ? vfs_rename+0x383/0x920 [436149.586315] do_renameat2+0x4ca/0x590 [436149.586317] __x64_sys_rename+0x20/0x30 [436149.586324] do_syscall_64+0x5a/0x120 [436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [436149.586332] RIP: 0033:0x7fa3133b1d37 [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37 [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0 [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0 [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00 Thanks to Hans van Kranenburg for information about crc32 hash collision tools, I was able to reproduce the dir item collision with following python script. https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run it under a btrfs volume will trigger the abort transaction. It simply creates files and rename them to forged names that leads to hash collision. There are two ways to fix this. One is to simply revert the patch 878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the condition consistent although that patch is correct about the size. The other way is to handle the leaf space check correctly when collision happens. I prefer the second one since it correct leaf space check in collision case. This fix will not account sizeof(struct btrfs_item) when the item already exists. There are two places where ins_len doesn't contain sizeof(struct btrfs_item), however. 1. extent-tree.c: lookup_inline_extent_backref 2. file-item.c: btrfs_csum_file_blocks to make the logic of btrfs_search_slot more clear, we add a flag search_for_extension in btrfs_path. This flag indicates that ins_len passed to btrfs_search_slot doesn't contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot will use the actual size needed to calculate the required leaf space. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d45f221 |
|
02-Dec-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when cloning inline extent and low on free metadata space When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the target range (e.g. when the target range starts at an offset greater than zero). In such cases we copy the inline extent's data into a page of the destination inode and then dirty that page. However, after that we will need to start a transaction for each processed extent and, if we are ever low on available metadata space, we may need to flush existing delalloc for all dirty inodes in an attempt to release metadata space - if that happens we may deadlock: * the async reclaim task queued a delalloc work to flush delalloc for the destination inode of the clone operation; * the task executing that delalloc work gets blocked waiting for the range with the dirty page to be unlocked, which is currently locked by the task doing the clone operation; * the async reclaim task blocks waiting for the delalloc work to complete; * the cloning task is waiting on the waitqueue of its reservation ticket while holding the range with the dirty page locked in the inode's io_tree; * if metadata space is not released by some other task (like delalloc for some other inode completing for example), the clone task waits forever and as a consequence the delalloc work and async reclaim tasks will hang forever as well. Releasing more space on the other hand may require starting a transaction, which will hang as well when trying to reserve metadata space, resulting in a deadlock between all these tasks. When this happens, traces like the following show up in dmesg/syslog: [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [87452.326136] Call Trace: [87452.326737] __schedule+0x5d1/0xcf0 [87452.327390] schedule+0x45/0xe0 [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs] [87452.328894] ? finish_wait+0x90/0x90 [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs] [87452.330133] ? __mod_memcg_state+0x8e/0x160 [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs] [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs] [87452.332007] ? lock_release+0x20e/0x4c0 [87452.332557] ? trace_hardirqs_on+0x1b/0xf0 [87452.333127] extent_writepages+0x43/0x90 [btrfs] [87452.333653] ? lock_acquire+0x1a3/0x490 [87452.334177] do_writepages+0x43/0xe0 [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100 [87452.335720] __filemap_fdatawrite_range+0xc5/0x100 [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs] [87452.337838] process_one_work+0x24e/0x5e0 [87452.338437] worker_thread+0x50/0x3b0 [87452.339137] ? process_one_work+0x5e0/0x5e0 [87452.339884] kthread+0x153/0x170 [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.341153] ret_from_fork+0x22/0x30 [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [87452.345655] Call Trace: [87452.346305] __schedule+0x5d1/0xcf0 [87452.346947] ? kvm_clock_read+0x14/0x30 [87452.347676] ? wait_for_completion+0x81/0x110 [87452.348389] schedule+0x45/0xe0 [87452.349077] schedule_timeout+0x30c/0x580 [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [87452.350340] ? lock_acquire+0x1a3/0x490 [87452.351006] ? try_to_wake_up+0x7a/0xa20 [87452.351541] ? lock_release+0x20e/0x4c0 [87452.352040] ? lock_acquired+0x199/0x490 [87452.352517] ? wait_for_completion+0x81/0x110 [87452.353000] wait_for_completion+0xab/0x110 [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs] [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] [87452.354455] flush_space+0x24f/0x660 [btrfs] [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] [87452.355565] process_one_work+0x24e/0x5e0 [87452.356024] worker_thread+0x20f/0x3b0 [87452.356487] ? process_one_work+0x5e0/0x5e0 [87452.356973] kthread+0x153/0x170 [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.357880] ret_from_fork+0x22/0x30 (...) < stack traces of several tasks waiting for the locks of the inodes of the clone operation > (...) [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97 [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960 [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000 [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000 [92867.447920] Call Trace: [92867.448435] __schedule+0x5d1/0xcf0 [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [92867.449423] schedule+0x45/0xe0 [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs] [92867.450576] ? finish_wait+0x90/0x90 [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs] [92867.452412] start_transaction+0x2d1/0x760 [btrfs] [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs] [92867.453848] ? lock_release+0x20e/0x4c0 [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs] [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs] [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs] [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs] [92867.457213] do_clone_file_range+0xd4/0x1f0 [92867.457828] vfs_clone_file_range+0x4d/0x230 [92867.458355] ? lock_release+0x20e/0x4c0 [92867.458890] ioctl_file_clone+0x8f/0xc0 [92867.459377] do_vfs_ioctl+0x342/0x750 [92867.459913] __x64_sys_ioctl+0x62/0xb0 [92867.460377] do_syscall_64+0x33/0x80 [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (...) < stack traces of more tasks blocked on metadata reservation like the clone task above, because the async reclaim task has deadlocked > (...) Another thing to notice is that the worker task that is deadlocked when trying to flush the destination inode of the clone operation is at btrfs_invalidatepage(). This is simply because the clone operation has a destination offset greater than the i_size and we only update the i_size of the destination file after cloning an extent (just like we do in the buffered write path). Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger the flushing of delalloc for all inodes that have delalloc, add a runtime flag to an inode to signal it should not be flushed, and for inodes with that flag set, start_delalloc_inodes() will simply skip them. When the cloning code needs to dirty a page to copy an inline extent, set that flag on the inode and then clear it when the clone operation finishes. This could be sporadically triggered with test case generic/269 from fstests, which exercises many fsstress processes running in parallel with several dd processes filling up the entire filesystem. CC: stable@vger.kernel.org # 5.9+ Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6275193e |
|
01-Dec-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs Refactor btrfs_lookup_bio_sums() by: - Remove the @file_offset parameter There are two factors making the @file_offset parameter useless: * For csum lookup in csum tree, file offset makes no sense We only need disk_bytenr, which is unrelated to file_offset * page_offset (file offset) of each bvec is not contiguous. Pages can be added to the same bio as long as their on-disk bytenr is contiguous, meaning we could have pages at different file offsets in the same bio. Thus passing file_offset makes no sense any more. The only user of file_offset is for data reloc inode, we will use a new function, search_file_offset_in_bio(), to handle it. - Extract the csum tree lookup into search_csum_tree() The new function will handle the csum search in csum tree. The return value is the same as btrfs_find_ordered_sum(), returning the number of found sectors which have checksum. - Change how we do the main loop The only needed info from bio is: * the on-disk bytenr * the length After extracting the above info, we can do the search without bio at all, which makes the main loop much simpler: for (cur_disk_bytenr = orig_disk_bytenr; cur_disk_bytenr < orig_disk_bytenr + orig_len; cur_disk_bytenr += count * sectorsize) { /* Lookup csum tree */ count = search_csum_tree(fs_info, path, cur_disk_bytenr, search_len, csum_dst); if (!count) { /* Csum hole handling */ } } - Use single variable as the source to calculate all other offsets Instead of all different type of variables, we use only one main variable, cur_disk_bytenr, which represents the current disk bytenr. All involved values can be calculated from that variable, and all those variable will only be visible in the inner loop. The above refactoring makes btrfs_lookup_bio_sums() way more robust than it used to be, especially related to the file offset lookup. Now file_offset lookup is only related to data reloc inode, otherwise we don't need to bother file_offset at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
884b07d0 |
|
01-Dec-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors To support sectorsize < PAGE_SIZE case, we need to take extra care of extent buffer accessors. Since sectorsize is smaller than PAGE_SIZE, one page can contain multiple tree blocks, we must use eb->start to determine the real offset to read/write for extent buffer accessors. This patch introduces two helpers to do this: - get_eb_page_index() This is to calculate the index to access extent_buffer::pages. It's just a simple wrapper around "start >> PAGE_SHIFT". For sectorsize == PAGE_SIZE case, nothing is changed. For sectorsize < PAGE_SIZE case, we always get index as 0, and the existing page shift also works. - get_eb_offset_in_page() This is to calculate the offset to access extent_buffer::pages. This needs to take extent_buffer::start into consideration. For sectorsize == PAGE_SIZE case, extent_buffer::start is always aligned to PAGE_SIZE, thus adding extent_buffer::start to offset_in_page() won't change the result. For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives us the correct offset to access. This patch will touch the following parts to cover all extent buffer accessors: - BTRFS_SETGET_HEADER_FUNCS() - read_extent_buffer() - read_extent_buffer_to_user() - memcmp_extent_buffer() - write_extent_buffer_chunk_tree_uuid() - write_extent_buffer_fsid() - write_extent_buffer() - memzero_extent_buffer() - copy_extent_buffer_full() - copy_extent_buffer() - memcpy_extent_buffer() - memmove_extent_buffer() - btrfs_get_token_##bits() - btrfs_get_##bits() - btrfs_set_token_##bits() - btrfs_set_##bits() - generic_bin_search() Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
deb67895 |
|
01-Dec-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: calculate inline extent buffer page size based on page size Btrfs only support 64K as maximum node size, thus for 4K page system, we would have at most 16 pages for one extent buffer. For a system using 64K page size, we would really have just one page. While we always use 16 pages for extent_buffer::pages, this means for systems using 64K pages, we are wasting memory for 15 page pointers which will never be used. Calculate the array size based on page size and the node size maximum. - for systems using 4K page size, it will stay 16 pages - for systems using 64K page size, it will be 1 page Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to avoid circular inclusion of ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7ffd27e3 |
|
01-Dec-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: pass bio_offset to check_data_csum() directly Parameter icsum for check_data_csum() is a little hard to understand. So is the phy_offset for btrfs_verify_data_csum(). Both parameters are calculated values for csum lookup. Instead of some calculated value, just pass bio_offset and let the final and only user, check_data_csum(), calculate whatever it needs. Since we are here, also make the bio_offset parameter and some related variables to be u32 (unsigned int). As bio size is limited by its bi_size, which is unsigned int, and has extra size limit check during various bio operations. Thus we are ensured that bio_offset won't overflow u32. Thus for all involved functions, not only rename the parameter from @phy_offset to @bio_offset, but also reduce its width to u32, so we won't have suspicious "u32 = u64 >> sector_bits;" lines anymore. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
94846229 |
|
18-Nov-2020 |
Boris Burkov <boris@bur.io> |
btrfs: keep sb cache_generation consistent with space_cache When mounting, btrfs uses the cache_generation in the super block to determine if space cache v1 is in use. However, by mounting with nospace_cache or space_cache=v2, it is possible to disable space cache v1, which does not result in un-setting cache_generation back to 0. In order to base some logic, like mount option printing in /proc/mounts, on the current state of the space cache rather than just the values of the mount option, keep the value of cache_generation consistent with the status of space cache v1. We ensure that cache_generation > 0 iff the file system is using space_cache v1. This requires committing a transaction on any mount which changes whether we are using v1. (v1->nospace_cache, v1->v2, nospace_cache->v1, v2->v1). Since the mechanism for writing out the cache generation is transaction commit, but we want some finer grained control over when we un-set it, we can't just rely on the SPACE_CACHE mount option, and introduce an fs_info flag that mount can use when it wants to unset the generation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
47876f7c |
|
24-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not block inode logging for so long during transaction commit Early on during a transaction commit we acquire the tree_log_mutex and hold it until after we write the super blocks. But before writing the extent buffers dirtied by the transaction and the super blocks we unblock the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting fs_info->running_transaction to NULL. This means that after that and before writing the super blocks, new transactions can start. However if any transaction wants to log an inode, it will block waiting for the transaction commit to write its dirty extent buffers and the super blocks because the tree_log_mutex is only released after those operations are complete, and starting a new log transaction blocks on that mutex (at start_log_trans()). Writing the dirty extent buffers and the super blocks can take a very significant amount of time to complete, but we could allow the tasks wanting to log an inode to proceed with most of their steps: 1) create the log trees 2) log metadata in the trees 3) write their dirty extent buffers They only need to wait for the previous transaction commit to complete (write its super blocks) before they attempt to write their super blocks, otherwise we could end up with a corrupt filesystem after a crash. So change start_log_trans() to use the root tree's log_mutex to serialize for the creation of the log root tree instead of using the tree_log_mutex, and make btrfs_sync_log() acquire the tree_log_mutex before writing the super blocks. This allows for inode logging to wait much less time when there is a previous transaction that is still committing, often not having to wait at all, as by the time when we try to sync the log the previous transaction already wrote its super blocks. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit The following script that uses dbench was used to measure the impact of the whole patchset: $ cat test-dbench.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/btrfs MOUNT_OPTIONS="-o ssd" echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f -m single -d single $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 300 64 umount $MNT The test was run on a machine with 12 cores, 64G of ram, using a NVMe device and a non-debug kernel configuration (Debian's default). Before patch set: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11277211 0.250 85.340 Close 8283172 0.002 6.479 Rename 477515 1.935 86.026 Unlink 2277936 0.770 87.071 Deltree 256 15.732 81.379 Mkdir 128 0.003 0.009 Qpathinfo 10221180 0.056 44.404 Qfileinfo 1789967 0.002 4.066 Qfsinfo 1874399 0.003 9.176 Sfileinfo 918589 0.061 10.247 Find 3951758 0.341 54.040 WriteX 5616547 0.047 85.079 ReadX 17676028 0.005 9.704 LockX 36704 0.003 1.800 UnlockX 36704 0.002 0.687 Flush 790541 14.115 676.236 Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms After patch set: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 12687926 0.171 86.526 Close 9320780 0.002 8.063 Rename 537253 1.444 78.576 Unlink 2561827 0.559 87.228 Deltree 374 11.499 73.549 Mkdir 187 0.003 0.005 Qpathinfo 11500300 0.061 36.801 Qfileinfo 2017118 0.002 7.189 Qfsinfo 2108641 0.003 4.825 Sfileinfo 1033574 0.008 8.065 Find 4446553 0.408 47.835 WriteX 6335667 0.045 84.388 ReadX 19887312 0.003 9.215 LockX 41312 0.003 1.394 UnlockX 41312 0.002 1.425 Flush 889233 13.014 623.259 Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms +12.7% throughput, -8.2% max latency Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5297199a |
|
26-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove inode number cache feature It's been deprecated since commit b547a88ea577 ("btrfs: start deprecation of mount option inode_cache") which enumerates the reasons. A filesystem that uses the feature (mount -o inode_cache) tracks the inode numbers in bitmaps, that data stay on the filesystem after this patch. The size is roughly 5MiB for 1M inodes [1], which is considered small enough to be left there. Removal of the change can be implemented in btrfs-progs if needed. [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
862931c7 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: introduce max_zone_append_size The zone append write command has a maximum IO size restriction it accepts. This is because a zone append write command cannot be split, as we ask the device to place the data into a specific target zone and the device responds with the actual written location of the data. Introduce max_zone_append_size to zone_info and fs_info to track the value, so we can limit all I/O to a zoned block device that we want to write using the zone append command to the device's limits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b70f5097 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: check and enable ZONED mode Introduce function btrfs_check_zoned_mode() to check if ZONED flag is enabled on the file system and if the file system consists of zoned devices with equal zone size. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
729f7961 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_update_inode_fallback take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b06359a3 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_cont_expand take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
217f42eb |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_truncate_block take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a56fcd1 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_update_inode take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50743398 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_truncate_inode_items take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
76aea537 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f5239dc |
|
06-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove btrfs_path::recurse With my async free space cache loading patches ("btrfs: load free space cache asynchronously") we no longer have a user of path->recurse and can remove it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2766ff61 |
|
04-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update the number of bytes used by an inode atomically There are several occasions where we do not update the inode's number of used bytes atomically, resulting in a concurrent stat(2) syscall to report a value of used blocks that does not correspond to a valid value, that is, a value that does not match neither what we had before the operation nor what we get after the operation completes. In extreme cases it can result in stat(2) reporting zero used blocks, which can cause problems for some userspace tools where they can consider a file with a non-zero size and zero used blocks as completely sparse and skip reading data, as reported/discussed a long time ago in some threads like the following: https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html The cases where this can happen are the following: -> Case 1 If we do a write (buffered or direct IO) against a file region for which there is already an allocated extent (or multiple extents), then we have a short time window where we can report a number of used blocks to stat(2) that does not take into account the file region being overwritten. This short time window happens when completing the ordered extent(s). This happens because when we drop the extents in the write range we decrement the inode's number of bytes and later on when we insert the new extent(s) we increment the number of bytes in the inode, resulting in a short time window where a stat(2) syscall can get an incorrect number of used blocks. If we do writes that overwrite an entire file, then we have a short time window where we report 0 used blocks to stat(2). Example reproducer: $ cat reproducer-1.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null expected=$(stat -c %b $MNT/foobar) # Create a process to keep calling stat(2) on the file and see if the # reported number of blocks used (disk space used) changes, it should # not because we are not increasing the file size nor punching holes. stat_loop $MNT/foobar $expected & loop_pid=$! for ((i = 0; i < 50000; i++)); do xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null done kill $loop_pid &> /dev/null wait umount $DEV $ ./reproducer-1.sh ERROR: unexpected used blocks (got: 0 expected: 128) ERROR: unexpected used blocks (got: 0 expected: 128) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 2 If we do a buffered write against a file region that does not have any allocated extents, like a hole or beyond EOF, then during ordered extent completion we have a short time window where a concurrent stat(2) syscall can report a number of used blocks that does not correspond to the value before or after the write operation, a value that is actually larger than the value after the write completes. This happens because once we start a buffered write into an unallocated file range we increment the inode's 'new_delalloc_bytes', to make sure any stat(2) call gets a correct used blocks value before delalloc is flushed and completes. However at ordered extent completion, after we inserted the new extent, we increment the inode's number of bytes used with the size of the new extent, and only later, when clearing the range in the inode's iotree, we decrement the inode's 'new_delalloc_bytes' counter with the size of the extent. So this results in a short time window where a concurrent stat(2) syscall can report a number of used blocks that accounts for the new extent twice. Example reproducer: $ cat reproducer-2.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foobar write_size=$((64 * 1024)) for ((i = 0; i < 16384; i++)); do offset=$(($i * $write_size)) xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null blocks_used=$(stat -c %b $MNT/foobar) # Fsync the file to trigger writeback and keep calling stat(2) on it # to see if the number of blocks used changes. stat_loop $MNT/foobar $blocks_used & loop_pid=$! xfs_io -c "fsync" $MNT/foobar kill $loop_pid &> /dev/null wait $loop_pid done umount $DEV $ ./reproducer-2.sh ERROR: unexpected used blocks (got: 265472 expected: 265344) ERROR: unexpected used blocks (got: 284032 expected: 283904) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 3 Another case where such problems happen is during other operations that replace extents in a file range with other extents. Those operations are extent cloning, deduplication and fallocate's zero range operation. The cause of the problem is similar to the first case. When we drop the extents from a range, we decrement the inode's number of bytes, and later on, after inserting the new extents we increment it. Since this is not done atomically, a concurrent stat(2) call can see and return a number of used blocks that is smaller than it should be, does not match the number of used blocks before or after the clone/deduplication/zero operation. Like for the first case, when doing a clone, deduplication or zero range operation against an entire file, we end up having a time window where we can report 0 used blocks to a stat(2) call. Example reproducer: $ cat reproducer-3.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f -m reflink=1 $DEV > /dev/null mount $DEV $MNT extent_size=$((64 * 1024)) num_extents=16384 file_size=$(($extent_size * $num_extents)) # File foo has many small extents. xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \ > /dev/null # File bar has much less extents and has exactly the same data as foo. xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null expected=$(stat -c %b $MNT/foo) # Now deduplicate bar into foo. While the deduplication is in progres, # the number of used blocks/file size reported by stat should not change xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null & dedupe_pid=$! while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do used=$(stat -c %b $MNT/foo) if [ $used -ne $expected ]; then echo "Unexpected blocks used: $used (expected: $expected)" fi done umount $DEV $ ./reproducer-3.sh Unexpected blocks used: 2076800 (expected: 2097152) Unexpected blocks used: 2097024 (expected: 2097152) Unexpected blocks used: 2079872 (expected: 2097152) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. So fix this by: 1) Making btrfs_drop_extents() not decrement the VFS inode's number of bytes, and instead return the number of bytes; 2) Making any code that drops extents and adds new extents update the inode's number of bytes atomically, while holding the btrfs inode's spinlock, which is also used by the stat(2) callback to get the inode's number of bytes; 3) For ranges in the inode's iotree that are marked as 'delalloc new', corresponding to previously unallocated ranges, increment the inode's number of bytes when clearing the 'delalloc new' bit from the range, in the same critical section that decrements the inode's 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock. An alternative would be to have btrfs_getattr() wait for any IO (ordered extents in progress) and locking the whole range (0 to (u64)-1) while it it computes the number of blocks used. But that would mean blocking stat(2), which is a very used syscall and expected to be fast, waiting for writes, clone/dedupe, fallocate, page reads, fiemap, etc. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5893dfb9 |
|
04-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: refactor btrfs_drop_extents() to make it easier to extend There are many arguments for __btrfs_drop_extents() and its wrapper btrfs_drop_extents(), which makes it hard to add more arguments to it and requires changing every caller. I have added a couple myself back in 2014 commit 1acae57b161e ("Btrfs: faster file extent item replace operations") and therefore know firsthand that it is a bit cumbersome to add additional arguments to these functions. Since I will need to add more arguments in a subsequent bug fix, this change is preparatory work and adds a data structure that holds all the arguments, for both input and output, that are passed to this function, with some comments in the structure's definition mentioning what each field is and how it relates to other fields. Callers of this function need only to zero out the content of the structure and setup only the fields they need. This also removes the need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so now we have a single function named btrfs_drop_extents() that takes a pointer to this new data structure (struct btrfs_drop_extents_args). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
df903e5d |
|
04-Nov-2020 |
Pavel Begunkov <asml.silence@gmail.com> |
btrfs: don't miss async discards after scheduled work override If btrfs_discard_schedule_work() is called with override=true, it sets delay anew regardless how much time is left until the timer should have fired. If delays are long (that can happen, for example, with low kbps_limit), they might get constantly overridden without having a chance to run the discard work. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6e88f116 |
|
04-Nov-2020 |
Pavel Begunkov <asml.silence@gmail.com> |
btrfs: discard: store async discard delay as ns not as jiffies Most delay calculations are done in ns or ms, so store discard_ctl->delay in ms and convert the final delay to jiffies only at the end. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
478ef886 |
|
21-Oct-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: make buffer_radix take sector size units For subpage sector size support, one page can contain multiple tree blocks. The entries cannot be based on page size and index must be derived from the sectorsize. No change for page size == sector size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
27d56e62 |
|
23-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: update last_byte_to_unpin in switch_commit_roots While writing an explanation for the need of the commit_root_sem for btrfs_prepare_extent_commit, I realized we have a slight hole that could result in leaked space if we have to do the old style caching. Consider the following scenario commit root +----+----+----+----+----+----+----+ |\\\\| |\\\\|\\\\| |\\\\|\\\\| +----+----+----+----+----+----+----+ 0 1 2 3 4 5 6 7 new commit root +----+----+----+----+----+----+----+ | | | |\\\\| | |\\\\| +----+----+----+----+----+----+----+ 0 1 2 3 4 5 6 7 Prior to this patch, we run btrfs_prepare_extent_commit, which updates the last_byte_to_unpin, and then we subsequently run switch_commit_roots. In this example lets assume that caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which means that cache->last_byte_to_unpin == 1. Then we go and do the switch_commit_roots(), but in the meantime the caching thread has made some more progress, because we drop the commit_root_sem and re-acquired it. Now caching_ctl->progress == 3. We swap out the commit root and carry on to unpin. The race can happen like: 1) The caching thread was running using the old commit root when it found the extent for [2, 3); 2) Then it released the commit_root_sem because it was in the last item of a leaf and the semaphore was contended, and set ->progress to 3 (value of 'last'), as the last extent item in the current leaf was for the extent for range [2, 3); 3) Next time it gets the commit_root_sem, will start using the new commit root and search for a key with offset 3, so it never finds the hole for [2, 3). So the caching thread never saw [2, 3) as free space in any of the commit roots, and by the time finish_extent_commit() was called for the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the subrange [0, 1) to the free space cache, skipping [2, 3). In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1), but do not unpin [2,3). However because caching_ctl->progress == 3 we do not see the newly freed section of [2,3), and thus do not add it to our free space cache. This results in us missing a chunk of free space in memory (on disk too, unless we have a power failure before writing the free space cache to disk). Fix this by making sure the ->last_byte_to_unpin is set at the same time that we swap the commit roots, this ensures that we will always be consistent. CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ update changelog with Filipe's review comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b9729ce0 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: locking: rip out path->leave_spinning We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fe5ecbe8 |
|
02-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: precalculate checksums per leaf once btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a candidate for optimizations. After the 64bit division has been replaced by shift, there's still a calculation done each time the function is called: checksums per leaf. As this is a constant value for the entire filesystem lifetime, we can calculate it once at mount time and reuse. This also allows to reduce the division to 64bit/32bit as we know the constant will always fit the 32bit type. Replace the open-coded rounding up with a macro that internally handles the 64bit division and as it's now a short function, make it static inline (slight code increase, slight stack usage reduction). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
22b6331d |
|
02-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: store precalculated csum_size in fs_info In many places we need the checksum size and it is inefficient to read it from the raw superblock. Store the value into fs_info, actual use will be in followup patches. The size is u32 as it allows to generate better assembly than with u16. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
265fdfa6 |
|
01-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits The value of super_block::s_blocksize_bits is the same as fs_info::sectorsize_bits, but we don't need to do the extra dereferences in many functions and storing the bits as u32 (in fs_info) generates shorter assembly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab108d99 |
|
01-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use precalculated sectorsize_bits from fs_info We do a lot of calculations where we divide or multiply by sectorsize. We also know and make sure that sectorsize is a power of two, so this means all divisions can be turned to shifts and avoid eg. expensive u64/u32 divisions. The type is u32 as it's more register friendly on x86_64 compared to u8 and the resulting assembly is smaller (movzbl vs movl). There's also superblock s_blocksize_bits but it's usually one more pointer dereference farther than fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c8422684 |
|
15-Sep-2020 |
David Sterba <dsterba@suse.com> |
btrfs: add set/get accessors for root_item::drop_level The drop_level member is used directly unlike all the other int types in root_item. Add the definition and use it everywhere. The type is u8 so there's no conversion necessary and the helpers are properly inlined, this is for consistency. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ecfdc08b |
|
24-Sep-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: remove dio iomap DSYNC workaround This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC workaround") now that the iomap API has been updated to allow iomap_dio_complete() not to be called under i_rwsem anymore. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a14b78ad |
|
24-Sep-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: introduce btrfs_inode_lock()/unlock() btrfs_inode_lock/unlock() are wrappers around inode locks, separating the type of lock and actual locking. - 0 - default, exclusive lock - BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO - BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence The bits SHARED and TRY can be combined together. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e4cabec |
|
24-Sep-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: split btrfs_direct_IO to read and write The read and write DIO don't have anything in common except for the call to iomap_dio_rw. Extract the write call into a new function to get rid of conditional statements for direct write. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
882dbe0c |
|
16-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce mount option rescue=ignoredatacsums There are cases where you can end up with bad data csums because of misbehaving applications. This happens when an application modifies a buffer in-flight when doing an O_DIRECT write. In order to recover the file we need a way to turn off data checksums so you can copy the file off, and then you can delete the file and restore it properly later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42437a63 |
|
16-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce mount option rescue=ignorebadroots In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aa8c1a41 |
|
14-Oct-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages() Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as opposed to calling set_extent_bits again later. Fold check for written length within the function. Note: EXTENT_NORESERVE is set before unlocking extents. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a855fbe6 |
|
23-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix lockdep splat when enabling and disabling qgroups When running test case btrfs/017 from fstests, lockdep reported the following splat: [ 1297.067385] ====================================================== [ 1297.067708] WARNING: possible circular locking dependency detected [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted [ 1297.068322] ------------------------------------------------------ [ 1297.068629] btrfs/189080 is trying to acquire lock: [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.069274] but task is already holding lock: [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs] [ 1297.070219] which lock already depends on the new lock. [ 1297.071131] the existing dependency chain (in reverse order) is: [ 1297.071721] -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}: [ 1297.072375] lock_acquire+0xd8/0x490 [ 1297.072710] __mutex_lock+0xa3/0xb30 [ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs] [ 1297.073421] create_subvol+0x194/0x990 [btrfs] [ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] [ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] [ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] [ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs] [ 1297.075245] __x64_sys_ioctl+0x83/0xb0 [ 1297.075617] do_syscall_64+0x33/0x80 [ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.076380] -> #0 (sb_internal#2){.+.+}-{0:0}: [ 1297.077166] check_prev_add+0x91/0xc60 [ 1297.077572] __lock_acquire+0x1740/0x3110 [ 1297.077984] lock_acquire+0xd8/0x490 [ 1297.078411] start_transaction+0x3c5/0x760 [btrfs] [ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs] [ 1297.079789] __x64_sys_ioctl+0x83/0xb0 [ 1297.080232] do_syscall_64+0x33/0x80 [ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.081139] other info that might help us debug this: [ 1297.082536] Possible unsafe locking scenario: [ 1297.083510] CPU0 CPU1 [ 1297.084005] ---- ---- [ 1297.084500] lock(&fs_info->qgroup_ioctl_lock); [ 1297.084994] lock(sb_internal#2); [ 1297.085485] lock(&fs_info->qgroup_ioctl_lock); [ 1297.085974] lock(sb_internal#2); [ 1297.086454] *** DEADLOCK *** [ 1297.087880] 3 locks held by btrfs/189080: [ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs] [ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs] [ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs] [ 1297.089771] stack backtrace: [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1 [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1297.092123] Call Trace: [ 1297.092629] dump_stack+0x8d/0xb5 [ 1297.093115] check_noncircular+0xff/0x110 [ 1297.093596] check_prev_add+0x91/0xc60 [ 1297.094076] ? kvm_clock_read+0x14/0x30 [ 1297.094553] ? kvm_sched_clock_read+0x5/0x10 [ 1297.095029] __lock_acquire+0x1740/0x3110 [ 1297.095510] lock_acquire+0xd8/0x490 [ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.096476] start_transaction+0x3c5/0x760 [btrfs] [ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs] [ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs] [ 1297.098904] ? do_user_addr_fault+0x20c/0x430 [ 1297.099382] ? kvm_clock_read+0x14/0x30 [ 1297.099854] ? kvm_sched_clock_read+0x5/0x10 [ 1297.100328] ? sched_clock+0x5/0x10 [ 1297.100801] ? sched_clock_cpu+0x12/0x180 [ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0 [ 1297.101739] __x64_sys_ioctl+0x83/0xb0 [ 1297.102207] do_syscall_64+0x33/0x80 [ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.103148] RIP: 0033:0x7f773ff65d87 This is because during the quota enable ioctl we lock first the mutex qgroup_ioctl_lock and then start a transaction, and starting a transaction acquires a fs freeze semaphore (at the VFS level). However, every other code path, except for the quota disable ioctl path, we do the opposite: we start a transaction and then lock the mutex. So fix this by making the quota enable and disable paths to start the transaction without having the mutex locked, and then, after starting the transaction, lock the mutex and check if some other task already enabled or disabled the quotas, bailing with success if that was the case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e8f147dc |
|
03-Nov-2020 |
Thomas Gleixner <tglx@linutronix.de> |
fs: Remove asm/kmap_types.h includes Historical leftovers from the time where kmap() had fixed slots. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Sterba <dsterba@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/20201103095856.870272797@linutronix.de
|
#
66d204a1 |
|
12-Oct-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix readahead hang and use-after-free after removing a device Very sporadically I had test case btrfs/069 from fstests hanging (for years, it is not a recent regression), with the following traces in dmesg/syslog: [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds. [162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000 [162513.514751] Call Trace: [162513.514761] __schedule+0x5ce/0xd00 [162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.514771] schedule+0x46/0xf0 [162513.514844] wait_current_trans+0xde/0x140 [btrfs] [162513.514850] ? finish_wait+0x90/0x90 [162513.514864] start_transaction+0x37c/0x5f0 [btrfs] [162513.514879] transaction_kthread+0xa4/0x170 [btrfs] [162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs] [162513.514894] kthread+0x153/0x170 [162513.514897] ? kthread_stop+0x2c0/0x2c0 [162513.514902] ret_from_fork+0x22/0x30 [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds. [162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000 [162513.515682] Call Trace: [162513.515688] __schedule+0x5ce/0xd00 [162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.515697] schedule+0x46/0xf0 [162513.515712] wait_current_trans+0xde/0x140 [btrfs] [162513.515716] ? finish_wait+0x90/0x90 [162513.515729] start_transaction+0x37c/0x5f0 [btrfs] [162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs] [162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs] [162513.515758] ? __ia32_sys_fdatasync+0x20/0x20 [162513.515761] iterate_supers+0x87/0xf0 [162513.515765] ksys_sync+0x60/0xb0 [162513.515768] __do_sys_sync+0xa/0x10 [162513.515771] do_syscall_64+0x33/0x80 [162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.515781] RIP: 0033:0x7f5238f50bd7 [162513.515782] Code: Bad RIP value. [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7 [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0 [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340 [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds. [162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000 [162513.516620] Call Trace: [162513.516625] __schedule+0x5ce/0xd00 [162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.516634] schedule+0x46/0xf0 [162513.516647] wait_current_trans+0xde/0x140 [btrfs] [162513.516650] ? finish_wait+0x90/0x90 [162513.516662] start_transaction+0x4d7/0x5f0 [btrfs] [162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs] [162513.516686] __vfs_setxattr+0x66/0x80 [162513.516691] __vfs_setxattr_noperm+0x70/0x200 [162513.516697] vfs_setxattr+0x6b/0x120 [162513.516703] setxattr+0x125/0x240 [162513.516709] ? lock_acquire+0xb1/0x480 [162513.516712] ? mnt_want_write+0x20/0x50 [162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0 [162513.516723] ? preempt_count_add+0x49/0xa0 [162513.516725] ? __sb_start_write+0x19b/0x290 [162513.516727] ? preempt_count_add+0x49/0xa0 [162513.516732] path_setxattr+0xba/0xd0 [162513.516739] __x64_sys_setxattr+0x27/0x30 [162513.516741] do_syscall_64+0x33/0x80 [162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.516745] RIP: 0033:0x7f5238f56d5a [162513.516746] Code: Bad RIP value. [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470 [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700 [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004 [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0 [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds. [162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000 [162513.517780] Call Trace: [162513.517786] __schedule+0x5ce/0xd00 [162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.517796] schedule+0x46/0xf0 [162513.517810] wait_current_trans+0xde/0x140 [btrfs] [162513.517814] ? finish_wait+0x90/0x90 [162513.517829] start_transaction+0x37c/0x5f0 [btrfs] [162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs] [162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs] [162513.517862] ? __ia32_sys_fdatasync+0x20/0x20 [162513.517865] iterate_supers+0x87/0xf0 [162513.517869] ksys_sync+0x60/0xb0 [162513.517872] __do_sys_sync+0xa/0x10 [162513.517875] do_syscall_64+0x33/0x80 [162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.517881] RIP: 0033:0x7f5238f50bd7 [162513.517883] Code: Bad RIP value. [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7 [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053 [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0 [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053 [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340 [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds. [162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000 [162513.519160] Call Trace: [162513.519165] __schedule+0x5ce/0xd00 [162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.519174] schedule+0x46/0xf0 [162513.519190] wait_current_trans+0xde/0x140 [btrfs] [162513.519193] ? finish_wait+0x90/0x90 [162513.519206] start_transaction+0x4d7/0x5f0 [btrfs] [162513.519222] btrfs_create+0x57/0x200 [btrfs] [162513.519230] lookup_open+0x522/0x650 [162513.519246] path_openat+0x2b8/0xa50 [162513.519270] do_filp_open+0x91/0x100 [162513.519275] ? find_held_lock+0x32/0x90 [162513.519280] ? lock_acquired+0x33b/0x470 [162513.519285] ? do_raw_spin_unlock+0x4b/0xc0 [162513.519287] ? _raw_spin_unlock+0x29/0x40 [162513.519295] do_sys_openat2+0x20d/0x2d0 [162513.519300] do_sys_open+0x44/0x80 [162513.519304] do_syscall_64+0x33/0x80 [162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.519309] RIP: 0033:0x7f5238f4a903 [162513.519310] Code: Bad RIP value. [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903 [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470 [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002 [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013 [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620 [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds. [162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002 [162513.520511] Call Trace: [162513.520516] __schedule+0x5ce/0xd00 [162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.520525] schedule+0x46/0xf0 [162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs] [162513.520548] ? finish_wait+0x90/0x90 [162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs] [162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs] [162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs] [162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs] [162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs] [162513.520643] ? do_sigaction+0xf3/0x240 [162513.520645] ? find_held_lock+0x32/0x90 [162513.520648] ? do_sigaction+0xf3/0x240 [162513.520651] ? lock_acquired+0x33b/0x470 [162513.520655] ? _raw_spin_unlock_irq+0x24/0x50 [162513.520657] ? lockdep_hardirqs_on+0x7d/0x100 [162513.520660] ? _raw_spin_unlock_irq+0x35/0x50 [162513.520662] ? do_sigaction+0xf3/0x240 [162513.520671] ? __x64_sys_ioctl+0x83/0xb0 [162513.520672] __x64_sys_ioctl+0x83/0xb0 [162513.520677] do_syscall_64+0x33/0x80 [162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.520681] RIP: 0033:0x7fc3cd307d87 [162513.520682] Code: Bad RIP value. [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87 [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003 [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003 [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001 [162513.520703] Showing all locks held in the system: [162513.520712] 1 lock held by khungtaskd/54: [162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197 [162513.520728] 1 lock held by in:imklog/596: [162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60 [162513.520782] 1 lock held by btrfs-transacti/1356167: [162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs] [162513.520798] 1 lock held by btrfs/1356190: [162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60 [162513.520805] 1 lock held by fsstress/1356184: [162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0 [162513.520811] 3 locks held by fsstress/1356185: [162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50 [162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120 [162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] [162513.520833] 1 lock held by fsstress/1356196: [162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0 [162513.520838] 3 locks held by fsstress/1356197: [162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50 [162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50 [162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] [162513.520858] 2 locks held by btrfs/1356211: [162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs] [162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] This was weird because the stack traces show that a transaction commit, triggered by a device replace operation, is blocking trying to pause any running scrubs but there are no stack traces of blocked tasks doing a scrub. After poking around with drgn, I noticed there was a scrub task that was constantly running and blocking for shorts periods of time: >>> t = find_task(prog, 1356190) >>> prog.stack_trace(t) #0 __schedule+0x5ce/0xcfc #1 schedule+0x46/0xe4 #2 schedule_timeout+0x1df/0x475 #3 btrfs_reada_wait+0xda/0x132 #4 scrub_stripe+0x2a8/0x112f #5 scrub_chunk+0xcd/0x134 #6 scrub_enumerate_chunks+0x29e/0x5ee #7 btrfs_scrub_dev+0x2d5/0x91b #8 btrfs_ioctl+0x7f5/0x36e7 #9 __x64_sys_ioctl+0x83/0xb0 #10 do_syscall_64+0x33/0x77 #11 entry_SYSCALL_64+0x7c/0x156 Which corresponds to: int btrfs_reada_wait(void *handle) { struct reada_control *rc = handle; struct btrfs_fs_info *fs_info = rc->fs_info; while (atomic_read(&rc->elems)) { if (!atomic_read(&fs_info->reada_works_cnt)) reada_start_machine(fs_info); wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0, (HZ + 9) / 10); } (...) So the counter "rc->elems" was set to 1 and never decreased to 0, causing the scrub task to loop forever in that function. Then I used the following script for drgn to check the readahead requests: $ cat dump_reada.py import sys import drgn from drgn import NULL, Object, cast, container_of, execscript, \ reinterpret, sizeof from drgn.helpers.linux import * mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1" mnt = None for mnt in for_each_mount(prog, dst = mnt_path): pass if mnt is None: sys.stderr.write(f'Error: mount point {mnt_path} not found\n') sys.exit(1) fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info) def dump_re(re): nzones = re.nzones.value_() print(f're at {hex(re.value_())}') print(f'\t logical {re.logical.value_()}') print(f'\t refcnt {re.refcnt.value_()}') print(f'\t nzones {nzones}') for i in range(nzones): dev = re.zones[i].device name = dev.name.str.string_() print(f'\t\t dev id {dev.devid.value_()} name {name}') print() for _, e in radix_tree_for_each(fs_info.reada_tree): re = cast('struct reada_extent *', e) dump_re(re) $ drgn dump_reada.py re at 0xffff8f3da9d25ad8 logical 38928384 refcnt 1 nzones 1 dev id 0 name b'/dev/sdd' $ So there was one readahead extent with a single zone corresponding to the source device of that last device replace operation logged in dmesg/syslog. Also the ID of that zone's device was 0 which is a special value set in the source device of a device replace operation when the operation finishes (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()), confirming again that device /dev/sdd was the source of a device replace operation. Normally there should be as many zones in the readahead extent as there are devices, and I wasn't expecting the extent to be in a block group with a 'single' profile, so I went and confirmed with the following drgn script that there weren't any single profile block groups: $ cat dump_block_groups.py import sys import drgn from drgn import NULL, Object, cast, container_of, execscript, \ reinterpret, sizeof from drgn.helpers.linux import * mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1" mnt = None for mnt in for_each_mount(prog, dst = mnt_path): pass if mnt is None: sys.stderr.write(f'Error: mount point {mnt_path} not found\n') sys.exit(1) fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info) BTRFS_BLOCK_GROUP_DATA = (1 << 0) BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1) BTRFS_BLOCK_GROUP_METADATA = (1 << 2) BTRFS_BLOCK_GROUP_RAID0 = (1 << 3) BTRFS_BLOCK_GROUP_RAID1 = (1 << 4) BTRFS_BLOCK_GROUP_DUP = (1 << 5) BTRFS_BLOCK_GROUP_RAID10 = (1 << 6) BTRFS_BLOCK_GROUP_RAID5 = (1 << 7) BTRFS_BLOCK_GROUP_RAID6 = (1 << 8) BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9) BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10) def bg_flags_string(bg): flags = bg.flags.value_() ret = '' if flags & BTRFS_BLOCK_GROUP_DATA: ret = 'data' if flags & BTRFS_BLOCK_GROUP_METADATA: if len(ret) > 0: ret += '|' ret += 'meta' if flags & BTRFS_BLOCK_GROUP_SYSTEM: if len(ret) > 0: ret += '|' ret += 'system' if flags & BTRFS_BLOCK_GROUP_RAID0: ret += ' raid0' elif flags & BTRFS_BLOCK_GROUP_RAID1: ret += ' raid1' elif flags & BTRFS_BLOCK_GROUP_DUP: ret += ' dup' elif flags & BTRFS_BLOCK_GROUP_RAID10: ret += ' raid10' elif flags & BTRFS_BLOCK_GROUP_RAID5: ret += ' raid5' elif flags & BTRFS_BLOCK_GROUP_RAID6: ret += ' raid6' elif flags & BTRFS_BLOCK_GROUP_RAID1C3: ret += ' raid1c3' elif flags & BTRFS_BLOCK_GROUP_RAID1C4: ret += ' raid1c4' else: ret += ' single' return ret def dump_bg(bg): print() print(f'block group at {hex(bg.value_())}') print(f'\t start {bg.start.value_()} length {bg.length.value_()}') print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}') bg_root = fs_info.block_group_cache_tree.address_of_() for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'): dump_bg(bg) $ drgn dump_block_groups.py block group at 0xffff8f3d673b0400 start 22020096 length 16777216 flags 258 - system raid6 block group at 0xffff8f3d53ddb400 start 38797312 length 536870912 flags 260 - meta raid6 block group at 0xffff8f3d5f4d9c00 start 575668224 length 2147483648 flags 257 - data raid6 block group at 0xffff8f3d08189000 start 2723151872 length 67108864 flags 258 - system raid6 block group at 0xffff8f3db70ff000 start 2790260736 length 1073741824 flags 260 - meta raid6 block group at 0xffff8f3d5f4dd800 start 3864002560 length 67108864 flags 258 - system raid6 block group at 0xffff8f3d67037000 start 3931111424 length 2147483648 flags 257 - data raid6 $ So there were only 2 reasons left for having a readahead extent with a single zone: reada_find_zone(), called when creating a readahead extent, returned NULL either because we failed to find the corresponding block group or because a memory allocation failed. With some additional and custom tracing I figured out that on every further ocurrence of the problem the block group had just been deleted when we were looping to create the zones for the readahead extent (at reada_find_extent()), so we ended up with only one zone in the readahead extent, corresponding to a device that ends up getting replaced. So after figuring that out it became obvious why the hang happens: 1) Task A starts a scrub on any device of the filesystem, except for device /dev/sdd; 2) Task B starts a device replace with /dev/sdd as the source device; 3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently starting to scrub a stripe from block group X. This call to btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add() calls reada_add_block(), it passes the logical address of the extent tree's root node as its 'logical' argument - a value of 38928384; 4) Task A then enters reada_find_extent(), called from reada_add_block(). It finds there isn't any existing readahead extent for the logical address 38928384, so it proceeds to the path of creating a new one. It calls btrfs_map_block() to find out which stripes exist for the block group X. On the first iteration of the for loop that iterates over the stripes, it finds the stripe for device /dev/sdd, so it creates one zone for that device and adds it to the readahead extent. Before getting into the second iteration of the loop, the cleanup kthread deletes block group X because it was empty. So in the iterations for the remaining stripes it does not add more zones to the readahead extent, because the calls to reada_find_zone() returned NULL because they couldn't find block group X anymore. As a result the new readahead extent has a single zone, corresponding to the device /dev/sdd; 4) Before task A returns to btrfs_reada_add() and queues the readahead job for the readahead work queue, task B finishes the device replace and at btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new device /dev/sdg; 5) Task A returns to reada_add_block(), which increments the counter "->elems" of the reada_control structure allocated at btrfs_reada_add(). Then it returns back to btrfs_reada_add() and calls reada_start_machine(). This queues a job in the readahead work queue to run the function reada_start_machine_worker(), which calls __reada_start_machine(). At __reada_start_machine() we take the device list mutex and for each device found in the current device list, we call reada_start_machine_dev() to start the readahead work. However at this point the device /dev/sdd was already freed and is not in the device list anymore. This means the corresponding readahead for the extent at 38928384 is never started, and therefore the "->elems" counter of the reada_control structure allocated at btrfs_reada_add() never goes down to 0, causing the call to btrfs_reada_wait(), done by the scrub task, to wait forever. Note that the readahead request can be made either after the device replace started or before it started, however in pratice it is very unlikely that a device replace is able to start after a readahead request is made and is able to complete before the readahead request completes - maybe only on a very small and nearly empty filesystem. This hang however is not the only problem we can have with readahead and device removals. When the readahead extent has other zones other than the one corresponding to the device that is being removed (either by a device replace or a device remove operation), we risk having a use-after-free on the device when dropping the last reference of the readahead extent. For example if we create a readahead extent with two zones, one for the device /dev/sdd and one for the device /dev/sde: 1) Before the readahead worker starts, the device /dev/sdd is removed, and the corresponding btrfs_device structure is freed. However the readahead extent still has the zone pointing to the device structure; 2) When the readahead worker starts, it only finds device /dev/sde in the current device list of the filesystem; 3) It starts the readahead work, at reada_start_machine_dev(), using the device /dev/sde; 4) Then when it finishes reading the extent from device /dev/sde, it calls __readahead_hook() which ends up dropping the last reference on the readahead extent through the last call to reada_extent_put(); 5) At reada_extent_put() it iterates over each zone of the readahead extent and attempts to delete an element from the device's 'reada_extents' radix tree, resulting in a use-after-free, as the device pointer of the zone for /dev/sdd is now stale. We can also access the device after dropping the last reference of a zone, through reada_zone_release(), also called by reada_extent_put(). And a device remove suffers the same problem, however since it shrinks the device size down to zero before removing the device, it is very unlikely to still have readahead requests not completed by the time we free the device, the only possibility is if the device has a very little space allocated. While the hang problem is exclusive to scrub, since it is currently the only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free problem affects any path that triggers readhead, which includes btree_readahead_hook() and __readahead_hook() (a readahead worker can trigger readahed for the children of a node) for example - any path that ends up calling reada_add_block() can trigger the use-after-free after a device is removed. So fix this by waiting for any readahead requests for a device to complete before removing a device, ensuring that while waiting for existing ones no new ones can be made. This problem has been around for a very long time - the readahead code was added in 2011, device remove exists since 2008 and device replace was introduced in 2013, hard to pick a specific commit for a git Fixes tag. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
905eb88b |
|
18-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove struct extent_io_ops It's no longer used just remove the function and any related code which was initialising it for inodes. No functional changes. Removing 8 bytes from extent_io_tree in turn reduces size of other structures where it is embedded, notably btrfs_inode where it reduces size by 24 bytes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
908930f3 |
|
18-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: stop calling submit_bio_hook for data inodes Instead export and rename the function to btrfs_submit_data_bio and call it directly in submit_one_bio. This avoids paying the cost for speculative attacks mitigations and improves code readability. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a446d6a |
|
18-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: replace readpage_end_io_hook with direct calls Don't call readpage_end_io_hook for the btree inode. Instead of relying on indirect calls to implement metadata buffer validation simply check if the inode whose page we are processing equals the btree inode. If it does call the necessary function. This is an improvement in 2 directions: 1. We aren't paying the penalty of indirect calls in a post-speculation attacks world. 2. The function is now named more explicitly so it's obvious what's going on This is in preparation to removing struct extent_io_ops altogether. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e97659ce |
|
15-Sep-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use unaligned helpers for stack and header set/get helpers In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct pointer assignment but we should use the helpers for unaligned access for clarity. It hasn't been a problem so far because of the natural alignment. Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure from stack that has an aligned start but some members may not be aligned due to packing. This as well hasn't caused problems so far. Move the put/get_unaligned_le8 stubs to ctree.h so we can use them. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc0d82e1 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: sink total_data parameter in setup_items_for_insert That parameter can easily be derived based on the "data_size" and "nr" parameters exploit this fact to simply the function's signature. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3dc9dc89 |
|
01-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: eliminate total_size parameter from setup_items_for_insert The value of this argument can be derived from the total_data as it's simply the value of the data size + size of btrfs_items being touched. Move the parameter calculation inside the function. This results in a simpler interface and also a minor size reduction: ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34) Function old new delta btrfs_duplicate_item 260 259 -1 setup_items_for_insert 1200 1190 -10 btrfs_insert_empty_items 177 154 -23 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
306bfec0 |
|
08-Sep-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename btrfs_punch_hole_range() to a more generic name The function btrfs_punch_hole_range() is now used to replace all the file extents in a given file range with an extent described in the given struct btrfs_replace_extent_info argument. This extent can either be an existing extent that is being cloned or it can be a new extent (namely a prealloc extent). When that argument is NULL it only punches a hole (drops all the existing extents) in the file range. So rename the function to btrfs_replace_file_extents(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf385648 |
|
08-Sep-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename struct btrfs_clone_extent_info to a more generic name Now that we can use btrfs_clone_extent_info to convey information for a new prealloc extent as well, and not just for existing extents that are being cloned, rename it to btrfs_replace_extent_info, which reflects the fact that this is now more generic and it is used to replace all existing extents in a file range with the extent described by the structure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb870f6c |
|
08-Sep-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove item_size member of struct btrfs_clone_extent_info The value of item_size of struct btrfs_clone_extent_info is always set to the size of a non-inline file extent item, and in fact the infrastructure that uses this structure (btrfs_punch_hole_range()) does not work with inline file extents at all (and it is not supposed to). So just remove that field from the structure and use directly sizeof(struct btrfs_file_extent_item) instead. Also assert that the file extent type is not inline at btrfs_insert_clone_extent(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8fccebfa |
|
08-Sep-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix metadata reservation for fallocate that leads to transaction aborts When doing an fallocate(), specially a zero range operation, we assume that reserving 3 units of metadata space is enough, that at most we touch one leaf in subvolume/fs tree for removing existing file extent items and inserting a new file extent item. This assumption is generally true for most common use cases. However when we end up needing to remove file extent items from multiple leaves, we can end up failing with -ENOSPC and abort the current transaction, turning the filesystem to RO mode. When this happens a stack trace like the following is dumped in dmesg/syslog: [ 1500.620934] ------------[ cut here ]------------ [ 1500.620938] BTRFS: Transaction aborted (error -28) [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...) [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.621026] Code: 8b 40 50 f0 48 (...) [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286 [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000 [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001 [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000 [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60 [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000 [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0 [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1500.621049] Call Trace: [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs] [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs] [ 1500.621108] vfs_fallocate+0x14d/0x290 [ 1500.621112] ksys_fallocate+0x3a/0x70 [ 1500.621117] __x64_sys_fallocate+0x1a/0x20 [ 1500.621120] do_syscall_64+0x33/0x80 [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1500.621126] RIP: 0033:0x7fb5b248c477 [ 1500.621128] Code: 89 7c 24 08 (...) [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477 [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003 [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000 [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010 [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003 [ 1500.621151] irq event stamp: 1026217 [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0 [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0 [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606 [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20 [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]--- [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left When we use fallocate() internally, for reserving an extent for a space cache, inode cache or relocation, we can't hit this problem since either there aren't any file extent items to remove from the subvolume tree or there is at most one. When using plain fallocate() it's very unlikely, since that would require having many file extent items representing holes for the target range and crossing multiple leafs - we attempt to increase the range (merge) of such file extent items when punching holes, so at most we end up with 2 file extent items for holes at leaf boundaries. However when using the zero range operation of fallocate() for a large range (100+ MiB for example) that's fairly easy to trigger. The following example reproducer triggers the issue: $ cat reproducer.sh #!/bin/bash umount /dev/sdj &> /dev/null mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null mount /dev/sdj /mnt/sdj # Create a 100M file with many file extent items. Punch a hole every 8K # just to speedup the file creation - we could do 4K sequential writes # followed by fsync (or O_SYNC) as well, but that takes a lot of time. file_size=$((100 * 1024 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar for ((i = 0; i < $file_size; i += 8192)); do xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar done # Force a transaction commit, so the zero range operation will be forced # to COW all metadata extents it need to touch. sync xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar umount /mnt/sdj $ ./reproducer.sh wrote 104857600/104857600 bytes at offset 0 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec) fallocate: No space left on device $ dmesg <shows the same stack trace pasted before> To fix this use the existing infrastructure that hole punching and extent cloning use for replacing a file range with another extent. This deals with doing the removal of file extent items and inserting the new one using an incremental approach, reserving more space when needed and always ensuring we don't leave an implicit hole in the range in case we need to do multiple iterations and a crash happens between iterations. A test case for fstests will follow up soon. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3e1f96c |
|
25-Aug-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: enumerate the type of exclusive operation in progress Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fee248d |
|
31-Aug-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: convert btrfs_inode_sectorsize to take btrfs_inode It's counterintuitive to have a function named btrfs_inode_xxx which takes a generic inode. Also move the function to btrfs_inode.h so that it has access to the definition of struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9631e4cc |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks When we COW a block we are holding a lock on the original block, and then we lock the new COW block. Because our lockdep maps are based on root + level, this will make lockdep complain. We need a way to indicate a subclass for locking the COW'ed block, so plumb through our btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer, and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks. The reason I've added all this extra infrastructure is because there will be need of different nesting classes in follow up patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
51899412 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce btrfs_path::recurse Our current tree locking stuff allows us to recurse with read locks if we're already holding the write lock. This is necessary for the space cache inode, as we could be holding a lock on the root_tree root when we need to cache a block group, and thus need to be able to read down the root_tree to read in the inode cache. We can get away with this in our current locking, but we won't be able to with a rwsem. Handle this by purposefully annotating the places where we require recursion, so that in the future we can maybe come up with a way to avoid the recursion. In the case of the free space inode, this will be superseded by the free space tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e85fde51 |
|
24-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations [BUG] When quota is enabled for TEST_DEV, generic/013 sometimes fails like this: generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg) And with the following metadata leak: BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace a6cfd45ba80e4e06 ]--- BTRFS error (device dm-3): qgroup reserved space leaked BTRFS info (device dm-3): disk space caching is enabled BTRFS info (device dm-3): has skinny extents [CAUSE] The qgroup preallocated meta rsv operations of that offending root are: btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 It's pretty obvious that, we reserve qgroup meta rsv in btrfs_subvolume_reserve_metadata(), but doesn't have corresponding release/convert calls in btrfs_subvolume_release_metadata(). This leads to the leakage. [FIX] To fix this bug, we should follow what we're doing in btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and add it to block_rsv->qgroup_rsv_reserved. And free the qgroup reserved metadata space when releasing the block_rsv. To do this, we need to change the btrfs_subvolume_release_metadata() to accept btrfs_root, and record the qgroup_to_release number, and call btrfs_qgroup_convert_reserved_meta() for it. Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f85781fb |
|
17-Aug-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: switch to iomap for direct IO We're using direct io implementation based on buffer heads. This patch switches to the new iomap infrastructure. Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as iomap_begin() for iomap direct I/O functions. This function allocates and locks all the blocks required for the I/O. btrfs_submit_direct() is used as the submit_io() hook for direct I/O ops. Since we need direct I/O reads to go through iomap_dio_rw(), we change file_operations.read_iter() to a btrfs_file_read_iter() which calls btrfs_direct_IO() for direct reads and falls back to generic_file_buffered_read() for incomplete reads and buffered reads. We don't need address_space.direct_IO() anymore: set it to noop. Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is capable of direct I/O reads from a hole, so we don't need to return -ENOENT. Btrfs direct I/O is now done under i_rwsem, shared in case of reads and exclusive in case of writes. This guards against simultaneous truncates. Use iomap->iomap_end() to check for failed or incomplete direct I/O: - for writes, call __endio_write_update_ordered() - for reads, unlock extents btrfs_dio_data is now hooked in iomap->private and not current->journal_info. It carries the reservation variable and the amount of data submitted, so we can calculate the amount of data to call __endio_write_update_ordered in case of an error. This patch removes last use of struct buffer_head from btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
57056740 |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do async reclaim for data reservations Now that we have the data ticketing stuff in place, move normal data reservations to use an async reclaim helper to satisfy tickets. Before we could have multiple tasks race in and both allocate chunks, resulting in more data chunks than we would necessarily need. Serializing these allocations and making a single thread responsible for flushing will only allocate chunks as needed, as well as cut down on transaction commits and other flush related activities. Priority reservations will still work as they have before, simply trying to allocate a chunk until they can make their reservation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
058e6d1d |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add flushing states for handling data reservations Currently the way we do data reservations is by seeing if we have enough space in our space_info. If we do not and we're a normal inode we'll 1) Attempt to force a chunk allocation until we can't anymore. 2) If that fails we'll flush delalloc, then commit the transaction, then run the delayed iputs. If we are a free space inode we're only allowed to force a chunk allocation. In order to use the normal flushing mechanism we need to encode this into a flush state array for normal inodes. Since both will start with allocating chunks until the space info is full there is no need to add this as a flush state, this will be handled specially. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4912139 |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: change nr to u64 in btrfs_start_delalloc_roots We have btrfs_wait_ordered_roots() which takes a u64 for nr, but btrfs_start_delalloc_roots() that takes an int for nr, which makes using them in conjunction, especially for something like (u64)-1, annoying and inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and adjust start_delalloc_inodes() and it's callers appropriately. This means we've adjusted start_delalloc_inodes() to take a pointer of nr since we want to preserve the ability for start-delalloc_inodes() to return an error, so simply make it do the nr adjusting as necessary. Part of adjusting the callers to this means changing btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be confusing because it seems unrelated, but the caller of btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the function variable that needs to be changed. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a84d5d42 |
|
18-Aug-2020 |
Boris Burkov <boris@bur.io> |
btrfs: detect nocow for swap after snapshot delete can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for detecting a must cow condition which is not exactly accurate, but saves unnecessary tree traversal. The incorrect assumption is that if the extent was created in a generation smaller than the last snapshot generation, it must be referenced by that snapshot. That is true, except the snapshot could have since been deleted, without affecting the last snapshot generation. The original patch claimed a performance win from this check, but it also leads to a bug where you are unable to use a swapfile if you ever snapshotted the subvolume it's in. Make the check slower and more strict for the swapon case, without modifying the general cow checks as a compromise. Turning swap on does not seem to be a particularly performance sensitive operation, so incurring a possibly unnecessary btrfs_search_slot seems worthwhile for the added usability. Note: Until the snapshot is competely cleaned after deletion, check_committed_refs will still cause the logic to think that cow is necessary, so the user must until 'btrfs subvolu sync' finished before activating the swapfile swapon. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
604997b4 |
|
27-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use the correct const function attribute for btrfs_get_num_csums The build robot reports compiler: h8300-linux-gcc (GCC) 9.3.0 In file included from fs/btrfs/tests/extent-map-tests.c:8: >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers] 2166 | size_t __const btrfs_get_num_csums(void); | ^~~~~~~ The function attribute for const does not follow the expected scheme and in this case is confused with a const type qualifier. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f95ebdbe |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't WARN if we abort a transaction with EROFS If we got some sort of corruption via a read and call btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and complain. If a subsequent trans handle trips over this it'll get EROFS and then abort. However at that point we're not aborting for the original reason, we're aborting because we've been flipped read only. We do not need to WARN_ON() here. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd7fb634 |
|
12-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: add comments for btrfs_reserve_flush_enum This enum is the interface exposed to developers. Although we have a detailed comment explaining the whole idea of space flushing at the beginning of space-info.c, the exposed enum interface doesn't have any comment. Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are not explained at all. So add some simple comments for these enums as a quick reference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
adca4d94 |
|
13-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT") tries to reduce the early EDQUOT problems by checking the qgroup free against threshold and tries to wake up commit kthread to free some space. The problem of that mechanism is, it can only free qgroup per-trans metadata space, can't do anything to data, nor prealloc qgroup space. Now since we have the ability to flush qgroup space, and implemented retry-after-EDQUOT behavior, such mechanism can be completely replaced. So this patch will cleanup such mechanism in favor of retry-after-EDQUOT. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c53e9653 |
|
13-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT [PROBLEM] There are known problem related to how btrfs handles qgroup reserved space. One of the most obvious case is the the test case btrfs/153, which do fallocate, then write into the preallocated range. btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad) --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800 +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800 @@ -1,2 +1,5 @@ QA output created by 153 +pwrite: Disk quota exceeded +/mnt/scratch/testfile2: Disk quota exceeded +/mnt/scratch/testfile2: Disk quota exceeded Silence is golden ... (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff) [CAUSE] Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"), we always reserve space no matter if it's COW or not. Such behavior change is mostly for performance, and reverting it is not a good idea anyway. For preallcoated extent, we reserve qgroup data space for it already, and since we also reserve data space for qgroup at buffered write time, it needs twice the space for us to write into preallocated space. This leads to the -EDQUOT in buffered write routine. And we can't follow the same solution, unlike data/meta space check, qgroup reserved space is shared between data/metadata. The EDQUOT can happen at the metadata reservation, so doing NODATACOW check after qgroup reservation failure is not a solution. [FIX] To solve the problem, we don't return -EDQUOT directly, but every time we got a -EDQUOT, we try to flush qgroup space: - Flush all inodes of the root NODATACOW writes will free the qgroup reserved at run_dealloc_range(). However we don't have the infrastructure to only flush NODATACOW inodes, here we flush all inodes anyway. - Wait for ordered extents This would convert the preallocated metadata space into per-trans metadata, which can be freed in later transaction commit. - Commit transaction This will free all per-trans metadata space. Also we don't want to trigger flush multiple times, so here we introduce a per-root wait list and a new root status, to ensure only one thread starts the flushing. Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
60f8667b |
|
06-Jul-2020 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: add multi-statement protection to btrfs_set/clear_and_info macros Multi-statement macros should be enclosed in do/while(0) block to make their use safe in single statement if conditions. All current uses of the macros are safe, so this change is for future protection. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a93e0168 |
|
01-Jul-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove no longer needed use of log_writers for the log root tree When syncing the log, we used to update the log root tree without holding neither the log_mutex of the subvolume root nor the log_mutex of log root tree. We used to have two critical sections delimited by the log_mutex of the log root tree, so in the first one we incremented the log_writers of the log root tree and on the second one we decremented it and waited for the log_writers counter to go down to zero. This was because the update of the log root tree happened between the two critical sections. The use of two critical sections allowed a little bit more of parallelism and required the use of the log_writers counter, necessary to make sure we didn't miss any log root tree update when we have multiple tasks trying to sync the log in parallel. However after commit 06989c799f0481 ("Btrfs: fix race updating log root item during fsync") the log root tree update was moved into a critical section delimited by the subvolume's log_mutex. Later another commit moved the log tree update from that critical section into the second critical section delimited by the log_mutex of the log root tree. Both commits addressed different bugs. The end result is that the first critical section delimited by the log_mutex of the log root tree became pointless, since there's nothing done between it and the second critical section, we just have an unlock of the log_mutex followed by a lock operation. This means we can merge both critical sections, as the first one does almost nothing now, and we can stop using the log_writers counter of the log root tree, which was incremented in the first critical section and decremented in the second criticial section, used to make sure no one in the second critical section started writeback of the log root tree before some other task updated it. So just remove the mutex_unlock() followed by mutex_lock() of the log root tree, as well as the use of the log_writers counter for the log root tree. This patch is part of a series that has the following patches: 1/4 btrfs: only commit the delayed inode when doing a full fsync 2/4 btrfs: only commit delayed items at fsync if we are logging a directory 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log 4/4 btrfs: remove no longer needed use of log_writers for the log root tree After the entire patchset applied I saw about 12% decrease on max latency reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of ram, using kvm and using a raw NVMe device directly (no intermediary fs on the host). The test was invoked like the following: mkfs.btrfs -f /dev/sdk mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk dbench -D /mnt/sdk -t 300 8 umount /mnt/dsk CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
28a95795 |
|
01-Jul-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop incremening log_batch for the log root tree when syncing log We are incrementing the log_batch atomic counter of the root log tree but we never use that counter, it's used only for the log trees of subvolume roots. We started doing it when we moved the log_batch and log_write counters from the global, per fs, btrfs_fs_info structure, into the btrfs_root structure in commit 7237f1833601dc ("Btrfs: fix tree logs parallel sync"). So just stop doing it for the log root tree and add a comment over the field declaration so inform it's used only for log trees of subvolume roots. This patch is part of a series that has the following patches: 1/4 btrfs: only commit the delayed inode when doing a full fsync 2/4 btrfs: only commit delayed items at fsync if we are logging a directory 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log 4/4 btrfs: remove no longer needed use of log_writers for the log root tree After the entire patchset applied I saw about 12% decrease on max latency reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of ram, using kvm and using a raw NVMe device directly (no intermediary fs on the host). The test was invoked like the following: mkfs.btrfs -f /dev/sdk mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk dbench -D /mnt/sdk -t 300 8 umount /mnt/dsk CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49e5fb46 |
|
27-Jun-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: export qgroups in sysfs This patch will add the following sysfs interface: /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags Which is also available in output of "btrfs qgroup show". /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc The last 3 rsv related members are not visible to users, but can be very useful to debug qgroup limit related bugs. Also, to avoid '/' used in <qgroup_id>, the separator between qgroup level and qgroup id is changed to '_'. The interface is not hidden behind 'debug' as we want this interface to be included into production build and to provide another way to read the qgroup information besides the ioctls. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
088545f6 |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_dirty_pages take btrfs_inode There is a single use of the generic vfs_inode so let's take btrfs_inode as a parameter and remove couple of redundant BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2566f22 |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_set_extent_delalloc take btrfs_inode Preparation to make btrfs_dirty_pages take btrfs_inode as parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
98456b9c |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_run_delalloc_range take btrfs_inode All children now take btrfs_inode so convert it to taking it as a parameter as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38d37aa9 |
|
23-Jun-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: refactor btrfs_check_can_nocow() into two variants The function btrfs_check_can_nocow() now has two completely different call patterns. For nowait variant, callers don't need to do any cleanup. While for wait variant, callers need to release the lock if they can do nocow write. This is somehow confusing, and is already a problem for the exported btrfs_check_can_nocow(). So this patch will separate the different patterns into different functions. For nowait variant, the function will be called check_nocow_nolock(). For wait variant, the function pair will be btrfs_check_nocow_lock() btrfs_check_nocow_unlock(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d4572a9 |
|
23-Jun-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation [BUG] When the data space is exhausted, even if the inode has NOCOW attribute, we will still refuse to truncate unaligned range due to ENOSPC. The following script can reproduce it pretty easily: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev -b 1G mount -o nospace_cache $dev $mnt touch $mnt/foobar chattr +C $mnt/foobar xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null sync xfs_io -c "fpunch 0 2k" $mnt/foobar umount $mnt Currently this will fail at the fpunch part. [CAUSE] Because btrfs_truncate_block() always reserves space without checking the NOCOW attribute. Since the writeback path follows NOCOW bit, we only need to bother the space reservation code in btrfs_truncate_block(). [FIX] Make btrfs_truncate_block() follow btrfs_buffered_write() to try to reserve data space first, and fall back to NOCOW check only when we don't have enough space. Such always-try-reserve is an optimization introduced in btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call. This patch will export check_can_nocow() as btrfs_check_can_nocow(), and use it in btrfs_truncate_block() to fix the problem. Reported-by: Martin Doucha <martin.doucha@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a2570ef3 |
|
23-Jun-2020 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused btrfs_root::defrag_trans_start Last touched in 2013 by commit de78b51a2852 ("btrfs: remove cache only arguments from defrag path") that was the only code that used the value. Now it's only set but never used for anything, so we can remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
906c448c |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make __btrfs_drop_extents take btrfs_inode It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the interface with the non __ prefixed version. Will also makes converting its callers to btrfs_inode easier. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bd242a08 |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_csum_one_bio takae btrfs_inode Will enable converting btrfs_submit_compressed_write to btrfs_inode more easily. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7bfa9535 |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_reloc_clone_csums take btrfs_inode It really wants btrfs_inode and not a vfs inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ce6ef5ab |
|
08-Jun-2020 |
David Sterba <dsterba@suse.com> |
btrfs: add little-endian optimized key helpers The CPU and on-disk keys are mapped to two different structures because of the endianness. There's an intermediate buffer used to do the conversion, but this is not necessary when CPU and on-disk endianness match. Add optimized versions of helpers that take disk_key and use the buffer directly for CPU keys or drop the intermediate buffer and conversion. This saves a lot of stack space accross many functions and removes about 6K of generated binary code: text data bss dec hex filename 1090439 17468 14912 1122819 112203 pre/btrfs.ko 1084613 17456 14912 1116981 110b35 post/btrfs.ko Delta: -5826 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
203f44c5 |
|
09-Jun-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: inode: refactor the parameters of insert_reserved_file_extent() Function insert_reserved_file_extent() takes a long list of parameters, which are all for btrfs_file_extent_item, even including two reserved members, encryption and other_encoding. This makes the parameter list unnecessary long for a function which only gets called twice. This patch will refactor the parameter list, by using btrfs_file_extent_item as parameter directly to hugely reduce the number of parameters. Also, since there are only two callers, one in btrfs_finish_ordered_io() which inserts file extent for ordered extent, and one __btrfs_prealloc_file_range(). These two call sites have completely different context, where ordered extent can be compressed, but will always be regular extent, while the preallocated one is never going to be compressed and always has PREALLOC type. So use two small wrapper for these two different call sites to improve readability. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e7a79811 |
|
15-Jun-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: check if a log root exists before locking the log_mutex on unlink This brings back an optimization that commit e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") removed, but in a different form. So it's almost equivalent to a revert. That commit removed an optimization where we avoid locking a root's log_mutex when there is no log tree created in the current transaction. The affected code path is triggered through unlink operations. That commit was based on the assumption that the optimization was not necessary because we used to have the following checks when the patch was authored: int btrfs_del_dir_entries_in_log(...) { (...) if (dir->logged_trans < trans->transid) return 0; ret = join_running_log_trans(root); (...) } int btrfs_del_inode_ref_in_log(...) { (...) if (inode->logged_trans < trans->transid) return 0; ret = join_running_log_trans(root); (...) } However before that patch was merged, another patch was merged first which replaced those checks because they were buggy. That other patch corresponds to commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions"). The assumption that if the logged_trans field of an inode had a smaller value then the current transaction's generation (transid) meant that the inode was not logged in the current transaction was only correct if the inode was not evicted and reloaded in the current transaction. So the corresponding bug fix changed those checks and replaced them with the following helper function: static bool inode_logged(struct btrfs_trans_handle *trans, struct btrfs_inode *inode) { if (inode->logged_trans == trans->transid) return true; if (inode->last_trans == trans->transid && test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) && !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags)) return true; return false; } So if we have a subvolume without a log tree in the current transaction (because we had no fsyncs), every time we unlink an inode we can end up trying to lock the log_mutex of the root through join_running_log_trans() twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log()) and once for the parent directory (with btrfs_del_dir_entries_in_log()). This means if we have several unlink operations happening in parallel for inodes in the same subvolume, and the those inodes and/or their parent inode were changed in the current transaction, we end up having a lot of contention on the log_mutex. The test robots from intel reported a -30.7% performance regression for a REAIM test after commit e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans"). So just bring back the optimization to join_running_log_trans() where we check first if a log root exists before trying to lock the log_mutex. This is done by checking for a bit that is set on the root when a log tree is created and removed when a log tree is freed (at transaction commit time). Commit e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") was merged in the 5.4 merge window while commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions") was merged in the 5.3 merge window. But the first commit was actually authored before the second commit (May 23 2019 vs June 19 2019). Reported-by: kernel test robot <rong.a.chen@intel.com> Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/ Fixes: e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
55e20bd1 |
|
09-Jun-2020 |
David Sterba <dsterba@suse.com> |
Revert "btrfs: switch to iomap_dio_rw() for dio" This reverts commit a43a67a2d715540c1368b9501a22b0373b5874c0. This patch reverts the main part of switching direct io implementation to iomap infrastructure. There's a problem in invalidate page that couldn't be solved as regression in this development cycle. The problem occurs when buffered and direct io are mixed, and the ranges overlap. Although this is not recommended, filesystems implement measures or fallbacks to make it somehow work. In this case, fallback to buffered IO would be an option for btrfs (this already happens when direct io is done on compressed data), but the change would be needed in the iomap code, bringing new semantics to other filesystems. Another problem arises when again the buffered and direct ios are mixed, invalidation fails, then -EIO is set on the mapping and fsync will fail, though there's no real error. There have been discussions how to fix that, but revert seems to be the least intrusive option. Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/ Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f4c48b44 |
|
09-Jun-2020 |
David Sterba <dsterba@suse.com> |
Revert "btrfs: split btrfs_direct_IO to read and write part" This reverts commit d8f3e73587ce574f7a9bc165e0db69b0b148f6f8. The patch is a cleanup of direct IO port to iomap infrastructure, which gets reverted. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8f3e735 |
|
19-May-2020 |
Christoph Hellwig <hch@lst.de> |
btrfs: split btrfs_direct_IO to read and write part The read and write versions don't have anything in common except for the call to iomap_dio_rw. So split this function, and merge each half into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a43a67a2 |
|
19-May-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: switch to iomap_dio_rw() for dio Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as iomap_begin() for iomap direct I/O functions. This function allocates and locks all the blocks required for the I/O. btrfs_submit_direct() is used as the submit_io() hook for direct I/O ops. Since we need direct I/O reads to go through iomap_dio_rw(), we change file_operations.read_iter() to a btrfs_file_read_iter() which calls btrfs_direct_IO() for direct reads and falls back to generic_file_buffered_read() for incomplete reads and buffered reads. We don't need address_space.direct_IO() anymore so set it to noop. Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is capable of direct I/O reads from a hole, so we don't need to return -ENOENT. BTRFS direct I/O is now done under i_rwsem, shared in case of reads and exclusive in case of writes. This guards against simultaneous truncates. Use iomap->iomap_end() to check for failed or incomplete direct I/O: - for writes, call __endio_write_update_ordered() - for reads, unlock extents btrfs_dio_data is now hooked in iomap->private and not current->journal_info. It carries the reservation variable and the amount of data submitted, so we can calculate the amount of data to call __endio_write_update_ordered in case of an error. This patch removes last use of struct buffer_head from btrfs. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e289f03e |
|
17-May-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents When we have extents shared amongst different inodes in the same subvolume, if we fsync them in parallel we can end up with checksum items in the log tree that represent ranges which overlap. For example, consider we have inodes A and B, both sharing an extent that covers the logical range from X to X + 64KiB: 1) Task A starts an fsync on inode A; 2) Task B starts an fsync on inode B; 3) Task A calls btrfs_csum_file_blocks(), and the first search in the log tree, through btrfs_lookup_csum(), returns -EFBIG because it finds an existing checksum item that covers the range from X - 64KiB to X; 4) Task A checks that the checksum item has not reached the maximum possible size (MAX_CSUM_ITEMS) and then releases the search path before it does another path search for insertion (through a direct call to btrfs_search_slot()); 5) As soon as task A releases the path and before it does the search for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG too, because there is an existing checksum item that has an end offset that matches the start offset (X) of the checksum range we want to log; 6) Task B releases the path; 7) Task A does the path search for insertion (through btrfs_search_slot()) and then verifies that the checksum item that ends at offset X still exists and extends its size to insert the checksums for the range from X to X + 64KiB; 8) Task A releases the path and returns from btrfs_csum_file_blocks(), having inserted the checksums into an existing checksum item that got its size extended. At this point we have one checksum item in the log tree that covers the logical range from X - 64KiB to X + 64KiB; 9) Task B now does a search for insertion using btrfs_search_slot() too, but it finds that the previous checksum item no longer ends at the offset X, it now ends at an of offset X + 64KiB, so it leaves that item untouched. Then it releases the path and calls btrfs_insert_empty_item() that inserts a checksum item with a key offset corresponding to X and a size for inserting a single checksum (4 bytes in case of crc32c). Subsequent iterations end up extending this new checksum item so that it contains the checksums for the range from X to X + 64KiB. So after task B returns from btrfs_csum_file_blocks() we end up with two checksum items in the log tree that have overlapping ranges, one for the range from X - 64KiB to X + 64KiB, and another for the range from X to X + 64KiB. Having checksum items that represent ranges which overlap, regardless of being in the log tree or in the chekcsums tree, can lead to problems where checksums for a file range end up not being found. This type of problem has happened a few times in the past and the following commits fixed them and explain in detail why having checksum items with overlapping ranges is problematic: 27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums" b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync" 40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree" Since this specific instance of the problem can only happen when logging inodes, because it is the only case where concurrent attempts to insert checksums for the same range can happen, fix the issue by using an extent io tree as a range lock to serialize checksum insertion during inode logging. This issue could often be reproduced by the test case generic/457 from fstests. When it happens it produces the following trace: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610 BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884 item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4 item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4 item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4 item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4 item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4 item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4 (...) BTRFS error (device dm-0): block=30625792 write time tree block corruption detected ------------[ cut here ]------------ WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs] Modules linked in: btrfs dm_thin_pool ... CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs] Code: c7 c7 ... RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001 RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0 R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000 FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x67/0xc0 [btrfs] submit_one_bio+0x31/0x50 [btrfs] btree_write_cache_pages+0x2db/0x4b0 [btrfs] ? __filemap_fdatawrite_range+0xb1/0x110 do_writepages+0x23/0x80 __filemap_fdatawrite_range+0xd2/0x110 btrfs_write_marked_extents+0x15e/0x180 [btrfs] btrfs_sync_log+0x206/0x10a0 [btrfs] ? kmem_cache_free+0x315/0x3b0 ? btrfs_log_inode+0x1e8/0xf90 [btrfs] ? __mutex_unlock_slowpath+0x45/0x2a0 ? lockref_put_or_lock+0x9/0x30 ? dput+0x2d/0x580 ? dput+0xb5/0x580 ? btrfs_sync_file+0x464/0x4d0 [btrfs] btrfs_sync_file+0x464/0x4d0 [btrfs] do_fsync+0x38/0x60 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fb41953a6d0 Code: 48 3d ... RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0 RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003 RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009 R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060 R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace d543fc76f5ad7fd8 ]--- In that trace the tree checker detected the overlapping checksum items at the time when we triggered writeback for the log tree when syncing the log. Another trace that can happen is due to BUG_ON() when deleting checksum items while logging an inode: BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584) BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610 BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473 item 0 key (257 1 0) itemoff 16123 itemsize 160 inode generation 7 size 262144 mode 100600 item 1 key (257 12 256) itemoff 16103 itemsize 20 item 2 key (257 108 0) itemoff 16050 itemsize 53 extent data disk bytenr 13631488 nr 4096 extent data offset 0 nr 131072 ram 131072 (...) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs] Code: 0f b6 ... RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001 RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08 R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_del_csums+0x2f4/0x540 [btrfs] copy_items+0x4b5/0x560 [btrfs] btrfs_log_inode+0x910/0xf90 [btrfs] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] ? dget_parent+0x5/0x370 btrfs_log_dentry_safe+0x4a/0x70 [btrfs] btrfs_sync_file+0x42b/0x4d0 [btrfs] __x64_sys_msync+0x199/0x200 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fe586c65760 Code: 00 f7 ... RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760 RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000 RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61 R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1 R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420 Modules linked in: dm_log_writes ... ---[ end trace c92a7f447a8515f5 ]--- CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0202e83f |
|
15-May-2020 |
David Sterba <dsterba@suse.com> |
btrfs: simplify iget helpers The inode lookup starting at btrfs_iget takes the full location key, while only the objectid is used to match the inode, because the lookup happens inside the given root thus the inode number is unique. The entire location key is properly set up in btrfs_init_locked_inode. Simplify the helpers and pass only inode number, renaming it to 'ino' instead of 'objectid'. This allows to remove temporary variables key, saving some stack space. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aeb935a4 |
|
15-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: don't set SHAREABLE flag for data reloc tree SHAREABLE flag is set for subvolumes because users can create snapshot for subvolumes, thus sharing tree blocks of them. But data reloc tree is not exposed to user space, as it's only an internal tree for data relocation, thus it doesn't need the full path replacement handling at all. This patch will make data reloc tree a non-shareable tree, and add btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code can grab it from fs_info directly. This would slightly improve tree relocation, as now data reloc tree can go through regular COW routine to get relocated, without bothering the complex tree reloc tree routine. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92a7cc42 |
|
15-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE The name BTRFS_ROOT_REF_COWS is not very clear about the meaning. In fact, that bit can only be set to those trees: - Subvolume roots - Data reloc root - Reloc roots for above roots All other trees won't get this bit set. So just by the result, it is obvious that, roots with this bit set can have tree blocks shared with other trees. Either shared by snapshots, or by reloc roots (an special snapshot created by relocation). This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and update all comment mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b48966a |
|
28-Apr-2020 |
David Sterba <dsterba@suse.com> |
btrfs: constify extent_buffer in the API functions There are many helpers around extent buffers, found in extent_io.h and ctree.h. Most of them can be converted to take constified eb as there are no changes to the extent buffer structure itself but rather the pages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
870b388d |
|
29-Apr-2020 |
David Sterba <dsterba@suse.com> |
btrfs: preset set/get token with first page and drop condition All the set/get helpers first check if the token contains a cached address. After first use the address is always valid, but the extra check is done for each call. The token initialization can optimistically set it to the first extent buffer page, that we know always exists. Then the condition in all btrfs_token_*/btrfs_set_token_* can be simplified by removing the address check from the condition, but for development the assertion still makes sure it's valid. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc4c13d5 |
|
28-Apr-2020 |
David Sterba <dsterba@suse.com> |
btrfs: drop eb parameter from set/get token helpers Now that all set/get helpers use the eb from the token, we don't need to pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some stack space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
684b752b |
|
08-May-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move the block group freeze/unfreeze helpers into block-group.c The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group() used to be named btrfs_get_block_group_trimming() and btrfs_put_block_group_trimming() respectively. At the time they were added to free-space-cache.c, by commit e33e17ee1098 ("btrfs: add missing discards when unpinning extents with -o discard") because all the trimming related functions were in free-space-cache.c. Now that the helpers were renamed and are used in scrub context as well, move them to block-group.c, a much more logical location for them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b7304af |
|
08-May-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename member 'trimming' of block group to a more generic name Back in 2014, commit 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation"), I added the 'trimming' member to the block group structure. Its purpose was to prevent races between trimming and block group deletion/allocation by pinning the block group in a way that prevents its logical address and device extents from being reused while trimming is in progress for a block group, so that if another task deletes the block group and then another task allocates a new block group that gets the same logical address and device extents while the trimming task is still in progress. After the previous fix for scrub (patch "btrfs: fix a race between scrub and block group removal/allocation"), scrub now also has the same needs that trimming has, so the member name 'trimming' no longer makes sense. Since there is already a 'pinned' member in the block group that refers to space reservations (pinned bytes), rename the member to 'frozen', add a comment on top of it to describe its general purpose and rename the helpers to increment and decrement the counter as well, to match the new member name. The next patch in the series will move the helpers into a more suitable file (from free-space-cache.c to block-group.c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
31344b2f |
|
11-May-2020 |
David Sterba <dsterba@suse.com> |
btrfs: remove more obsolete v0 extent ref declarations The extent references v0 have been superseded long time go, there are some unused declarations of access helpers. We can safely remove them now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct btrfs_extent_item_v0 is still part of a backward compatibility check in relocation.c and thus not removed. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
943aeb0d |
|
09-May-2020 |
YueHaibing <yuehaibing@huawei.com> |
btrfs: remove unused function btrfs_dev_extent_chunk_tree_uuid There's no callers in-tree anymore since commit d24ee97b96db ("btrfs: use new helpers to set uuids in eb") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c047a69 |
|
16-Apr-2020 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of endio_repair_workers This was originally added in commit 8b110e393c5a ("Btrfs: implement repair function when direct read fails") to avoid a deadlock. In that commit, the direct I/O read endio executes on the endio_workers workqueue, submits a repair bio, and waits for it to complete. The repair bio endio must execute on a different workqueue, otherwise it could block on the endio_workers workqueue becoming available, which won't happen because the original endio is blocked on the repair bio. As of the previous commit, the original endio doesn't wait for the repair bio, so this separate workqueue is unnecessary. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3b83361 |
|
17-Apr-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove the redundant parameter level in btrfs_bin_search() All callers pass the eb::level so we can get read it directly inside the btrfs_bin_search and key_search. This is inspired by the work of Marek in U-boot. CC: Marek Behun <marek.behun@nic.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f9fe614 |
|
13-Mar-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: improve global reserve stealing logic For unlink transactions and block group removal btrfs_start_transaction_fallback_global_rsv will first try to start an ordinary transaction and if it fails it will fall back to reserving the required amount by stealing from the global reserve. This is problematic because of all the same reasons we had with previous iterations of the ENOSPC handling, thundering herd. We get a bunch of failures all at once, everybody tries to allocate from the global reserve, some win and some lose, we get an ENSOPC. Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's used to mark unlink reservation. To fix this we need to integrate this logic into the normal ENOSPC infrastructure. We still go through all of the normal flushing work, and at the moment we begin to fail all the tickets we try to satisfy any tickets that are allowed to steal by stealing from the global reserve. If this works we start the flushing system over again just like we would with a normal ticket satisfaction. This serializes our global reserve stealing, so we don't have the thundering herd problem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
55465730 |
|
02-Mar-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: backref: rename and move should_ignore_root() This function is mostly single purpose to relocation backref cache, but since we're moving the main part of backref cache to backref.c, we need to export such function. And to avoid confusion, rename the function to btrfs_should_ignore_reloc_root() make the name a little more clear. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2433bea5 |
|
05-Mar-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: reloc: make reloc root search-specific for relocation backref cache find_reloc_root() searches reloc_control::reloc_root_tree to find the reloc root. This behavior is only useful for relocation backref cache. For the incoming more generic purpose backref cache, we don't care about who owns the reloc root, but only care if it's a reloc root. So this patch makes the following modifications to make the reloc root search more specific to relocation backref: - Add backref_node::is_reloc_root This will be an extra indicator for generic purposed backref cache. User doesn't need to read root key from backref_node::root to determine if it's a reloc root. Also for reloc tree root, it's useless and will be queued to useless list. - Add backref_cache::is_reloc This will allow backref cache code to do different behavior for generic purpose backref cache and relocation backref cache. - Pass fs_info to find_reloc_root() - Export find_reloc_root() So backref.c can utilize this function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c75e8394 |
|
14-Feb-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: kill the subvol_srcu Now that we have proper root ref counting everywhere we can kill the subvol_srcu. * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes * the refcount_t used for the references checks for accidental 0->1 in cases where the root lifetime would not be properly protected * there's a leak detector for roots to catch unfreed roots at umount time * SRCU served us well over the years but is was not a proper synchronization mechanism for some cases Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3fd63727 |
|
14-Feb-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make the extent buffer leak check per fs info I'm going to make the entire destruction of btrfs_root's controlled by their refcount, so it will be helpful to notice if we're leaking their eb's on umount. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a5eeb3d1 |
|
08-Mar-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add helper to get the end offset of a file extent item Getting the end offset for a file extent item requires a bit of code since the extent can be either inline or regular/prealloc. There are some places all over the code base that open code this logic and in another patch later in this series it will be needed again. Therefore encapsulate this logic in a helper function and use it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0078a9f9 |
|
10-Mar-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove block_rsv parameter from btrfs_drop_snapshot It's no longer used following 30d40577e322 ("btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
726a3421 |
|
16-Feb-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: add error injection points for cancelling balance Introduce a new error injection point, should_cancel_balance(). It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but allows us to override the return value. Currently there are only one locations using this function: - btrfs_balance() It checks cancel before each block group. There are other locations checking fs_info->balance_cancel_req, but they are not used as an indicator to exit, so there is no need to use the wrapper. But there will be more locations coming, and some locations can cause kernel panic if not handled properly. So introduce this error injection to provide better test interface. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a177381 |
|
28-Feb-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: move all reflink implementation code into its own file The reflink code is quite large and has been living in ioctl.c since ever. It has grown over the years after many bug fixes and improvements, and since I'm planning on making some further improvements on it, it's time to get it better organized by moving into its own file, reflink.c (similar to what xfs does for example). This change only moves the code out of ioctl.c into the new file, it doesn't do any other change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42c9d0b5 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: simplify parameters of btrfs_set_disk_extent_flags All callers pass extent buffer start and length so the extent buffer itself should work fine. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4ac7541 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: open code trivial helper btrfs_header_chunk_tree_uuid The helper btrfs_header_chunk_tree_uuid follows naming convention of other struct accessors but does something compeletly different. As the offsetof calculation is clear in the context of extent buffer operations we can remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a8658e3 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: open code trivial helper btrfs_header_fsid The helper btrfs_header_fsid follows naming convention of other struct accessors but does something compeletly different. As the offsetof calculation is clear in the context of extent buffer operations we can remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dcc3eb96 |
|
30-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: convert snapshot/nocow exlcusion to drew lock This patch removes all haphazard code implementing nocow writers exclusion from pending snapshot creation and switches to using the drew lock to ensure this invariant still holds. 'Readers' are snapshot creators from create_snapshot and 'writers' are nocow writers from buffered write path or btrfs_setsize. This locking scheme allows for multiple snapshots to happen while any nocow writers are blocked, since writes to page cache in the nocow path will make snapshots inconsistent. So for performance reasons we'd like to have the ability to run multiple concurrent snapshots and also favors readers in this case. And in case there aren't pending snapshots (which will be the majority of the cases) we rely on the percpu's writers counter to avoid cacheline contention. The main gain from using the drew lock is it's now a lot easier to reason about the guarantees of the locking scheme and whether there is some silent breakage lurking. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2992df73 |
|
30-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Implement DREW lock A (D)ouble (R)eader (W)riter (E)xclustion lock is a locking primitive that allows to have multiple readers or multiple writers but not multiple readers and writers holding it concurrently. The code is factored out from the existing open-coded locking scheme used to exclude pending snapshots from nocow writers and vice-versa. Current implementation actually favors Readers (that is snapshot creaters) to writers (nocow writers of the filesystem). The API provides lock/unlock/trylock for reads and writes. Formal specification for TLA+ provided by Valentin Schneider is at https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c0c907a4 |
|
21-Feb-2020 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: export helpers for subvolume name/id resolution The functions will be used outside of export.c and super.c to allow resolving subvolume name from a given id, eg. for subvolume deletion by id ioctl. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ split from the next patch ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
560b7a4a |
|
18-Feb-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate btrfs_uuid_tree_iterate is called from only once place and its 2nd argument is always btrfs_check_uuid_tree_entry. Simplify btrfs_uuid_tree_iterate's signature by removing its 2nd argument and directly calling btrfs_check_uuid_tree_entry. Also move the latter into uuid-tree.h. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fe119a6e |
|
20-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: switch to per-transaction pinned extents This commit flips the switch to start tracking/processing pinned extents on a per-transaction basis. It mostly replaces all references from btrfs_fs_info::(pinned_extents|freed_extents[]) to btrfs_transaction::pinned_extents. Two notable modifications that warrant explicit mention are changing clean_pinned_extents to get a reference to the previously running transaction. The other one is removal of call to btrfs_destroy_pinned_extent since transactions are going to be cleaned in btrfs_cleanup_one_transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9fce5704 |
|
20-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_pin_extent_for_log_replay take transaction handle Preparation for refactoring pinned extents tracking. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7bfc1007 |
|
20-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_pin_reserved_extent take transaction handle btrfs_pin_reserved_extent is now only called with a valid transaction so exploit the fact to take a transaction. This is preparation for tracking pinned extents on a per-transaction basis. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b25c36f8 |
|
20-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_pin_extent take trans handle Preparation for switching pinned extent tracking to a per-transaction basis. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bd647ce3 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a leak check for roots Now that we're going to start relying on getting ref counting right for roots, add a list to track allocated roots and print out any roots that aren't freed up at free_fs_info time. Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for developers to verify they aren't breaking things. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d4b0463 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: export and rename free_fs_info We're going to start freeing roots and doing other complicated things in free_fs_info, so we need to move it to disk-io.c and export it in order to use things lik btrfs_put_fs_root(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
41a2ee75 |
|
17-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce per-inode file extent tree In order to keep track of where we have file extents on disk, and thus where it is safe to adjust the i_size to, we need to have a tree in place to keep track of the contiguous areas we have file extents for. Add helpers to use this tree, as it's not required for NO_HOLES file systems. We will use this by setting DIRTY for areas we know we have file extent item's set, and clearing it when we remove file extent items for truncation. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7227ff4d |
|
21-Jan-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between adding and putting tree mod seq elements and nodes There is a race between adding and removing elements to the tree mod log list and rbtree that can lead to use-after-free problems. Consider the following example that explains how/why the problems happens: 1) Task A has mod log element with sequence number 200. It currently is the only element in the mod log list; 2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to access the tree mod log. When it enters the function, it initializes 'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock' before checking if there are other elements in the mod seq list. Since the list it empty, 'min_seq' remains set to (u64)-1. Then it unlocks the lock 'tree_mod_seq_lock'; 3) Before task A acquires the lock 'tree_mod_log_lock', task B adds itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a sequence number of 201; 4) Some other task, name it task C, modifies a btree and because there elements in the mod seq list, it adds a tree mod elem to the tree mod log rbtree. That node added to the mod log rbtree is assigned a sequence number of 202; 5) Task B, which is doing fiemap and resolving indirect back references, calls btrfs get_old_root(), with 'time_seq' == 201, which in turn calls tree_mod_log_search() - the search returns the mod log node from the rbtree with sequence number 202, created by task C; 6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating the mod log rbtree and finds the node with sequence number 202. Since 202 is less than the previously computed 'min_seq', (u64)-1, it removes the node and frees it; 7) Task B still has a pointer to the node with sequence number 202, and it dereferences the pointer itself and through the call to __tree_mod_log_rewind(), resulting in a use-after-free problem. This issue can be triggered sporadically with the test case generic/561 from fstests, and it happens more frequently with a higher number of duperemove processes. When it happens to me, it either freezes the VM or it produces a trace like the following before crashing: [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1 [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 1245.321287] RIP: 0010:rb_next+0x16/0x50 [ 1245.321307] Code: .... [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202 [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80 [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000 [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038 [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8 [ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000 [ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0 [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1245.321706] Call Trace: [ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs] [ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs] [ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs] [ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs] [ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322066] do_vfs_ioctl+0x45a/0x750 [ 1245.322081] ksys_ioctl+0x70/0x80 [ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 1245.322113] __x64_sys_ioctl+0x16/0x20 [ 1245.322126] do_syscall_64+0x5c/0x280 [ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1245.322155] RIP: 0033:0x7fdee3942dd7 [ 1245.322177] Code: .... [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7 [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004 [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44 [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48 [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50 [ 1245.322423] Modules linked in: .... [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]--- Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum sequence number and iterates the rbtree while holding the lock 'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock' lock, since it is now redundant. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
68c467cb |
|
16-Dec-2019 |
David Sterba <dsterba@suse.com> |
btrfs: separate definition of assertion failure handlers There's a report where objtool detects unreachable instructions, eg.: fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction This seems to be a false positive due to compiler version. The cause is in the ASSERT macro implementation that does the conditional check as IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef. To avoid that, use the ifdefs directly. There are still 2 reports that aren't fixed: fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ddf648f |
|
02-Jan-2020 |
Dennis Zhou <dennis@kernel.org> |
btrfs: keep track of discard reuse stats Keep track of how much we are discarding and how often we are reusing with async discard. The discard_*_bytes values don't need any special protection because the work item provides the single threaded access. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7fe6d45e |
|
02-Jan-2020 |
Dennis Zhou <dennis@kernel.org> |
btrfs: have multiple discard lists Non-block group destruction discarding currently only had a single list with no minimum discard length. This can lead to caravaning more meaningful discards behind a heavily fragmented block group. This adds support for multiple lists with minimum discard lengths to prevent the caravan effect. We promote block groups back up when we exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support only 2 lists with filters of 1MB and 32KB respectively. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19b2a2c7 |
|
02-Jan-2020 |
Dennis Zhou <dennis@kernel.org> |
btrfs: make max async discard size tunable Expose max_discard_size as a tunable via sysfs and switch the current fixed maximum to the default value. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e93591bb |
|
02-Jan-2020 |
Dennis Zhou <dennis@kernel.org> |
btrfs: add kbps discard rate limit for async discard Provide the ability to rate limit based on kbps in addition to iops as additional guides for the target discard rate. The delay used ends up being max(kbps_delay, iops_delay). Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a2309300 |
|
02-Jan-2020 |
Dennis Zhou <dennis@kernel.org> |
btrfs: calculate discard delay based on number of extents An earlier patch keeps track of discardable_extents. These are undiscarded extents managed by the free space cache. Here, we will use this to dynamically calculate the discard delay interval. There are 3 rate to consider. The first is the target convergence rate, the rate to discard all discardable_extents over the BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5dc7c10b |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: keep track of discardable_bytes for async discard Keep track of this metric so that we can understand how ahead or behind we are in discarding rate. This uses the same accounting method as discardable_extents, deltas between previous/current values and propagating them up. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dfb79ddb |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: track discardable extents for async discard The number of discardable extents will serve as the rate limiting metric for how often we should discard. This keeps track of discardable extents in the free space caches by maintaining deltas and propagating them to the global count. The deltas are calculated from 2 values stored in PREV and CURR entries, then propagated up to the global discard ctl. The current counter value becomes the previous counter value after update. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4faab84 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: sysfs: add UUID/debug/discard directory Setup base sysfs directory for discard stats + tunables. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
93945cb4 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: sysfs: make UUID/debug have its own kobject Btrfs only allowed attributes to be exposed in debug/. Let's let other groups be created by making debug its own kobject. This also makes the per-fs debug options separate from the global features mount attributes. This seems to be needed as sysfs_create_files() requires const struct attribute * while sysfs_create_group() can take struct attribute *. This seems nicer as per file system, you'll probably use to_fs_info(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6e80d4f8 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: handle empty block_group removal for async discard block_group removal is a little tricky. It can race with the extent allocator, the cleaner thread, and balancing. The current path is for a block_group to be added to the unused_bgs list. Then, when the cleaner thread comes around, it starts a transaction and then proceeds with removing the block_group. Extents that are pinned are subsequently removed from the pinned trees and then eventually a discard is issued for the entire block_group. Async discard introduces another player into the game, the discard workqueue. While it has none of the racing issues, the new problem is ensuring we don't leave free space untrimmed prior to forgetting the block_group. This is handled by placing fully free block_groups on a separate discard queue. This is necessary to maintain discarding order as in the future we will slowly trim even fully free block_groups. The ordering helps us make progress on the same block_group rather than say the last fully freed block_group or needing to search through the fully freed block groups at the beginning of a list and insert after. The new order of events is a fully freed block group gets placed on the unused discard queue first. Once it's processed, it will be placed on the unusued_bgs list and then the original sequence of events will happen, just without the final whole block_group discard. The mount flags can change when processing unused_bgs, so when flipping from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt free block groups on the discard_list to the unused_bg queue which will do the final discard for us. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b0643e59 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: add the beginning of async discard, discard workqueue When discard is enabled, everytime a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations. This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to a LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series. For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one in the same. On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache. discard=sync: p99 total per minute p99 total per minute Drive | extent_commit() (ms) | commit_trans() (ms) --------------------------------------------------------------- Drive A | 434 | 1170 Drive B | 880 | 2330 Drive C | 2943 | 3920 Drive D | 4763 | 5701 discard=async: p99 total per minute p99 total per minute Drive | extent_commit() (ms) | commit_trans() (ms) -------------------------------------------------------------- Drive A | 134 | 956 Drive B | 64 | 1972 Drive C | 59 | 1032 Drive D | 62 | 1200 While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and and the delta between the two are substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
46b27f50 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: rename DISCARD mount option to to DISCARD_SYNC This series introduces async discard which will use the flag DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is synchronously done in transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39b07b5d |
|
02-Dec-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: drop create parameter to btrfs_get_extent() We only pass this as 1 from __extent_writepage_io(). The parameter basically means "pretend I didn't pass in a page". This is silly since we can simply not pass in the page. Get rid of the parameter from btrfs_get_extent(), and since it's used as a get_extent_t callback, remove it from get_extent_t and btree_get_extent(), neither of which need it. While we're here, let's document btrfs_get_extent(). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
db72e47f |
|
10-Dec-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums() We can encode this in the offset parameter: -1 means use the page offsets, anything else is a valid offset. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e62958fc |
|
02-Dec-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappers Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0fbf736 |
|
21-Nov-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extent __btrfs_free_reserved_extent now performs the actions of btrfs_free_and_pin_reserved_extent. But this name is a bit of a misnomer, since the extent is not really freed but just pinned. Reflect this in the new name. No semantics changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
40e046ac |
|
05-Dec-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix missing data checksums after replaying a log tree When logging a file that has shared extents (reflinked with other files or with itself), we can end up logging multiple checksum items that cover overlapping ranges. This confuses the search for checksums at log replay time causing some checksums to never be added to the fs/subvolume tree. Consider the following example of a file that shares the same extent at offsets 0 and 256Kb: [ bytenr 13893632, offset 64Kb, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 13893632, offset 0, len 256Kb ] 256Kb 512Kb When logging the inode, at tree-log.c:copy_items(), when processing the file extent item at offset 0, we log a checksum item covering the range 13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 + 64Kb + 64Kb, respectively. Later when processing the extent item at offset 256K, we log the checksums for the range from 13893632 to 14155776 (which corresponds to 13893632 + 256Kb). These checksums get merged with the checksum item for the range from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync. So after this we get the two following checksum items in the log tree: (...) item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512 range start 13631488 end 14155776 length 524288 item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64 range start 13959168 end 14024704 length 65536 The first one covers the range from the second one, they overlap. So far this does not cause a problem after replaying the log, because when replaying the file extent item for offset 256K, we copy all the checksums for the extent 13893632 from the log tree to the fs/subvolume tree, since searching for an checksum item for bytenr 13893632 leaves us at the first checksum item, which covers the whole range of the extent. However if we write 64Kb to file offset 256Kb for example, we will not be able to find and copy the checksums for the last 128Kb of the extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb. After writing 64Kb into file offset 256Kb we get the following extent layout for our file: [ bytenr 13893632, offset 64K, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 14155776, offset 0, len 64Kb ] 256Kb 320Kb [ bytenr 13893632, offset 64Kb, len 192Kb ] 320Kb 512Kb After fsync'ing the file, if we have a power failure and then mount the filesystem to replay the log, the following happens: 1) When replaying the file extent item for file offset 320Kb, we lookup for the checksums for the extent range from 13959168 (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call to btrfs_lookup_csums_range(); 2) btrfs_lookup_csums_range() finds the checksum item that starts precisely at offset 13959168 (item 7 in the log tree, shown before); 3) However that checksum item only covers 64Kb of data, and not 192Kb of data; 4) As a result only the checksums for the first 64Kb of data referenced by the file extent item are found and copied to the fs/subvolume tree. The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get the corresponding data checksums found and copied to the fs/subvolume tree. 5) After replaying the log userspace will not be able to read the file range from 384Kb to 512Kb, because the checksums are missing and resulting in an -EIO error. The following steps reproduce this scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar <power failure> $ mount /dev/sdc /mnt/sdc $ md5sum /mnt/sdc/foobar md5sum: /mnt/sdc/foobar: Input/output error $ dmesg | tail [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408 [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504 [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600 [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696 [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792 [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888 [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984 [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080 [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 Fix this simply by deleting first any checksums, from the log tree, for the range of the extent we are logging at copy_items(). This ensures we do not get checksum items in the log tree that have overlapping ranges. This is a long time issue that has been present since we have the clone (and deduplication) ioctl, and can happen both when an extent is shared between different files and within the same file. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32da5386 |
|
29-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cfbb825c |
|
10-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add incompat for raid1 with 3, 4 copies The new raid1c3 and raid1c4 profiles are backward incompatible and the name shall be 'raid1c34', the status can be found in the global supported features in /sys/fs/btrfs/features or in the per-filesystem directory. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8d6fac00 |
|
02-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add support for 4-copy replication (raid1c4) Add new block group profile to store 4 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2- and 3-copy RAID1. The minimum number of devices is 4, the maximum number of devices/chunks that can be lost/damaged is 3. There is no comparable traditional RAID level, the profile is added for future needs to accompany triple-parity and beyond. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
47e6f742 |
|
02-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add support for 3-copy replication (raid1c3) Add new block group profile to store 3 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2-copy RAID1. The minimum number of devices is 3, the maximum number of devices/chunks that can be lost/damaged is 2. Like RAID6 but with 33% space utilization. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0222dfdd |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename extent buffer block group item accessors Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and manipulate the items there, there's no special prefix required. The block group accessors had _disk_ because previously the names were occupied by the on-stack accessors. As this has been addressed in the previous patch, we can now unify the naming. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de0dc456 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename block_group_item on-stack accessors to follow naming All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the name, the block group ones were not following that scheme, so let's switch them. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4e967be |
|
08-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add member for a specific checksum driver Currently all the checksum algorithms generate a fixed size digest size and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes and BLAKE2b produces digest of 512 bits by default. We can't do that and will use the blake2b-256, this needs to be passed to the crypto API. Separate that from the base algorithm name and add a member to request specific driver, in this case with the digest size. The only place that uses the driver name is the crypto API setup. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7cea56c |
|
07-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: sysfs: export supported checksums Export supported checksum algorithms via sysfs in the list of static features: /sys/fs/btrfs/features/supported_checksums Space spearated list of checksum algorithm names. Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba8a9d07 |
|
10-Jul-2019 |
Chris Mason <clm@fb.com> |
Btrfs: delete the entire async bio submission framework Now that we're not using btrfs_schedule_bio() anymore, delete all the code that supported it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e1f60a65 |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add __pure attribute to functions The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure . The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4143cb8b |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add const function attribute For some reason the attribute is called __attribute_const__ and not __const, marks functions that have no observable effects on program state, IOW not reading pointers, just the arguments and calculating a value. Allows the compiler to do some optimizations, based on -Wsuggest-attribute=const . The effects are rather small, though, about 60 bytes decrese of btrfs.ko. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c66e0d4 |
|
03-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused parameter is_new from btrfs_iget The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1f95ec01 |
|
24-Sep-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move btrfs_unlock_up_safe to other locking functions The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c7d3a54 |
|
23-Sep-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move extent_io_tree defs to their own header extent_io.c/h are huge, encompassing a bunch of different things. The extent_io_tree code can live on its own, so separate this out. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8702ba93 |
|
14-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() [Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC. PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done. [Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); We always free qgroup PREALLOC meta space. While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0)); We only free qgroup PREALLOC meta space when something went wrong. [The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space. The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata() - Free the unused metadata space Normally in: btrfs_delalloc_release_extents() |- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not. E.g. for 64K buffered write, the metadata rsv works like: /* The first page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() /* The first page caused one outstanding extent, thus needs metadata rsv */ /* The 2nd page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed /* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */ /* The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent) This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta(). The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT. - The trans commit in advance for qgroup would hide the bug Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug. [FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release(). Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
80ed4548 |
|
12-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: don't needlessly create extent-refs kernel thread The patch 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") removed the async delayed refs but the thread has been created, without any use. Remove it to avoid resource consumption. Fixes: 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
af024ed2 |
|
30-Aug-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: create structure to encode checksum type and length Create a structure to encode the type and length for the known on-disk checksums. This makes it easier to add new checksums later. The structure and helpers are moved from ctree.h so they don't occupy space in all headers including ctree.h. This save some space in the final object. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c82f823c |
|
09-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: tie extent buffer and it's token together Further simplifaction of the get/set helpers is possible when the token is uniquely tied to an extent buffer. A condition and an assignment can be avoided. The initializations are moved closer to the first use when the extent buffer is valid. There's one exception in __push_leaf_left where the token is reused. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb495113 |
|
09-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: define separate btrfs_set/get_XX helpers There are helpers for all type widths defined via macro and optionally can use a token which is a cached pointer to avoid repeated mapping of the extent buffer. The token value is known at compile time, when it's valid it's always address of a local variable, otherwise it's NULL passed by the token-less helpers. This can be utilized to remove some branching as the helpers are used frequenlty. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ff49c6a |
|
27-Aug-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it found a backref for the given name. If it returns true then the actual inode_ref struct is returned in one of its parameters. That's pointless, instead refactor the function such that it returns either a pointer to the btrfs_inode_extref or NULL it it didn't find anything. This streamlines the function calling convention. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9bb8407f |
|
27-Aug-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref struct btrfs_find_name_in_backref returns either 0/1 depending on whether it found a backref for the given name. If it returns true then the actual inode_ref struct is returned in one of its parameters. That's pointless, instead refactor the function such that it returns either a pointer to the btrfs_inode_ref or NULL it it didn't find anything. This streamlines the function calling convention. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1dc990df |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move dev_stats helpers to volumes.c The other dev stats functions are already there and the helpers are not used by anything else. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67b61aef |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move struct io_ctl to free-space-cache.h The io_ctl structure is used for free space management, and used only by the v1 space cache code, but unfortunatlly the full definition is required by block-group.h so it can't be moved to free-space-cache.c without additional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
18d0f5c6 |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move functions for tree compare to send.c Send is the only user of tree_compare, we can move it there along with the other helpers and definitions. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4b231ae4 |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename and export read_node_slot Preparatory work for code that will be moved out of ctree and uses this function. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8a953348 |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move private raid56 definitions from ctree.h Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
602cbe91 |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move cond_wake_up functions out of ctree The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3acd4850 |
|
21-Aug-2019 |
Christophe Leroy <christophe.leroy@c-s.fr> |
btrfs: fix allocation of free space cache v1 bitmap pages Various notifications of type "BUG kmalloc-4096 () : Redzone overwritten" have been observed recently in various parts of the kernel. After some time, it has been made a relation with the use of BTRFS filesystem and with SLUB_DEBUG turned on. [ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten [ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc [ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224 [ 22.811193] __slab_alloc.constprop.26+0x44/0x70 [ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec [ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs] [ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs] [ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs] [ 22.812321] find_free_extent+0x680/0x12a4 [btrfs] [ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs] [ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs] [ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs] [ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs] [ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs] [ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs] [ 22.813973] close_ctree+0xf8/0x2a4 [btrfs] [ 22.814107] generic_shutdown_super+0x80/0x110 [ 22.814250] kill_anon_super+0x18/0x30 [ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs] [ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83 [ 22.814841] proc_cgroup_show+0xc0/0x248 [ 22.814967] proc_single_show+0x54/0x98 [ 22.815086] seq_read+0x278/0x45c [ 22.815190] __vfs_read+0x28/0x17c [ 22.815289] vfs_read+0xa8/0x14c [ 22.815381] ksys_read+0x50/0x94 [ 22.815475] ret_from_syscall+0x0/0x38 Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy") changed the way bitmap blocks are copied. But allthough bitmaps have the size of a page, they were allocated with kzalloc(). Most of the time, kzalloc() allocates aligned blocks of memory, so copy_page() can be used. But when some debug options like SLAB_DEBUG are activated, kzalloc() may return unaligned pointer. On powerpc, memcpy(), copy_page() and other copying functions use 'dcbz' instruction which provides an entire zeroed cacheline to avoid memory read when the intention is to overwrite a full line. Functions like memcpy() are writen to care about partial cachelines at the start and end of the destination, but copy_page() assumes it gets pages. As pages are naturally cache aligned, copy_page() doesn't care about partial lines. This means that when copy_page() is called with a misaligned pointer, a few leading bytes are zeroed. To fix it, allocate bitmaps through kmem_cache instead of using kzalloc() The cache pool is created with PAGE_SIZE alignment constraint. Reported-by: Erhard F. <erhard_f@mailbox.org> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371 Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy") Cc: stable@vger.kernel.org # 4.19+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_free_space_bitmap ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2bd36e7b |
|
22-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename the btrfs_calc_*_metadata_size helpers btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However truncate _and_ changing will never split nodes, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely for inserting items, so rename this to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0785a9aa |
|
08-Aug-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: tree-checker: Add EXTENT_DATA_REF check EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its key->offset. This patch will check the following contents: - Key->objectid Basic alignment check. - Hash Hash of each extent_data_ref item must match key->offset. - Offset Basic alignment check. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3984c90 |
|
01-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce an evict flushing state We have this weird space flushing loop inside inode.c for evict where we'll do the normal LIMIT flush, and then commit the transaction and hope we get our space. This is super janky, and in fact there's really nothing stopping us from using FLUSH_ALL except that we run delayed iputs, which means we could deadlock. So introduce a new flush state for eviction that does the normal priority flushing with all of the states that are safe for eviction. The nice side-effect of this is that we'll try harder for evictions. Previously if (for example generic/269) you had a bunch of other operations happening on the fs you could race with those reservations when committing the transaction, and eventually miss getting a reservation for the evict. With this code we'll have our ticket in place through the transaction commit, so any pinned bytes will go to our pending evictions first. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
844245b4 |
|
01-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a flush step for delayed iputs Delayed iputs could very well free up enough space without needing to commit the transaction, so make this step it's own step. This will allow us to skip the step for evictions in a later patch. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3e43c279 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group cleanup code This can now be easily migrated as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh on top of sysfs cleanups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
878d7b67 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the alloc_profile helpers These feel more at home in block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, adjust btrfs_get_alloc_profile exports ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07730d87 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the chunk allocation code This feels more at home in block-group.c than in extent-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>i [ refresh ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
77745c05 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the dirty bg writeout code This can be easily migrated over now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
26ce2095 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate inc/dec_block_group_ro code This can easily be moved now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4358d963 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group read/creation code All of the prep work has been done so we can now cleanly move this chunk over. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, add btrfs_get_alloc_profile export, comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3e0520b |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group removal code This is the removal code and the unused bgs code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, move clear_incompat_bg_bits ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f21246d |
|
06-Aug-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group caching code We can now just copy it over to block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67715b20 |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup kobject.h includes The kobject should be pulled in via sysfs.h and that needs to include it because it needs various definitions like kobj_attribute or kobject. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
89439109 |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move sysfs declarations out of ctree.h As the header for sysfs code already exists, use it to clean up ctree.h. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f410d1b |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: export the excluded extents helpers We'll need this to move the caching stuff around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3eeb3226 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate nocow and reservation helpers These are relatively straightforward as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cad1284 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group ref counting stuff Another easy set to move over to block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2e405ad8 |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the block group lookup code Move these bits first as they are the easiest to move. Export two of the helpers so they can be moved all at once. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aac0023c |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move basic block_group definitions to their own header This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82253cb6 |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused key type set/get helpers The switch to open coded set/get has happend long time ago in 962a298f3511 ("btrfs: kill the key type accessor helpers"), remove the stray helpers. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
112974d4 |
|
19-Jul-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate() [BUG] Test case btrfs/156 fails since commit 302167c50b32 ("btrfs: don't end the transaction for delayed refs in throttle") with ENOSPC. [CAUSE] The ENOSPC is reported from btrfs_can_relocate(). This function will check: - If this block group is empty, we can relocate - If we can enough free space, we can relocate Above checks are valid but the following check is vague due to its implementation: - If and only if we can allocated a new block group to contain all the used space, we can relocate This design itself is OK, but the way to determine if we can allocate a new block group is problematic. btrfs_can_relocate() uses find_free_dev_extent() to find free space on a device. However find_free_dev_extent() only searches commit root and excludes dev extents allocated in current trans, this makes it unable to use dev extent just freed in current transaction. So for the following example, btrfs_can_relocate() will report ENOSPC: The example block group layout: 1M 129M 257M 385M 513M 550M |///////|///////////|//////////| | | // = Used bg, consider all bg is 100% used for easy calculation. And all block groups are SINGLE, on-disk bytenr is the same as the logical bytenr. 1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100 1M 129M 257M 385M 513M 550M |///////| |//////////|/////////| In transid 100, bg in [129M, 257M) get relocated to [385M, 513M) However transid 100 is not committed yet, so in dev commit tree, we still have the old dev extents layout: 1M 129M 257M 385M 513M 550M |///////|///////////|//////////| | | 2) Try to relocate bg [257M, 385M) We goes into btrfs_can_relocate(), no free space in current bgs, so we check if we can find large enough free dev extents. The first slot is [385M, 513M), but that is already used by new bg at [385M, 513M), so we continue search. The remaining slot is [512M, 550M), smaller than the bg's length 128M. So btrfs_can_relocate report ENOSPC. However this is over killed, in fact if we just skip btrfs_can_relocate() check, and go into regular relocation routine, at extent reservation time, if we can't find free extent, then we fallback to commit transaction, which will free up the dev extents and allow new block group to be created. [FIX] The fix here is to remove btrfs_can_relocate() completely. If we hit the false ENOSPC case just like btrfs/156, extent allocator will push harder by committing transaction and we will have space for new block group, avoiding the false ENOSPC. If we really ran out of space, we will hit ENOSPC at relocate_block_group(), and btrfs will just reports the ENOSPC error as usual. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
330a5827 |
|
17-Jul-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove leftover of in-band dedupe It's unlikely in-band dedupe is going to land so just remove any leftovers - dedupe.h header as well as the 'dedupe' parameter to btrfs_set_extent_delalloc. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
690a5dbf |
|
05-Jul-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents When cloning extents (or deduplicating) we create a transaction with a space reservation that considers we will drop or update a single file extent item of the destination inode (that we modify a single leaf). That is fine for the vast majority of scenarios, however it might happen that we need to drop many file extent items, and adjust at most two file extent items, in the destination root, which can span multiple leafs. This will lead to either the call to btrfs_drop_extents() to fail with ENOSPC or the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode() (called through clone_finish_inode_update()) to fail with ENOSPC. Such failure results in a transaction abort, leaving the filesystem in a read-only mode. In order to fix this we need to follow the same approach as the hole punching code, where we create a local reservation with 1 unit and keep ending and starting transactions, after balancing the btree inode, when __btrfs_drop_extents() returns ENOSPC. So fix this by making the extent cloning call calls the recently added btrfs_punch_hole_range() helper, which is what does the mentioned work for hole punching, and make sure whenever we drop extent items in a transaction, we also add a replacing file extent item, to avoid corruption (a hole) if after ending a transaction and before starting a new one, the old transaction gets committed and a power failure happens before we finish cloning. A test case for fstests follows soon. Reported-by: David Goodwin <david@codepoets.co.uk> Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/ Reported-by: Sam Tygier <sam@tygier.co.uk> Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/ Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7cd4dd9 |
|
07-Aug-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix sysfs warning and missing raid sysfs directories In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups"), we started using the member "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads to a series of warnings when running some test cases of fstests, such as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those warnings are like the following: [116648.059212] kernfs: can not remove 'total_bytes', no directory [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000 [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001 [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.080585] Call Trace: [116648.081262] remove_files+0x31/0x70 [116648.081929] sysfs_remove_group+0x38/0x80 [116648.082596] sysfs_remove_groups+0x34/0x70 [116648.083258] kobject_del+0x20/0x60 [116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.084608] close_ctree+0x19a/0x380 [btrfs] [116648.085278] generic_shutdown_super+0x6c/0x110 [116648.085951] kill_anon_super+0xe/0x30 [116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.087289] deactivate_locked_super+0x3a/0x70 [116648.087956] cleanup_mnt+0xb4/0x160 [116648.088620] task_work_run+0x7e/0xc0 [116648.089285] exit_to_usermode_loop+0xfa/0x100 [116648.089933] do_syscall_64+0x1cb/0x220 [116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.091197] RIP: 0033:0x7f9cdc073b37 (...) [116648.100046] ---[ end trace 22e24db328ccadf8 ]--- [116648.100618] ------------[ cut here ]------------ [116648.101175] kernfs: can not remove 'used_bytes', no directory [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000 [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001 [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.116649] Call Trace: [116648.117326] remove_files+0x31/0x70 [116648.117997] sysfs_remove_group+0x38/0x80 [116648.118671] sysfs_remove_groups+0x34/0x70 [116648.119342] kobject_del+0x20/0x60 [116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.120707] close_ctree+0x19a/0x380 [btrfs] [116648.121396] generic_shutdown_super+0x6c/0x110 [116648.122057] kill_anon_super+0xe/0x30 [116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.123335] deactivate_locked_super+0x3a/0x70 [116648.123961] cleanup_mnt+0xb4/0x160 [116648.124586] task_work_run+0x7e/0xc0 [116648.125210] exit_to_usermode_loop+0xfa/0x100 [116648.125830] do_syscall_64+0x1cb/0x220 [116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.127080] RIP: 0033:0x7f9cdc073b37 (...) [116648.135923] ---[ end trace 22e24db328ccadf9 ]--- These happen because, during the unmount path, we call kobject_del() for raid kobjects that are not fully initialized, meaning that we set their ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set their parent kobject, which is done through btrfs_add_raid_kobjects(). We have this split raid kobject setup since commit 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") in order to avoid triggering reclaim during contextes where we can not (either we are holding a transaction handle or some lock required by the transaction commit path), so that we do the calls to kobject_add(), which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects() in contextes where it is safe to trigger reclaim. That change expected that a new raid kobject can only be created either when mounting the filesystem or after raid profile conversion through the relocation path. However, we can have new raid kobject created in other two cases at least: 1) During device replace (or scrub) after adding a device a to the filesystem. The replace procedure (and scrub) do calls to btrfs_inc_block_group_ro() which can allocate a new block group with a new raid profile (because we now have more devices). This can be triggered by test cases btrfs/027 and btrfs/176. 2) During a degraded mount trough any write path. This can be triggered by test case btrfs/124. Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only makes things more complex and fragile, can also introduce deadlocks with reclaim the following way: 1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or anywhere in the replace/scrub path will cause a deadlock with reclaim because if reclaim happens and a transaction commit is triggered, the transaction commit path will block at btrfs_scrub_pause(). 2) During degraded mounts it is essentially impossible to figure out where to add extra calls to btrfs_add_raid_kobjects(), because allocation of a block group with a new raid profile can happen anywhere, which means we can't safely figure out which contextes are safe for reclaim, as we can either hold a transaction handle or some lock needed by the transaction commit path. So it is too complex and error prone to have this split setup of raid kobjects. So fix the issue by consolidating the setup of the kobjects in a single place, at link_block_group(), and setup a nofs context there in order to prevent reclaim being triggered by the memory allocations done through the call chain of kobject_add(). Besides fixing the sysfs warnings during kobject_del(), this also ensures the sysfs directories for the new raid profiles end up created and visible to users (a bug that existed before the 5.3 commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")). Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
86736342 |
|
19-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the delalloc space stuff to it's own home We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb6dea26 |
|
19-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate btrfs_trans_release_chunk_metadata Move this into transaction.c with the rest of the transaction related code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ef03deb |
|
19-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the delayed refs rsv code These belong with the delayed refs related code, not in extent-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d12ffdd1 |
|
19-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_block_rsv definitions into it's own header Prep work for separating out all of the block_rsv related code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2a67a76 |
|
18-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: export block_rsv_use_bytes We are going to need this to move the metadata reservation stuff to space_info.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc471cb0 |
|
18-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename do_chunk_alloc to btrfs_chunk_alloc Really we just need the enum, but as we break more things up it'll help to have this external to extent-tree.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8719aaae |
|
18-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move space_info to space-info.h Migrate the struct definition and the one helper that's in ctree.h into space-info.h Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c9d713d5 |
|
13-Jun-2019 |
David Sterba <dsterba@suse.com> |
btrfs: improve messages when updating feature flags Currently the messages printed after setting an incompat feature are cryptis, we can easily make it better as the textual description is passed to the helpers. Old: setting 128 feature flag updated: setting incompat feature flag for RAID56 (0x80) Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9e967495 |
|
22-Apr-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: prevent send failures and crashes due to concurrent relocation Send always operates on read-only trees and always expected that while it is in progress, nothing changes in those trees. Due to that expectation and the fact that send is a read-only operation, it operates on commit roots and does not hold transaction handles. However relocation can COW nodes and leafs from read-only trees, which can cause unexpected failures and crashes (hitting BUG_ONs). while send using a node/leaf, it gets COWed, the transaction used to COW it is committed, a new transaction starts, the extent previously used for that node/leaf gets allocated, possibly for another tree, and the respective extent buffer' content changes while send is still using it. When this happens send normally fails with EIO being returned to user space and messages like the following are found in dmesg/syslog: [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253 [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984 Other times, less often, we hit a BUG_ON() because an extent buffer that send is using used to be a node, and while send is still using it, it got COWed and got reused as a leaf while send is still using, producing the following trace: [ 3478.466280] ------------[ cut here ]------------ [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806! [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1 [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246 [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002 [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000 [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708 [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000 [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0 [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3478.479003] Call Trace: [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0 [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs] [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs] [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs] [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs] [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs] [ 3478.483581] ? rq_clock_task+0x2e/0x60 [ 3478.484086] ? wake_up_new_task+0x1f3/0x370 [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0 [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0 [ 3478.486016] ? __fget+0x113/0x200 [ 3478.486467] ksys_ioctl+0x70/0x80 [ 3478.486911] __x64_sys_ioctl+0x16/0x20 [ 3478.487337] do_syscall_64+0x60/0x1b0 [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7 (...) [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7 [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005 [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700 [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005 [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001 (...) [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]--- Another possibility, much less likely to happen, is that send will not fail but the contents of the stream it produces may not be correct. To avoid this, do not allow send and relocation (balance) to run in parallel. In the long term the goal is to allow for both to be able to run concurrently without any problems, but that will take a significant effort in development and testing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
71a9c488 |
|
31-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: document BTRFS_MAX_MIRRORS The real meaning of that constant is not clear from the context due to the target device inclusion. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f8e4fd4 |
|
30-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use file:line format for assertion report The filename:line format is commonly understood by editors and can be copy&pasted more easily than the current format. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d97c6e3 |
|
03-Jun-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: add boilerplate code for directly including the crypto framework Add boilerplate code for directly including the crypto framework. This helps us flipping the switch for new algorithms. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1e25a2e3 |
|
22-May-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: don't assume ordered sums to be 4 bytes BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums is 4 bytes. While this is true for CRC32C, it is not for any other checksum. Change the data type to be a byte array and adjust loop index calculation accordingly. This includes moving the adjustment of 'index' by 'ins_size' in btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum size, because before this patch the 'sums' member of 'struct btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
65019df8 |
|
22-May-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: resurrect btrfs_crc32c() Commit 9678c54388b6 ("btrfs: Remove custom crc32c init code") removed the btrfs_crc32c() function, because it was a duplicate of the crc32c() library function we already have in the kernel. Resurrect it as a shim wrapper over crc32c() to make following transformations of the checksumming code in btrfs easier. Also provide a btrfs_crc32_final() to ease following transformations. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c8bf1b67 |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove mapping tree structures indirection fs_info::mapping_tree is the physical<->logical mapping tree and uses the same underlying structure as extents, but is embedded to another structure. There are no other members and this indirection is useless. No functional change. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b4e675a |
|
16-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: detect fast implementation of crc32c on all architectures Currently, there's only check for fast crc32c implementation on X86, based on the CPU flags. This is used to decide if checksumming should be offloaded to worker threads or can be calculated by the caller. As there are more architectures that implement a faster version of crc32c (ARM, SPARC, s390, MIPS, PowerPC), also there are specialized hw cards. The detection is based on driver name, all generic C implementations contain 'generic', while the specialized versions do not. Alternatively the priority could be used, but this is not currently provided by the crypto API. The flag is set per-filesystem at mount time and used for the offloading decisions. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
26602cab |
|
10-Apr-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: use ->free_inode() a lot of stuff remains in ->destroy_inode() Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4297ff84 |
|
10-Apr-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: track DIO bytes in flight When diagnosing a slowdown of generic/224 I noticed we were not doing anything when calling into shrink_delalloc(). This is because all writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes counter is 0, which short circuits most of the work inside of shrink_delalloc(). However O_DIRECT writes still consume metadata resources and generate ordered extents, which we can still wait on. Fix this by tracking outstanding DIO write bytes, and use this as well as the delalloc bytes counter to decide if we need to lookup and wait on any ordered extents. If we have more DIO writes than delalloc bytes we'll go ahead and wait on any ordered extents regardless of our flush state as flushing delalloc is likely to not gain us anything. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ use dio instead of odirect in identifiers ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62d54f3a |
|
22-Apr-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between send and deduplication that lead to failures and crashes Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expection is due to two different reasons: 1) When it was introduced, no operations were allowed to modifiy read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (on September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation, can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currenly being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. The most common error to hit is the send ioctl return -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an exlicit warning in dmesg/syslog. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5c8daa5 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flags Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c71dd880 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from btrfs_extend_item Signed-off-by: David Sterba <dsterba@suse.com>
|
#
78ac4f9e |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter fs_info from btrfs_truncate_item Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ffd4bb2a |
|
04-Apr-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent() Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long parameter list and the confusing @owner parameter. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82fa113f |
|
04-Apr-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
163e97ee |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from device in btrfs_scrub_cancel_dev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32b593bf |
|
17-Apr-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove no longer used function to run delayed refs asynchronously It used to be called from only two places (truncate path and releasing a transaction handle), but commits 28bad2125767c5 ("btrfs: fix truncate throttling") and db2462a6ad3dc4 ("btrfs: don't run delayed refs in the end transaction logic") removed their calls to this function, so it's not used anymore. Just remove it and all its helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5742d15f |
|
19-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in btrfs_write_dirty_block_groups We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bbebb3e0 |
|
19-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in btrfs_setup_space_cache We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e74e3993 |
|
27-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Factor out in_range macro This is used in more than one places so let's factor it out in ctree.h. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c11b63e |
|
27-Mar-2019 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: replace pending/pinned chunks lists with io tree The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
496245ca |
|
13-Mar-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: tree-checker: Verify inode item There is a report in kernel bugzilla about mismatch file type in dir item and inode item. This inspires us to check inode mode in inode item. This patch will check the following members: - inode key objectid Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO. - inode key offset Should be 0 - inode item generation - inode item transid No newer than sb generation + 1. The +1 is for log tree. - inode item mode No unknown bits. No invalid S_IF* bit. NOTE: S_IFMT check is not enough, need to check every know type. - inode item nlink Dir should have no more link than 1. - inode item flags Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
90b1377d |
|
27-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: qgroup: remove obsolete fs_info members The commit fcebe4562dec ("Btrfs: rework qgroup accounting") reworked qgroups and added some new structures. Another rework of qgroup mechanics e69bcee37692 ("btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.") stopped using them and left uncleaned. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e902baac |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in btrfs_leaf_free_space We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bcdc428c |
|
19-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in btrfs_exclude_logged_extents We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8f881e8c |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in leaf_data_end We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
80fbc341 |
|
19-Mar-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Make btrfs_(set|clear)_header_flag return void From the introduction of btrfs_(set|clear)_header_flag, there is no usage of its return value. So just make it return void. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
afe1a715 |
|
07-Mar-2019 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
btrfs: implement btrfs_debug* in terms of helper macro First, the btrfs_debug macros open-code (one possible definition of) DYNAMIC_DEBUG_BRANCH, so they don't benefit from the CONFIG_JUMP_LABEL optimization. Second, a planned change of struct _ddebug (to reduce its size on 64 bit machines) requires that all descriptors in a translation unit use distinct identifiers. Using the new _dynamic_func_call_no_desc helper macro from dynamic_debug.h takes care of both of these. No functional change. Link: http://lkml.kernel.org/r/20190212214150.4807-12-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Jason Baron <jbaron@akamai.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
78c52d9e |
|
06-Feb-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: check for refs on snapshot delete resume There's a bug in snapshot deletion where we won't update the drop_progress key if we're in the UPDATE_BACKREF stage. This is a problem because we could drop refs for blocks we know don't belong to ours. If we crash or umount at the right time we could experience messages such as the following when snapshot deletion resumes BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258 owner 1 offset 0 ------------[ cut here ]------------ WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs] CPU: 3 PID: 16052 Comm: umount Tainted: G W OE 5.0.0-rc4+ #147 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011 RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs] RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18 RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000 R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0 FS: 00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0 Call Trace: ? _raw_spin_unlock+0x27/0x40 ? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs] __btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs] ? join_transaction+0x2b/0x460 [btrfs] btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs] btrfs_commit_transaction+0x52/0xa50 [btrfs] ? start_transaction+0xa6/0x510 [btrfs] btrfs_sync_fs+0x79/0x1c0 [btrfs] sync_filesystem+0x70/0x90 generic_shutdown_super+0x27/0x120 kill_anon_super+0x12/0x30 btrfs_kill_super+0x16/0xa0 [btrfs] deactivate_locked_super+0x43/0x70 deactivate_super+0x40/0x60 cleanup_mnt+0x3f/0x80 __cleanup_mnt+0x12/0x20 task_work_run+0x8b/0xc0 exit_to_usermode_loop+0xce/0xd0 do_syscall_64+0x20b/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe To fix this simply mark dead roots we read from disk as DEAD and then set the walk_control->restarted flag so we know we have a restarted deletion. From here whenever we try to drop refs for blocks we check to verify our ref is set on them, and if it is not we skip it. Once we find a ref that is set we unset walk_control->restarted since the tree should be in a normal state from then on, and any problems we run into from there are different issues. I tested this with an existing broken fs and my reproducer that creates a broken fs and it fixed both file systems. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ea82076 |
|
12-Feb-2019 |
David Sterba <dsterba@suse.com> |
btrfs: scrub: remove unused nocow worker pointer The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow optimization was removed from scrub in 9bebe665c3e4 ("btrfs: scrub: Remove unused copy_nocow_pages and its callchain"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c8352942 |
|
12-Feb-2019 |
David Sterba <dsterba@suse.com> |
btrfs: scrub: add assertions for worker pointers The scrub worker pointers are not NULL iff the scrub is running, so reset them back once the last reference is dropped. Add assertions to the initial phase of scrub to verify that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ff09c4ca |
|
29-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: scrub: convert scrub_workers_refcnt to refcount_t Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so we get the extra checks. All reference changes are still done under scrub_lock. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
450114fc |
|
21-Nov-2018 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't use global reserve for chunk allocation We've done this forever because of the voodoo around knowing how much space we have. However, we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. This has a side effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves issues I was seeing with the mixed bg tests in xfstests without the new flushing state. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
034f784d |
|
03-Dec-2018 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: replace cleaner_delayed_iput_mutex with a waitqueue The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. The waiting is killable as it could be indirectly called from user operations like fallocate or zero-range. Such call sites should handle the error but otherwise it's not necessary. Eg. flush_space just needs to attempt to make space by waiting on iputs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add killable comment and changelog parts ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2eec5f00 |
|
29-Jan-2019 |
Anders Roxell <anders.roxell@linaro.org> |
btrfs: let the assertion expression compile in all configs A compiler warning (in a patch in development) pointed to a variable that was used only inside and ASSERT: u64 root_objectid = root->root_key.objectid; ASSERT(root_objectid == ...); fs/btrfs/relocation.c: In function ‘insert_dirty_subv’: fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable] u64 root_objectid = root->root_key.objectid; ^~~~~~~~~~~~~ When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used. Rework the assertion helper by adding a runtime check instead of the '#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the condition being passed into an inline function after preprocessing. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
370a11b8 |
|
23-Jan-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Introduce per-root swapped blocks infrastructure To allow delayed subtree swap rescan, btrfs needs to record per-root information about which tree blocks get swapped. This patch introduces the required infrastructure. The designed workflow will be: 1) Record the subtree root block that gets swapped. During subtree swap: O = Old tree blocks N = New tree blocks reloc tree subvolume tree X Root Root / \ / \ NA OB OA OB / | | \ / | | \ NC ND OE OF OC OD OE OF In this case, NA and OA are going to be swapped, record (NA, OA) into subvolume tree X. 2) After subtree swap. reloc tree subvolume tree X Root Root / \ / \ OA OB NA OB / | | \ / | | \ OC OD OE OF NC ND OE OF 3a) COW happens for OB If we are going to COW tree block OB, we check OB's bytenr against tree X's swapped_blocks structure. If it doesn't fit any, nothing will happen. 3b) COW happens for NA Check NA's bytenr against tree X's swapped_blocks, and get a hit. Then we do subtree scan on both subtrees OA and NA. Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND). Then no matter what we do to subvolume tree X, qgroup numbers will still be correct. Then NA's record gets removed from X's swapped_blocks. 4) Transaction commit Any record in X's swapped_blocks gets removed, since there is no modification to swapped subtrees, no need to trigger heavy qgroup subtree rescan for them. This will introduce 128 bytes overhead for each btrfs_root even qgroup is not enabled. This is to reduce memory allocations and potential failures. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d2311e69 |
|
23-Jan-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots Relocation code will drop btrfs_root::reloc_root as soon as merge_reloc_root() finishes. However later qgroup code will need to access btrfs_root::reloc_root after merge_reloc_root() for delayed subtree rescan. So alter the timming of resetting btrfs_root:::reloc_root, make it happens after transaction commit. With this patch, we will introduce a new btrfs_root::state, BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user that although btrfs_root::reloc_tree is still non-NULL, but still it's not used any more. The lifespan of btrfs_root::reloc tree will become: Old behavior | New ------------------------------------------------------------------------ btrfs_init_reloc_root() --- | btrfs_init_reloc_root() --- set reloc_root | | set reloc_root | | | | | | | merge_reloc_root() | | merge_reloc_root() | |- btrfs_update_reloc_root() --- | |- btrfs_update_reloc_root() -+- clear btrfs_root::reloc_root | set ROOT_DEAD_RELOC_TREE | | record root into dirty | | roots rbtree | | | | reloc_block_group() Or | | btrfs_recover_relocation() | | | After transaction commit | | |- clean_dirty_subvols() --- | clear btrfs_root::reloc_root During ROOT_DEAD_RELOC_TREE set lifespan, the only user of btrfs_root::reloc_tree should be qgroup. Since reloc root needs a longer life-span, this patch will also delay btrfs_drop_snapshot() call. Now btrfs_drop_snapshot() is called in clean_dirty_subvols(). This patch will increase the size of btrfs_root by 16 bytes. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4ab47a8d |
|
12-Dec-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused arguments from btrfs_get_extent_fiemap This function is a simple wrapper over btrfs_get_extent that returns either: a) A real extent in the passed range or b) Adjusted extent based on whether delalloc bytes are found backing up a hole. To support these semantics it doesn't need the page/pg_offset/create arguments which are passed to btrfs_get_extent in case an extent is to be created. So simplify the function by removing the unused arguments. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc9a8bf7 |
|
19-Dec-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make first argument of btrfs_run_delalloc_range directly an inode Since this function is no longer a callback there is no need to have its first argument obfuscated with a void *. Change it directly to a pointer to an inode. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd340d0f |
|
11-Jan-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: wakeup cleaner thread when adding delayed iput The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. Delaying iputs means we are potentially delaying the eviction of an inode and it's respective space. The cleaner thread only gets woken up every 30 seconds, or when we require space. If there are a lot of inodes that need to be deleted we could induce a serious amount of latency while we wait for these inodes to be evicted. So instead wakeup the cleaner if it's not already awake to process any new delayed iputs we add to the list. If we suddenly need space we will less likely be backed up behind a bunch of inodes that are waiting to be deleted, and we could possibly free space before we need to get into the flushing logic which will save us some latency. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
31890da0 |
|
21-Nov-2018 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle delayed ref head accounting cleanup in abort We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. The test generic/475 on a 2G VM can trigger the problems eg.: [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs] [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394 [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014 [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs] [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206 [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000 [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400 [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000 [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108 [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100 [ 8502.170296] FS: 00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000 [ 8502.172439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0 [ 8502.175094] Call Trace: [ 8502.175759] close_ctree+0x17f/0x350 [btrfs] [ 8502.176721] generic_shutdown_super+0x64/0x100 [ 8502.177702] kill_anon_super+0x14/0x30 [ 8502.178607] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8502.179602] deactivate_locked_super+0x29/0x60 [ 8502.180595] cleanup_mnt+0x3b/0x70 [ 8502.181406] task_work_run+0x98/0xc0 [ 8502.182255] exit_to_usermode_loop+0x83/0x90 [ 8502.183113] do_syscall_64+0x15b/0x180 [ 8502.183919] entry_SYSCALL_64_after_hwframe+0x49/0xbe Corresponding to release_global_block_rsv() { ... WARN_ON(fs_info->delayed_refs_rsv.reserved > 0); CC: stable@vger.kernel.org Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add log dump ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a65001e8 |
|
10-Dec-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: sanitize security_mnt_opts use 1) keeping a copy in btrfs_fs_info is completely pointless - we never use it for anything. Getting rid of that allows for simpler calling conventions for setup_security_options() (caller is responsible for freeing mnt_opts in all cases). 2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(), same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are* close (in fact, selinux sb_set_mnt_opts() ought to punt to sb_remount() in "already initialized" case), but let's handle that uniformly. And the only reason why the original btrfs changes didn't go for security_sb_remount() in btrfs_remount() case is that it hadn't been exported. Let's export it for a while - it'll be going away soon anyway. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
83354f07 |
|
30-Nov-2018 |
Josef Bacik <jbacik@fb.com> |
btrfs: catch cow on deleting snapshots When debugging some weird extent reference bug I suspected that we were changing a snapshot while we were deleting it, which could explain my bug. This was indeed what was happening, and this patch helped me verify my theory. It is never correct to modify the snapshot once it's being deleted, so mark the root when we are deleting it and make sure we complain about it when it happens. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
64403612 |
|
03-Dec-2018 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rework btrfs_check_space_for_delayed_refs Now with the delayed_refs_rsv we can now know exactly how much pending delayed refs space we need. This means we can drastically simplify btrfs_check_space_for_delayed_refs by simply checking how much space we have reserved for the global rsv (which acts as a spill over buffer) and the delayed refs rsv. If our total size is beyond that amount then we know it's time to commit the transaction and stop any more delayed refs from being generated. With the introduction of dealyed_refs_rsv infrastructure, namely btrfs_update_delayed_refs_rsv we now know exactly how much pending delayed refs space is required. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
413df725 |
|
03-Dec-2018 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add new flushing states for the delayed refs rsv A nice thing we gain with the delayed refs rsv is the ability to flush the delayed refs on demand to deal with enospc pressure. Add states to flush delayed refs on demand, and this will allow us to remove a lot of ad-hoc work around checking to see if we should commit the transaction to run our delayed refs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba2c4d4e |
|
03-Dec-2018 |
Josef Bacik <jbacik@fb.com> |
btrfs: introduce delayed_refs_rsv Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. Historical background: The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictible. Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more we can use the global reserve for that spill over, which fits better into it's desired use than a full blown reservation. This first approach is to simply take how many times we're reserving space for and multiply that by 2 in order to save enough space for the delayed refs that could be generated. This is a niave approach and will probably evolve, but for now it works. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> # high-level review [ added background notes from the cover letter ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
53176dde |
|
04-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: dev-replace: remove custom read/write blocking scheme After the rw semaphore has been added, the custom blocking using ::blocking_readers and ::read_lock_wq is redundant. The blocking logic in __btrfs_map_block is replaced by extending the time the semaphore is held, that has the same blocking effect on writes as the previous custom scheme that waited until ::blocking_readers was zero. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
129827e3 |
|
04-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: dev-replace: swich locking to rw semaphore This is the first part of removing the custom locking and waiting scheme used for device replace. It was probably copied from extent buffer locking, but there's nothing that would require more than is provided by the common locking primitives. The rw spinlock protects waiting tasks counter in case of incompatible locks and the waitqueue. Same as rw semaphore. This patch only switches the locking primitive, for better bisectability. There should be no functional change other than the overhead of the locking and potential sleeping instead of spinning when the lock is contended. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bbe339cc |
|
27-Nov-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop extra enum initialization where using defaults The first auto-assigned value to enum is 0, we can use that and not initialize all members where the auto-increment does the same. This is used for values that are not part of on-disk format. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
61fa90c1 |
|
27-Nov-2018 |
David Sterba <dsterba@suse.com> |
btrfs: switch BTRFS_ROOT_* to enums We can use simple enum for values that are not part of on-disk format: root tree flags. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eb1a524c |
|
27-Nov-2018 |
David Sterba <dsterba@suse.com> |
btrfs: switch BTRFS_FS_* to enums We can use simple enum for values that are not part of on-disk format: internal filesystem states. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
688a75b9 |
|
27-Nov-2018 |
David Sterba <dsterba@suse.com> |
btrfs: switch BTRFS_BLOCK_RSV_* to enums We can use simple enum for values that are not part of on-disk format: block reserve types. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b00146b5 |
|
27-Nov-2018 |
David Sterba <dsterba@suse.com> |
btrfs: switch BTRFS_FS_STATE_* to enums We can use simple enum for values that are not part of on-disk format: global filesystem states. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da12fe54 |
|
27-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Refactor btrfs_merge_bio_hook This function really checks whether adding more data to the bio will straddle a stripe/chunk. So first let's give it a more appropraite name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never used to just remove it. Thirdly, pages are submitted to either btree or data inodes so it's guaranteed that tree->ops is set so replace the check with an ASSERT. Finally, document the parameters of the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de37aa51 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fsid/metadata_fsid fields from btrfs_info Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7239ff4b |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Introduce support for FSID change without metadata rewrite This field is going to be used when the user wants to change the UUID of the filesystem without having to rewrite all metadata blocks. This field adds another level of indirection such that when the FSID is changed what really happens is the current UUID (the one with which the fs was created) is copied to the 'metadata_uuid' field in the superblock as well as a new incompat flag is set METADATA_UUID. When the kernel detects this flag is set it knows that the superblock in fact has 2 UUIDs: 1. Is the UUID which is user-visible, currently known as FSID. 2. Metadata UUID - this is the UUID which is stamped into all on-disk datastructures belonging to this file system. When the new incompat flag is present device scanning checks whether both fsid/metadata_uuid of the scanned device match any of the registered filesystems. When the flag is not set then both UUIDs are equal and only the FSID is retained on disk, metadata_uuid is set only in-memory during mount. Additionally a new metadata_uuid field is also added to the fs_info struct. It's initialised either with the FSID in case METADATA_UUID incompat flag is not set or with the metdata_uuid of the superblock otherwise. This commit introduces the new fields as well as the new incompat flag and switches all users of the fsid to the new logic. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates in comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f8f591df |
|
19-Nov-2018 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: introduce EXPORT_FOR_TESTS macro Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS functions are either local to the file they are implemented in and thus should be declared static or are called from within the test implementation defined in a different file. Introduce an EXPORT_FOR_TESTS macro which depending on CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a function or not. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cd24c69 |
|
01-Nov-2018 |
Ethan Lien <ethanlien@synology.com> |
btrfs: use tagged writepage to mitigate livelock of snapshot Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c629732d |
|
08-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered This parameter was never used, yet was part of the interface of the function ever since its introduction as extent_io_ops::writepage_end_io_hook in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Now that NULL is passed everywhere as a value for this parameter let's remove it for good. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eede2bf3 |
|
03-Nov-2016 |
Omar Sandoval <osandov@fb.com> |
Btrfs: prevent ioctls from interfering with a swap file A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file. When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag Don't allow those to happen on an active swap file. Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
abbb55f4 |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::split_extent_hook callback This is the counterpart to merge_extent_hook, similarly, it's used only for data/freespace inodes so let's remove it, rename it and call it directly where necessary. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c848198 |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::merge_extent_hook callback This callback is used only for data and free space inodes. Such inodes are guaranteed to have their extent_io_tree::private_data set to the inode struct. Exploit this fact to directly call the function. Also give it a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a36bb5f9 |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::clear_bit_hook callback This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent), similar to what was done before remove clear_bit_hook and rename the function. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e06a1fc9 |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::set_bit_hook extent_io callback This callback is used to properly account delalloc extents for data inodes (ordinary file inodes and freespace v1 inodes). Those can be easily identified since they have their extent_io trees ->private_data member point to the inode. Let's exploit this fact to remove the needless indirection through extent_io_hooks and directly call the function. Also give the function a name which reflects its purpose - btrfs_set_delalloc_extent. This patch also modified test_find_delalloc so that the extent_io_tree used for testing doesn't have its ->private_data set which would have caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root member not being initialised. The old version of the code also didn't call set_bit_hook since the extent_io ops weren't set for the inode. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7087a9d8 |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::writepage_end_io_hook This callback is ony ever called for data page writeout so there is no need to actually abstract it via extent_io_ops. Lets just export it, remove the definition of the callback and call it directly in the functions that invoke the callback. Also rename the function to btrfs_writepage_endio_finish_ordered since what it really does is account finished io in the ordered extent data structures. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d75855b4 |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::writepage_start_hook This hook is called only from __extent_writepage_io which is already called only from the data page writeout path. So there is no need to make an indirect call via extent_io_ops. This patch just removes the callback definition, exports the callback function and calls it directly at the only call site. Also give the function a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5eaad97a |
|
01-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extent_io_ops::fill_delalloc This callback is called only from writepage_delalloc which in turn is guaranteed to be called from the data page writeout path. In the end there is no reason to have the call to this function to be indrected via the extent_io_ops structure. This patch removes the callback definition, exports the function and calls it directly. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_run_delalloc_range ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4222ea71 |
|
24-Oct-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock on tree root leaf when finding free extent When we are writing out a free space cache, during the transaction commit phase, we can end up in a deadlock which results in a stack trace like the following: schedule+0x28/0x80 btrfs_tree_read_lock+0x8e/0x120 [btrfs] ? finish_wait+0x80/0x80 btrfs_read_lock_root_node+0x2f/0x40 [btrfs] btrfs_search_slot+0xf6/0x9f0 [btrfs] ? evict_refill_and_join+0xd0/0xd0 [btrfs] ? inode_insert5+0x119/0x190 btrfs_lookup_inode+0x3a/0xc0 [btrfs] ? kmem_cache_alloc+0x166/0x1d0 btrfs_iget+0x113/0x690 [btrfs] __lookup_free_space_inode+0xd8/0x150 [btrfs] lookup_free_space_inode+0x5b/0xb0 [btrfs] load_free_space_cache+0x7c/0x170 [btrfs] ? cache_block_group+0x72/0x3b0 [btrfs] cache_block_group+0x1b3/0x3b0 [btrfs] ? finish_wait+0x80/0x80 find_free_extent+0x799/0x1010 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] __btrfs_cow_block+0x11d/0x500 [btrfs] btrfs_cow_block+0xdc/0x180 [btrfs] btrfs_search_slot+0x3bd/0x9f0 [btrfs] btrfs_lookup_inode+0x3a/0xc0 [btrfs] ? kmem_cache_alloc+0x166/0x1d0 btrfs_update_inode_item+0x46/0x100 [btrfs] cache_save_setup+0xe4/0x3a0 [btrfs] btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] btrfs_commit_transaction+0xcb/0x8b0 [btrfs] At cache_save_setup() we need to update the inode item of a block group's cache which is located in the tree root (fs_info->tree_root), which means that it may result in COWing a leaf from that tree. If that happens we need to find a free metadata extent and while looking for one, if we find a block group which was not cached yet we attempt to load its cache by calling cache_block_group(). However this function will try to load the inode of the free space cache, which requires finding the matching inode item in the tree root - if that inode item is located in the same leaf as the inode item of the space cache we are updating at cache_save_setup(), we end up in a deadlock, since we try to obtain a read lock on the same extent buffer that we previously write locked. So fix this by using the tree root's commit root when searching for a block group's free space cache inode item when we are attempting to load a free space cache. This is safe since block groups once loaded stay in memory forever, as well as their caches, so after they are first loaded we will never need to read their inode items again. For new block groups, once they are created they get their ->cached field set to BTRFS_CACHE_FINISHED meaning we will not need to read their inode item. Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/ Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42ec3d4c |
|
29-Oct-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: make remap_file_range functions take and return bytes completed Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2e5dfc99 |
|
29-Oct-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: combine the clone and dedupe into a single remap_file_range Combine the clone_file_range and dedupe_file_range operations into a single remap_file_range file operation dispatch since they're fundamentally the same operation. The differences between the two can be made in the prep functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
7c861627 |
|
10-Oct-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: remove fs_info from btrfs_should_throttle_delayed_refs The avg_delayed_ref_runtime can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
af9b8a0e |
|
10-Oct-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: remove fs_info from btrfs_check_space_for_delayed_refs It can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
52398340 |
|
21-Aug-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: kill btrfs_clear_path_blocking Btrfs's btree locking has two modes, spinning mode and blocking mode, while searching btree, locking is always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking(). When acquiring locks, both of reader and writer need to wait for blocking readers and writers to complete before doing read_lock()/write_lock(). The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode at first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual clearing blocking job. When switching to blocking mode from spinning mode, it consists of step 1) bumping up blocking readers counter and step 2) read_unlock()/write_unlock(), this has caused serious ping-pong effect if there're a great amount of concurrent readers/writers, as waiters will be woken up and go to sleep immediately. 1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script, MNT=/mnt/btrfs mkfs.btrfs -f /dev/sdf mount /dev/def $MNT time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \ -d $MNT/0 -d $MNT/1 \ -d $MNT/2 -d $MNT/3 \ -d $MNT/4 -d $MNT/5 \ -d $MNT/6 -d $MNT/7 \ -d $MNT/8 -d $MNT/9 \ -d $MNT/10 -d $MNT/11 \ -d $MNT/12 -d $MNT/13 \ -d $MNT/14 -d $MNT/15 w/o patch: real 2m27.307s user 0m12.839s sys 13m42.831s w/ patch: real 1m2.273s user 0m15.802s sys 8m16.495s 1.1) latency histogram from funclatency[1] Overall with the patch, there're ~50% less write lock acquisition and the 95% max latency that write lock takes also reduces to ~100ms from >500ms. -------------------------------------------- w/o patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 2385222 |****************************************| 2 -> 3 : 37147 | | 4 -> 7 : 20452 | | 8 -> 15 : 13131 | | 16 -> 31 : 3877 | | 32 -> 63 : 3900 | | 64 -> 127 : 2612 | | 128 -> 255 : 974 | | 256 -> 511 : 165 | | 512 -> 1023 : 13 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6743860 |****************************************| 2 -> 3 : 2146 | | 4 -> 7 : 190 | | 8 -> 15 : 38 | | 16 -> 31 : 4 | | -------------------------------------------- w/ patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 1318454 |****************************************| 2 -> 3 : 6800 | | 4 -> 7 : 3664 | | 8 -> 15 : 2145 | | 16 -> 31 : 809 | | 32 -> 63 : 219 | | 64 -> 127 : 10 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6854317 |****************************************| 2 -> 3 : 2383 | | 4 -> 7 : 601 | | 8 -> 15 : 92 | | 2) dbench also proves the improvement, dbench -t 120 -D /mnt/btrfs 16 w/o patch: Throughput 158.363 MB/sec w/ patch: Throughput 449.52 MB/sec 3) xfstests didn't show any additional failures. One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode. [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f8d236a |
|
04-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: dev-replace: move replace members out of fs_info The replace_wait and bio_counter were mistakenly added to fs_info in commit c404e0dc2c843b154f ("Btrfs: fix use-after-free in the finishing procedure of the device replace"), but they logically belong to fs_info::dev_replace. Besides, bio_counter is a very generic name and is confusing in bare fs_info context. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3280f874 |
|
24-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: remove btrfs_dev_replace::read_locks This member seems to be copied from the extent_buffer locking scheme and is at least used to assert that the read lock/unlock is properly nested. In some way. While the _inc/_dec are called inside the read lock section, the asserts are both inside and outside, so the ordering is not guaranteed and we can see read/inc/dec ordered in any way (theoretically). A missing call of btrfs_dev_replace_clear_lock_blocking could cause unexpected read_locks count, so this at least looks like a valid assertion, but this will become unnecessary with later updates. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b2fa1154 |
|
17-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: tests: polish ifdefs around testing helper Avoid the inline ifdefs and use two sections for self-tests enabled and disabled. Though there could be no ifdef and unconditional test_bit of BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out any code that would depend on conditions using btrfs_is_testing. As this is only for the testing code, drop unlikely(). Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a654666a |
|
17-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: tests: group declarations of self-test helpers Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
57ec5fb4 |
|
17-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: tests: move testing members of struct btrfs_root to the end The data used only for tests are better placed at the end of the structure so that they don't change the structure layout. All new members of btrfs_root should be placed before. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c36396c |
|
17-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: tests: add separate stub for find_lock_delalloc_range The helper find_lock_delalloc_range is now conditionally built static, dpending on whether the self-tests are enabled or not. There's a macro that is supposed to hide the export, used only once. To discourage further use, drop it an add a public wrapper for the helper needed by tests. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de2c6615 |
|
24-Aug-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: fix alignment in declaration and prototype of btrfs_get_extent This fixes btrfs_get_extent to be consistent with our existing declaration style. Note: For the record, indentation styles that are accepted are both, aligning under the opening ( and tab or double tab indentation on the next line. Preferrably not spliting the type or long expressions in the argument lists. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4fd786e6 |
|
05-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Remove 'objectid' member from struct btrfs_root There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
684572df |
|
04-Aug-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove root parameter from btrfs_insert_dir_item All callers pass the root tree of dir, we can push that down to the function itself. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3a584174 |
|
04-Aug-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes Using true and false here is closer to the expected semantic than using 0 and 1. No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b6fdfbff |
|
23-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu Commit 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove duplicate code") replaces some open coded RCU string handling with macro. It turns out that btrfs_debug_in_rcu() is used for the first time and the macro lacks lock/unlock of RCU string for non-debug case (i.e. when the message is not printed), leading to suspicious RCU usage warning when CONFIG_PROVE_RCU is on. Fix this by adding a wrapper to call lock/unlock for the non-debug case too. Fixes: 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove duplicate code") Reported-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ecebf4d |
|
05-Aug-2018 |
Robbie Ko <robbieko@synology.com> |
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced nocow writes to fallback to COW, during writeback, when a snapshot is created. This resulted in writes made before creating the snapshot to unexpectedly fail with ENOSPC during writeback when success (0) was returned to user space through the write system call. The steps leading to this problem are: 1. When it's not possible to allocate data space for a write, the buffered write path checks if a NOCOW write is possible. If it is, it will not reserve space and success (0) is returned to user space. 2. Then when a snapshot is created, the root's will_be_snapshotted atomic is incremented and writeback is triggered for all inode's that belong to the root being snapshotted. Incrementing that atomic forces all previous writes to fallback to COW during writeback (running delalloc). 3. This results in the writeback for the inodes to fail and therefore setting the ENOSPC error in their mappings, so that a subsequent fsync on them will report the error to user space. So it's not a completely silent data loss (since fsync will report ENOSPC) but it's a very unexpected and undesirable behaviour, because if a clean shutdown/unmount of the filesystem happens without previous calls to fsync, it is expected to have the data present in the files after mounting the filesystem again. So fix this by adding a new atomic named snapshot_force_cow to the root structure which prevents this behaviour and works the following way: 1. It is incremented when we start to create a snapshot after triggering writeback and before waiting for writeback to finish. 2. This new atomic is now what is used by writeback (running delalloc) to decide whether we need to fallback to COW or not. Because we incremented this new atomic after triggering writeback in the snapshot creation ioctl, we ensure that all buffered writes that happened before snapshot creation will succeed and not fallback to COW (which would make them fail with ENOSPC). 3. The existing atomic, will_be_snapshotted, is kept because it is used to force new buffered writes, that start after we started snapshotting, to reserve data space even when NOCOW is possible. This makes these writes fail early with ENOSPC when there's no available space to allocate, preventing the unexpected behaviour of writeback later failing with ENOSPC due to a fallback to COW mode. Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
85c39548 |
|
25-Jul-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: extent-tree: Remove unused __btrfs_free_block_rsv There is no user of this function anymore. This was forgotten to be removed in commit a575ceeb1338 ("Btrfs: get rid of unused orphan infrastructure"). Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6025c19f |
|
31-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info from btrfs_add_root_ref It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3ee1c553 |
|
31-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info from btrfs_del_root_ref It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab9ce7d4 |
|
31-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info from btrfs_del_root It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2ffad70e |
|
20-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: constify strings passed to assertion helper Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9539cff |
|
20-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: dev-replace: remove unused members of btrfs_dev_replace Lock owner and nesting level have been unused since day 1, probably copy&pasted from the extent_buffer locking scheme without much thinking. The locking of device replace is simpler and does not need any lock nesting. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e17385ca |
|
20-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused member btrfs_root::name Added in 58176a9604c ("Btrfs: Add per-root block accounting and sysfs entries") in 2007, the roots had names exported in sysfs. The code was commented out in 4df27c4d5cc1dda54ed ("Btrfs: change how subvolumes are organized") and cleaned by 182608c8294b5fe9 ("btrfs: remove old unused commented out code"). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5cdc84bf |
|
18-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop extent_io_ops::set_range_writeback callback The data and metadata callback implementation both use the same function. We can remove the call indirection and intermediate helper completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
031f24da |
|
22-May-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Use btrfs_mark_bg_unused to replace open code Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and add a block group to unused_bgs list. No functional modification, and only 3 callers are involved. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dec59fa3 |
|
13-Jul-2018 |
Ethan Lien <ethanlien@synology.com> |
btrfs: use customized batch size for total_bytes_pinned In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by commit the current transaction. For every data/metadata extent we are going to free, we add total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read - just the suitable use of percpu_counter. But in previous commit we update total_bytes_pinned by default 32 batch size, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in a SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t. So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to allocate new chunk, we can use a really large batch size and have nearly no penalty in most cases. [Test] We tested the patch on a 4-cores x86 machine: 1. fallocate a 16GiB size test file 2. take snapshot (so all following writes will be COW) 3. run a 180 sec, 4 jobs, 4K random write fio on test file We also added a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track it by lock_stat. [Results] unpatched: lock_stat version 0.4 ----------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu: 82 82 0.21 0.61 29.46 0.36 298340 635973 0.09 11.01 173476.25 0.27 patched: lock_stat version 0.4 ----------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu: 1 1 0.62 0.62 0.62 0.62 13601 31542 0.14 9.61 11016.90 0.35 [Analysis] Since the spin lock only protects a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both unpatched and patched version are low. But when we see acquisitions and acq-bounces, we get much lower counts in patched version. Here the most important metric is acq-bounces. It means how many times the lock gets transferred between different cpus, so the patch can really reduce cacheline bouncing of spin lock (also the global counter of percpu_counter) in a SMP system. Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba3c2b19 |
|
26-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Add graceful handling of V0 extents Following the removal of the v0 handling code let's be courteous and print an error message when such extents are handled. In the cases where we have a transaction just abort it, otherwise just call btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO. In case the error handling would be too intrusive, leave the BUG_ON in place, like extent_data_ref_count, other proper handling would catch that earlier. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a79865c6 |
|
21-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove V0 extent support The v0 compat code was introduced in commit 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9 years ago, which was merged in 2.6.31. This means that the code is there to support filesystems which are _VERY_ old and if you are using btrfs on such an old kernel, you have much bigger problems. This coupled with the fact that no one is likely testing/maintining this code likely means it has bugs lurking. All things considered I think 43 kernel releases later it's high time this remnant of the past got removed. This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case we have a v0 with no support intact as a sort of safety-net. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e41ca589 |
|
06-Jun-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Get rid of the confusing btrfs_file_extent_inline_len We used to call btrfs_file_extent_inline_len() to get the uncompressed data size of an inlined extent. However this function is hiding evil, for compressed extent, it has no choice but to directly read out ram_bytes from btrfs_file_extent_item. While for uncompressed extent, it uses item size to calculate the real data size, and ignoring ram_bytes completely. In fact, for corrupted ram_bytes, due to above behavior kernel btrfs_print_leaf() can't even print correct ram_bytes to expose the bug. Since we have the tree-checker to verify all EXTENT_DATA, such mismatch can be detected pretty easily, thus we can trust ram_bytes without the evil btrfs_file_extent_inline_len(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43a7e99d |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_force_chunk_alloc It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c83488af |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_inc_block_group_ro It can be referenced from the passed bg cache. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
61da2abf |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_alloc_logged_file_extent It can be referenced from trans since the function is always called within a valid transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
451a2c13 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from check_system_chunk It can be referenced from trans since the function is always called within a transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a98ec01 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_remove_block_group This function is always called with a valid transaction handle from where we can reference fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e7e02096 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_make_block_group This function is always called with a valid transaction handle from where we can reference the fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a944442c |
|
12-Jun-2018 |
Allen Pais <allen.lkml@gmail.com> |
btrfs: replace get_seconds with new 64bit time API The get_seconds() function is deprecated as it truncates the timestamp to 32 bits. Change it to or ktime_get_real_seconds(). Signed-off-by: Allen Pais <allen.lkml@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
87eb5eb2 |
|
06-Jul-2018 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: dedupe: rationalize args Clean up f_op->dedupe_file_range() interface. 1) Use loff_t for offsets and length instead of u64 2) Order the arguments the same way as {copy|clone}_file_range(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
5740c99e |
|
06-Jul-2018 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: dedupe: return int Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
a528a241 |
|
06-Jun-2018 |
Souptick Joarder <jrdr.linux@gmail.com> |
btrfs: change return type of btrfs_page_mkwrite to vm_fault_t Use the new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Reference commit 1c8f422059ae ("mm: change return type to vm_fault_t") vmf_error() is the newly introduced inline function in 4.17-rc6. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4c129db |
|
29-May-2018 |
Gu JinXiang <gujx@cn.fujitsu.com> |
btrfs: drop unused parameter qgroup_reserved Since commit 7775c8184ec0 ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d1957791 |
|
29-May-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info argument from btrfs_uuid_tree_rem This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ rename the function ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cdb345a8 |
|
29-May-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info argument from btrfs_uuid_tree_add This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a575ceeb |
|
11-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: get rid of unused orphan infrastructure Now that we don't keep long-standing reservations for orphan items, root->orphan_block_rsv isn't used. We can git rid of it, along with: - root->orphan_lock, which was used to protect root->orphan_block_rsv - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations in root->orphan_block_rsv - btrfs_orphan_commit_root(), which was the last user of any of these and does nothing else Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b6a221e |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_update_iflags to reflect which flags it touches The btrfs inode flag flavour is now simply called 'inode flags' and the vfs inode are i_flags. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
20a68004 |
|
27-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Unexport and rename btrfs_invalidate_inodes This function is no longer used outside of inode.c so just make it static. At the same time give a more becoming name, since it's not really invalidating the inodes but just calling d_prune_alias. Last, but not least - move the function above the sole caller to avoid introducing yet-another-pointless forward declaration. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
110a21fe |
|
26-Feb-2018 |
David Sterba <dsterba@suse.com> |
btrfs: introduce conditional wakeup helpers Add convenience wrappers for the waitqueue management that involves memory barriers to prevent deadlocks. The helpers will let us remove barriers and the necessary comments in several places. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4457c1c7 |
|
10-May-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info argument from add_new_free_space This function also takes a btrfs_block_group_cache which contains a referene to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3a2f8c07 |
|
24-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Unexport btrfs_alloc_delalloc_work It's used only in inode.c so makes no sense to have it exported. Also move the definition of btrfs_delalloc_work to inode.c since it's used only this file. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
076da91c |
|
23-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove delayed_iput member from btrfs_delalloc_work When allocating a delalloc work we are always setting the delayed_iput to 0. So remove the delay_iput member of btrfs_delalloc_work, as a result also remove it as a parameter from btrfs_alloc_delalloc_work since it's not used anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
76f32e24 |
|
23-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes It's always set to 0, so just remove it and collapse the constant value to the only function we are passing it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82b3e53b |
|
23-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots This parameter was introduced alongside the function in eb73c1b7cea7 ("Btrfs: introduce per-subvolume delalloc inode list") to avoid deadlocks since this function was used in the transaction commit path. However, commit 8d875f95da43 ("btrfs: disable strict file flushes for renames and truncates") removed that usage, rendering the parameter obsolete. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
008ef096 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop lock parameter from update_ioctl_balance_args and rename The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3009a62f |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: track running balance in a simpler way Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dccdb07b |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: kill btrfs_fs_info::volume_mutex Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot hapen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
40012f96 |
|
19-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove btrfs_wait_and_free_delalloc_work This function is called from only 1 place and is effectively a wrapper over wait_completion/kfree. It doesn't really bring any value having those two calls in a separate function. Just open code it and remove it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f60a2364 |
|
17-Apr-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() Factor out the second half of btrfs_ioctl_snap_destroy() as btrfs_delete_subvolume(), which performs some subvolume specific checks before deletion: 1. send is not in progress 2. the subvolume is not the default subvolume 3. the subvolume does not contain other subvolumes and actual deletion process. btrfs_delete_subvolume() requires inode_lock for both @dir and inode of @dentry. The remaining part of btrfs_ioctl_snap_destroy() is mainly permission checks. Note that call of d_delete() is not included in btrfs_delete_subvolume() as this function will also be used by btrfs_rmdir() to delete an empty subvolume and in that case d_delete() is called in VFS layer. As a result, btrfs_unlink_subvol() and may_destroy_subvol() become static functions. No functional changes. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ec42f167 |
|
17-Apr-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Move may_destroy_subvol() from ioctl.c to inode.c This is a preparation work to refactor btrfs_ioctl_snap_destroy() and to allow rmdir(2) to delete an empty subvolume. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor update of the function comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c065f5b1 |
|
02-Apr-2018 |
Su Yue <suy.fnst@cn.fujitsu.com> |
btrfs: rename btrfs_get_block_group_info and make it static The function btrfs_get_block_group_info() was introduced by the commit 5af3e8cce8b7 ("Btrfs: make filesystem read-only when submitting barrier fails") which used it in disk-io.c. However, the function is only called in ioctl.c now. Its parameter type btrfs_ioctl_space_info* is only for ioctl. So, make it static and rename it to be original name get_block_group_info. No functional change. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b877331 |
|
26-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Split btrfs_del_delalloc_inode into 2 functions This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ff6bc37e |
|
20-Dec-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Use independent and accurate per inode qgroup rsv Unlike reservation calculation used in inode rsv for metadata, qgroup doesn't really need to care about things like csum size or extent usage for the whole tree COW. Qgroups care more about net change of the extent usage. That's to say, if we're going to insert one file extent, it will mostly find its place in COWed tree block, leaving no change in extent usage. Or causing a leaf split, resulting in one new net extent and increasing qgroup number by nodesize. Or in an even more rare case, increase the tree level, increasing qgroup number by 2 * nodesize. So here instead of using the complicated calculation for extent allocator, which cares more about accuracy and no error, qgroup doesn't need that over-estimated reservation. This patch will maintain 2 new members in btrfs_block_rsv structure for qgroup, using much smaller calculation for qgroup rsv, reducing false EDQUOT. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com>
|
#
a514d638 |
|
22-Dec-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT Unlike previous method that tries to commit transaction inside qgroup_reserve(), this time we will try to commit transaction using fs_info->transaction_kthread to avoid nested transaction and no need to worry about locking context. Since it's an asynchronous function call and we won't wait for transaction commit, unlike previous method, we must call it before we hit the qgroup limit. So this patch will use the ratio and size of qgroup meta_pertrans reservation as indicator to check if we should trigger a transaction commit. (meta_prealloc won't be cleaned in transaction committ, it's useless anyway) Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9888c340 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- headers Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Unify the include protection macros to match the file names. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8287475a |
|
12-Dec-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space For quota disabled->enable case, it's possible that at reservation time quota was not enabled so no bytes were really reserved, while at release time, quota was enabled so we will try to release some bytes we didn't really own. Such situation can cause metadata reserveation underflow, for both types, also less possible for per-trans type since quota enable will commit transaction. To address this, record qgroup meta reserved bytes into root::qgroup_meta_rsv_pertrans and ::prealloc. So at releasing time we won't free any bytes we didn't reserve. For DATA, it's already handled by io_tree, so nothing needs to be done there. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43b18595 |
|
12-Dec-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Use separate meta reservation type for delalloc Before this patch, btrfs qgroup is mixing per-transcation meta rsv with preallocated meta rsv, making it quite easy to underflow qgroup meta reservation. Since we have the new qgroup meta rsv types, apply it to delalloc reservation. Now for delalloc, most of its reserved space will use META_PREALLOC qgroup rsv type. And for callers reducing outstanding extent like btrfs_finish_ordered_io(), they will convert corresponding META_PREALLOC reservation to META_PERTRANS. This is mainly due to the fact that current qgroup numbers will only be updated in btrfs_commit_transaction(), that's to say if we don't keep such placeholder reservation, we can exceed qgroup limitation. And for callers freeing outstanding extent in error handler, we will just free META_PREALLOC bytes. This behavior makes callers of btrfs_qgroup_release_meta() or btrfs_qgroup_convert_meta() to be aware of which type they are. So in this patch, btrfs_delalloc_release_metadata() and its callers get an extra parameter to info qgroup to do correct meta convert/release. The good news is, even we use the wrong type (convert or free), it won't cause obvious bug, as prealloc type is always in good shape, and the type only affects how per-trans meta is increased or not. So the worst case will be at most metadata limitation can be sometimes exceeded (no convert at all) or metadata limitation is reached too soon (no free at all). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e1211d0e |
|
12-Dec-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup Since qgroup has seperate metadata reservation types now, we can completely get rid of the old root->qgroup_meta_rsv, which mostly acts as current META_PERTRANS reservation type. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4408ea7c |
|
20-Mar-2018 |
Misono, Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: ctree.h: Fix wrong comment position about csum size Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
75cb379d |
|
20-Mar-2018 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: defer adding raid type kobject until after chunk relocation Any time the first block group of a new type is created, we add a new kobject to sysfs to hold the attributes for that type. Kobject-internal allocations always use GFP_KERNEL, making them prone to fs-reclaim races. While it appears as if this can occur any time a block group is created, the only times the first block group of a new type can be created in memory is at mount and when we create the first new block group during raid conversion. This patch adds a new list to track pending kobject additions and then handles them after we do chunk relocation. Between relocating the target chunk (or forcing allocation of a new chunk in the case of data) and removing the old chunk, we're in a safe place for fs-reclaim to occur. We're holding the volume mutex, which is already held across page faults, and the delete_unused_bgs_mutex, which will only stall the cleaner thread. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dc2d3005 |
|
20-Mar-2018 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: remove dead create_space_info calls Since commit 2be12ef79 (btrfs: Separate space_info create/update), we've separated out the creation and updating of the space info structures. That commit was a straightforward refactoring of the two parts of update_space_info, but we can go a step further. Since commits c59021f84 (Btrfs: fix OOPS of empty filesystem after balance) and b742bb82f (Btrfs: Link block groups of different raid types), we know that the space_info structures will be created at mount and there will only ever be, at most, three of them. This patch cleans out the create_space_info calls after __find_space_info returns NULL since __find_space_info *can't* return NULL. The initial cause for reviewing this was the kobject_add calls from create_space_info occuring in sites where fs-reclaim wasn't allowed. Now we are certain they occur only early in the mount process and are safe. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ead2dd0 |
|
15-Mar-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Drop fs_info parameter from btrfs_finish_extent_commit It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c79a70b1 |
|
15-Mar-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: drop fs_info parameter from btrfs_run_delayed_refs It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92e2f7e3 |
|
05-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove btrfs_fs_info::open_ioctl_trans Since userspace transaction have been removed we no longer have use for this field so delete it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
859e682d |
|
05-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove btrfs_file_private::trans Now that the userspace transaction IOCTL have been removed, this member is no longer used so just remove it Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a5a07a8 |
|
05-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove userspace transaction ioctls Commit 3558d4f88ec8 ("btrfs: Deprecate userspace transaction ioctls") marked the beginning of the end of userspace transaction. This commit finishes the job! There are no known users and ceph does not use the ioctl anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Sage Weil <sage@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c829b72 |
|
07-Mar-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add define for oldest generation Some functions can filter metadata by the generation. Add a define that will annotate such arguments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
738c93d4 |
|
27-Feb-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move btrfs_listxattr prototype to xattr.h There's a proper header for xattr handlers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d612ac59 |
|
26-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: unify types for metadata_ratio and data_chunk_allocations We have btrfs_fs_info::data_chunk_allocations and btrfs_fs_info::metadata_ratio declared as unsigned which would be unsinged int and kernel style prefers unsigned int over bare unsigned. So this patch changes them to u32. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e67c718b |
|
19-Feb-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add more __cold annotations The __cold functions are placed to a special section, as they're expected to be called rarely. This could help i-cache prefetches or help compiler to decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (that's also added with __exit). That way the following function categories are tagged: - printf wrappers, error messages - exit helpers Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9678c543 |
|
08-Jan-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove custom crc32c init code The custom crc32 init code was introduced in 14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to enable using btrfs as a built-in. However, later as pointed out by 60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this wasn't enough and finally btrfs was switched to late_initcall which comes after the generic crc32c implementation is initiliased. The latter commit superseeded the former. Now that we don't have to maintain our own code let's just remove it and switch to using the generic implementation. Despite touching a lot of files the patch is really simple. Here is the gist of the changes: 1. Select LIBCRC32C rather than the low-level modules. 2. s/btrfs_crc32c/crc32c/g 3. replace hash.h with linux/crc32c.h 4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly. I've tested this with btrfs being both a module and a built-in and xfstest doesn't complain. Does seem to fix the longstanding problem of not automatically selectiong the crc32c module when btrfs is used. Possibly there is a workaround in dracut. The modinfo confirms that now all the module dependencies are there: before: depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate after: depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add more info to changelog from mails ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3e72ee88 |
|
30-Jan-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index() Function __get_raid_index() is used to convert block group flags into raid index, which can be used to get various info directly from btrfs_raid_array[]. Refactor this function a little: 1) Rename to btrfs_bg_flags_to_raid_index() Double underline prefix is normally for internal functions, while the function is used by both extent-tree and volumes. Although the name is a little longer, but it should explain its usage quite well. 2) Move it to volumes.h and make it static inline Just several if-else branches, really no need to define it as a normal function. This also makes later code re-use between kernel and btrfs-progs easier. 3) Remove function get_block_group_index() Really no need to do such a simple thing as an exported function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5d23515b |
|
31-Jan-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable Currently btrfs_run_qgroups is doing a bit too much. Not only is it responsible for synchronizing in-memory state of qgroups to disk but it also contains code to trigger the initial qgroup rescan when quota is enabled initially. This condition is detected by checking that BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set. Nothing really requires from the code to be structured (and scattered) the way it is so let's streamline things. First move the quota rescan code into btrfs_quota_enable, where its invocation is closer to the use. This also makes the FS_QUOTA_ENABLING flag redundant so let's remove it as well. This has been tested with a full xfstest run with qgroups enabled on the scratch device of every xfstest and no regressions were observed. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3740608 |
|
13-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: manage commit mount option as %u As the commit mount option is unsigned so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7b885be |
|
13-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: manage thread_pool mount option as %u The mount option thread_pool is always unsigned. Manage it that way all around. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
21217054 |
|
07-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Don't pass fs_info arg to btrfs_start_dirty_block_groups It can be referenced from the passed transaction so no point in passing it as a function argument. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6c686b35 |
|
07-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info argument from btrfs_create_pending_block_groups It can be referenced from the passed transaciton so no point in passing it as function argument. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e34693f |
|
07-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_trans_release_metadata private to transaction.c This function is only ever used in __btrfs_end_transaction and btrfs_commit_transaction so there is no need to export it via header. Let's move it closer to where it's used, make it static and remove it from the header. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1f250e92 |
|
28-Feb-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix log replay failure after unlink and link combination If we have a file with 2 (or more) hard links in the same directory, remove one of the hard links, create a new file (or link an existing file) in the same directory with the name of the removed hard link, and then finally fsync the new file, we end up with a log that fails to replay, causing a mount failure. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ touch /mnt/testdir/foo $ ln /mnt/testdir/foo /mnt/testdir/bar $ sync $ unlink /mnt/testdir/bar $ touch /mnt/testdir/bar $ xfs_io -c "fsync" /mnt/testdir/bar <power failure> $ mount /dev/sdb /mnt mount: mount(2) failed: /mnt: No such file or directory When replaying the log, for that example, we also see the following in dmesg/syslog: [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257 [71813.674204] ------------[ cut here ]------------ [71813.675694] BTRFS: Transaction aborted (error -2) [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs] [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs] [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1 [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs] [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286 [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001 [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001 [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8 [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101 [71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 [71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0 [71813.679669] Call Trace: [71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs] [71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs] [71813.679669] add_inode_ref+0x71e/0x851 [btrfs] [71813.679669] ? __lock_is_held+0x39/0x71 [71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs] [71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs] [71813.679669] ? rcu_read_unlock+0x3a/0x57 [71813.679669] ? __lock_is_held+0x39/0x71 [71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs] [71813.679669] walk_log_tree+0xad/0x188 [btrfs] [71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs] [71813.679669] ? replay_one_extent+0x544/0x544 [btrfs] [71813.679669] open_ctree+0x1cf6/0x2209 [btrfs] [71813.679669] btrfs_mount_root+0x368/0x482 [btrfs] [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6 [71813.679669] ? __lockdep_init_map+0x176/0x1c2 [71813.679669] ? mount_fs+0x64/0x10b [71813.679669] mount_fs+0x64/0x10b [71813.679669] vfs_kern_mount+0x68/0xce [71813.679669] btrfs_mount+0x13e/0x772 [btrfs] [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6 [71813.679669] ? __lockdep_init_map+0x176/0x1c2 [71813.679669] ? mount_fs+0x64/0x10b [71813.679669] mount_fs+0x64/0x10b [71813.679669] vfs_kern_mount+0x68/0xce [71813.679669] do_mount+0x6e5/0x973 [71813.679669] ? memdup_user+0x3e/0x5c [71813.679669] SyS_mount+0x72/0x98 [71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b [71813.679669] RIP: 0033:0x7f7cedf150ba [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206 [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c [71813.679669] ---[ end trace 83bd473fc5b4663b ]--- [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree) [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30 [71814.128078] BTRFS error (device dm-0): open_ctree failed This happens because the log has inode reference items for both inode 258 (the first file we created) and inode 259 (the second file created), and when processing the reference item for inode 258, we replace the corresponding item in the subvolume tree (which has two names, "foo" and "bar") witht he one in the log (which only has one name, "foo") without removing the corresponding dir index keys from the parent directory. Later, when processing the inode reference item for inode 259, which has a name of "bar" associated to it, we notice that dir index entries exist for that name and for a different inode, so we attempt to unlink that name, which fails because the inode reference item for inode 258 no longer has the name "bar" associated to it, making a call to btrfs_unlink_inode() fail with a -ENOENT error. Fix this by unlinking all the names in an inode reference item from a subvolume tree that are not present in the inode reference item found in the log tree, before overwriting it with the item from the log tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a8fd1f71 |
|
15-Feb-2018 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: use kvzalloc to allocate btrfs_fs_info The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On kernels built with NR_CPUS=8192, this can result in kmalloc failures that prevent mounting. There is work in progress to try to resolve this for every user of srcu_struct but using kvzalloc will work around the failures until that is complete. As an example with NR_CPUS=512 on x86_64: the overall size of subvol_srcu is 3460 bytes, fs_info is 6496. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c04e61b5 |
|
05-Jan-2018 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: move extent map specific code to extent_map.c These helpers are extent map specific, move them to extent_map.c. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b4df058 |
|
05-Jan-2018 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: add helper for em merge logic This is a prepare work for the following extent map selftest, which runs tests against em merge logic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
203e02d9 |
|
22-Dec-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove unused wait in btrfs_stripe_hash In fact nobody is waiting on @wait's waitqueue, it can be safely removed. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bae15d95 |
|
07-Nov-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: Cleanup existing name_len checks Since tree-checker has verified leaf when reading from disk, we don't need the existing verify_dir_item() or btrfs_is_name_len_valid() checks. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5c29bd9 |
|
02-Nov-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: add __init macro to btrfs init functions Adding __init macro gives kernel a hint that this function is only used during the initialization phase and its memory resources can be freed up after. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1751e8a6 |
|
27-Nov-2017 |
Linus Torvalds <torvalds@linux-foundation.org> |
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e3b8a485 |
|
03-Nov-2017 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix reported number of inode blocks after buffered append writes The patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks") introduced a regression where if we do a buffered write starting at position equal to or greater than the file's size and then stat(2) the file before writeback is triggered, the number of used blocks does not change (unless there's a prealloc/unwritten extent). Example: $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar $ du -h foobar 0 foobar $ sync $ du -h foobar 64K foobar The first version of that patch didn't had this regression and the second version, which was the one committed, was made only to address some performance regression detected by the intel test robots using fs_mark. This fixes the regression by setting the new delaloc bit in the range, and doing it at btrfs_dirty_pages() while setting the regular dealloc bit as well, so that this way we set both bits at once avoiding navigation of the inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most meaninful place, as we should set the new dellaloc bit when if we set the delalloc bit, which happens only if we copied bytes into the pages at __btrfs_buffered_write(). This was making some of LTP's du tests fail, which can be quickly run using a command line like the following: $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt Fixes: a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69fe2d75 |
|
19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make the delalloc block rsv per inode The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand. Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b62f87b |
|
19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
Btrfs: rework outstanding_extents Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents. Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this btrfs_delalloc_reserve_metadata - outstanding_extents = 1 btrfs_set_extent_delalloc - outstanding_extents = 2 btrfs_release_delalloc_extents - outstanding_extents = 1 for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size btrfs_delalloc_reserve_metadata - outstanding_extents = 2 btrfs_set_extent_delalloc btrfs_set_bit_hook - outstanding_extents = 3 btrfs_merge_extent_hook - outstanding_extents = 2 btrfs_delalloc_release_extents - outstanding_extnets = 1 In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with btrfs_add_ordered_extent - outstanding_extents = 2 clear_extent_bit - outstanding_extents = 1 btrfs_remove_ordered_extent - outstanding_extents = 0 This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_release_delalloc_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter. The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat. I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f51d2b59 |
|
15-Sep-2017 |
David Sterba <dsterba@suse.com> |
btrfs: allow to set compression level for zlib Preliminary support for setting compression level for zlib, the following works: $ mount -o compess=zlib # default $ mount -o compess=zlib0 # same $ mount -o compess=zlib9 # level 9, slower sync, less data $ mount -o compess=zlib1 # level 1, faster sync, more data $ mount -o remount,compress=zlib3 # level set by remount The compress-force works the same as compress'. The level is visible in the same format in /proc/mounts. Level set via file property does not work yet. Required patch: "btrfs: prepare for extensions in compression options" Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d4417e22 |
|
16-Oct-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Replace opencoded sizes with their symbolic constants Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h. Let's unifiy the code base to always use the symbolic constants. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd708b81 |
|
29-Sep-2017 |
Josef Bacik <josef@toxicpanda.com> |
Btrfs: add a extent ref verify tool We were having corruption issues that were tied back to problems with the extent tree. In order to track them down I built this tool to try and find the culprit, which was pretty successful. If you compile with this tool on it will live verify every ref update that the fs makes and make sure it is consistent and valid. I've run this through with xfstests and haven't gotten any false positives. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update error messages, add fixup from Dan Carpenter to handle errors of read_tree_block ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
84f7d8e6 |
|
29-Sep-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass root to various extent ref mod functions We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb592373 |
|
29-Sep-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add ref-verify mount option This adds the infrastructure for turning ref verify on and off for a mount, to be used by a later patch. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhnance btrfs_print_mod_info to print if ref-verify is compiled in ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
736cd52e |
|
07-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove nr_async_submits and async_submit_draining Now that we have the combo of flushing twice, which can make sure IO have started since the second flush will wait for page lock which won't be unlocked unless setting page writeback and queuing ordered extents, we don't need %async_submit_draining, %async_delalloc_pages and %nr_async_submits to tell whether the IO has actually started. Moreover, all the flushers in use are followed by functions that wait for ordered extents to complete, so %nr_async_submits, which tracks whether bio's async submit has made progress, doesn't really make sense. However, %async_delalloc_pages is still required by shrink_delalloc() as that function doesn't flush twice in the normal case (just issues a writeback with WB_REASON_FS_FREE_SPACE). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f851689b |
|
07-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove nr_async_bios This was intended to congest higher layers to not send bios, but as 1) the congested bit has been taken by writeback Async bios come from buffered writes and DIO writes. For DIO writes, we want to submit them ASAP, while for buffered writes, writeback uses balance_dirty_pages() to throttle how much dirty pages we can have. 2) and no one is waiting for %nr_async_bios down to zero, Historically, it was introduced along with changes which let checksumming workload spread accross different cpus. And at that time, pdflush was used instead of per-bdi flushing, perhaps pdflush did not have the necessary information for writeback to do throttling. We can safely remove them now. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ additional explanation from mails, removed unused variable 'limit' ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee8c494f |
|
20-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused arguments from btrfs_changed_cb_t btrfs_changed_cb_t represents the signature of the callback being passed to btrfs_compare_trees. Currently there is only one such callback, namely changed_cb in send.c. This function doesn't really uses the first 2 parameters, i.e. the roots. Since there are not other functions implementing the btrfs_changed_cb_t let's remove the unused parameters from the prototype and implementation. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69ad5976 |
|
03-Oct-2017 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: fix overlap of fs_info::flags values Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap, we should change the value. First, BTRFS_FS_EXCL_OP was set to 14. commit 171938e52807 ("btrfs: track exclusive filesystem operation in flags") Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14. commit f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE") As a result, the value 14 overlapped, by accident. This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16, the flags are internal. Fixes: f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE") CC: stable@vger.kernel.org # 4.13+ Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minimize the change, update only BTRFS_FS_EXCL_OP ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2faff79 |
|
30-Aug-2017 |
Misono, Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag Currently, "btrfs quota enable" would fail after "btrfs quota disable" on the first time with syslog output "qgroup_rescan_init failed with -22", but it would succeed on the second time. When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd in btrfs_run_qgroups() (which is called in btrfs_commit_transaction()) because quota_root has already been freed. If "quota enable" is called after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag would be dropped in the btrfs_run_qgroups() since quota_root is not NULL. This leads to the failure of "quota enable" on the first time. BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable" context and is equivalent to whether quota_root is NULL or not. btrfs_run_qgroups() checks whether quota_root is NULL or not in the first place. So, let's remove BTRFS_FS_QUOTA_DISABLING flag. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cd5447e |
|
17-Aug-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: pass fs_info to btrfs_del_root instead of tree_root btrfs_del_roots always uses the tree_root. Let's pass fs_info instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4335958d |
|
18-Aug-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove BUG() in btrfs_extent_inline_ref_size Now that btrfs_get_extent_inline_ref_type() can report if type is a valid one and all callers can gracefully deal with that, we don't need to crash here. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
167ce953 |
|
18-Aug-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: add a helper to retrive extent inline ref type An invalid value of extent inline ref type may be read from a malicious image which may force btrfs to crash. This adds a helper which does sanity check for the ref type, so we can know if it's sane, return he type, otherwise return an error. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minimal tweak const types, causing warnings due to other cleanup patches ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0174484d |
|
27-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs_make_block_group is always called with chunk_objectid set to BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will change anytime soon, so let's remove the argument and decrease the cognitive load when reading the code path. No functional change Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
583b7231 |
|
28-Jul-2017 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: Do not use data_alloc_cluster in ssd mode This patch provides a band aid to improve the 'out of the box' behaviour of btrfs for disks that are detected as being an ssd. In a general purpose mixed workload scenario, the current ssd mode causes overallocation of available raw disk space for data, while leaving behind increasing amounts of unused fragmented free space. This situation leads to early ENOSPC problems which are harming user experience and adoption of btrfs as a general purpose filesystem. This patch modifies the data extent allocation behaviour of the ssd mode to make it behave identical to nossd mode. The metadata behaviour and additional ssd_spread option stay untouched so far. Recommendations for future development are to reconsider the current oversimplified nossd / ssd distinction and the broken detection mechanism based on the rotational attribute in sysfs and provide experienced users with a more flexible way to choose allocator behaviour for data and metadata, optimized for certain use cases, while keeping sane 'out of the box' default settings. The internals of the current btrfs code have more potential than what currently gets exposed to the user to choose from. The SSD story... In the first year of btrfs development, around early 2008, btrfs gained a mount option which enables specific functionality for filesystems on solid state devices. The first occurance of this functionality is in commit e18e4809, labeled "Add mount -o ssd, which includes optimizations for seek free storage". The effect on allocating free space for doing (data) writes is to 'cluster' writes together, writing them out in contiguous space, as opposed to a 'tetris' way of putting all separate writes into any free space fragment that fits (which is what the -o nossd behaviour does). A somewhat simplified explanation of what happens is that, when for example, the 'cluster' size is set to 2MiB, when we do some writes, the data allocator will search for a free space block that is 2MiB big, and put the writes in there. The ssd mode itself might allow a 2MiB cluster to be composed of multiple free space extents with some existing data in between, while the additional ssd_spread mount option kills off this option and requires fully free space. The idea behind this is (commit 536ac8ae): "The [...] clusters make it more likely a given IO will completely overwrite the ssd block, so it doesn't have to do an internal rwm cycle."; ssd block meaning nand erase block. So, effectively this means applying a "locality based algorithm" and trying to outsmart the actual ssd. Since then, various changes have been made to the involved code, but the basic idea is still present, and gets activated whenever the ssd mount option is active. This also happens by default, when the rotational flag as seen at /sys/block/<device>/queue/rotational is set to 0. However, there's a number of problems with this approach. First, what the optimization is trying to do is outsmart the ssd by assuming there is a relation between the physical address space of the block device as seen by btrfs and the actual physical storage of the ssd, and then adjusting data placement. However, since the introduction of the Flash Translation Layer (FTL) which is a part of the internal controller of an ssd, these attempts are futile. The use of good quality FTL in consumer ssd products might have been limited in 2008, but this situation has changed drastically soon after that time. Today, even the flash memory in your automatic cat feeding machine or your grandma's wheelchair has a full featured one. Second, the behaviour as described above results in the filesystem being filled up with badly fragmented free space extents because of relatively small pieces of space that are freed up by deletes, but not selected again as part of a 'cluster'. Since the algorithm prefers allocating a new chunk over going back to tetris mode, the end result is a filesystem in which all raw space is allocated, but which is composed of underutilized chunks with a 'shotgun blast' pattern of fragmented free space. Usually, the next problematic thing that happens is the filesystem wanting to allocate new space for metadata, which causes the filesystem to fail in spectacular ways. Third, the default mount options you get for an ssd ('ssd' mode enabled, 'discard' not enabled), in combination with spreading out writes over the full address space and ignoring freed up space leads to worst case behaviour in providing information to the ssd itself, since it will never learn that all the free space left behind is actually free. There are two ways to let an ssd know previously written data does not have to be preserved, which are sending explicit signals using discard or fstrim, or by simply overwriting the space with new data. The worst case behaviour is the btrfs ssd_spread mount option in combination with not having discard enabled. It has a side effect of minimizing the reuse of free space previously written in. Fourth, the rotational flag in /sys/ does not reliably indicate if the device is a locally attached ssd. For example, iSCSI or NBD displays as non-rotational, while a loop device on an ssd shows up as rotational. The combination of the second and third problem effectively means that despite all the good intentions, the btrfs ssd mode reliably causes the ssd hardware and the filesystem structures and performance to be choked to death. The clickbait version of the title of this story would have been "Btrfs ssd optimizations considered harmful for ssds". The current nossd 'tetris' mode (even still without discard) allows a pattern of overwriting much more previously used space, causing many more implicit discards to happen because of the overwrite information the ssd gets. The actual location in the physical address space, as seen from the point of view of btrfs is irrelevant, because the actual writes to the low level flash are reordered anyway thanks to the FTL. Changes made in the code 1. Make ssd mode data allocation identical to tetris mode, like nossd. 2. Adjust and clean up filesystem mount messages so that we can easily identify if a kernel has this patch applied or not, when providing support to end users. Also, make better use of the *_and_info helpers to only trigger messages on actual state changes. Backporting notes Notes for whoever wants to backport this patch to their 4.9 LTS kernel: * First apply commit 951e7966 "btrfs: drop the nossd flag when remounting with -o ssd", or fixup the differences manually. * The rest of the conflicts are because of the fs_info refactoring. So, for example, instead of using fs_info, it's root->fs_info in extent-tree.c Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23b5ec74 |
|
24-Jul-2017 |
Josef Bacik <jbacik@fb.com> |
btrfs: fix readdir deadlock with pagefault Readdir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Thread A readdir <holding tree lock> dir_emit <page fault> down_read(mmap_sem) Thread B mmap write down_write(mmap_sem) page_mkwrite wait_ordered_extents Process C finish_ordered_extent insert_reserved_file_extent try to lock leaf <hang> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy the deadlock scenario to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3c0bab5 |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove trivial wrapper btrfs_force_ra It's a simple call page_cache_sync_readahead, same arguments in the same order. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
35dc3130 |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: drop ancient page flag mappings There's no PageFsMisc. Added by patch 4881ee5a2e995 in 2008, the flag is not present in current kernels. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ea14b57f |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: fix spelling of snapshotting Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19aee8de |
|
18-Jul-2017 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: btrfs_inherit_iflags() can be static btrfs_new_inode() is the only consumer move it to inode.c, from ioctl.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc3cce23 |
|
08-Mar-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup num_tolerated_disk_barrier_failures As we use per-chunk degradable check, the global num_tolerated_disk_barrier_failures is of no use. We can now remove it. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cbb1f45 |
|
28-Jun-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: struct-funcs, constify readers We have reader helpers for most of the on-disk structures that use an extent_buffer and pointer as offset into the buffer that are read-only. We should mark them as const and, in turn, allow consumers of these interfaces to mark the buffers const as well. No impact on code, but serves as documentation that a buffer is intended not to be modified. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23d1f737 |
|
28-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove unused sectorsize member The sectorsize member of btrfs_block_group_cache is unused. So remove it, this reduces the number of holes in the struct. With patch: /* size: 856, cachelines: 14, members: 40 */ /* sum members: 837, holes: 4, sum holes: 19 */ /* bit holes: 1, sum bit holes: 29 bits */ /* last cacheline: 24 bytes */ Without patch: /* size: 864, cachelines: 14, members: 41 */ /* sum members: 841, holes: 5, sum holes: 23 */ /* bit holes: 1, sum bit holes: 29 bits */ /* last cacheline: 32 bytes */ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c1aab1d |
|
09-Aug-2017 |
Nick Terrell <terrelln@fb.com> |
btrfs: Add zstd support Add zstd compression and decompression support to BtrFS. zstd at its fastest level compresses almost as well as zlib, while offering much faster compression and decompression, approaching lzo speeds. I benchmarked btrfs with zstd compression against no compression, lzo compression, and zlib compression. I benchmarked two scenarios. Copying a set of files to btrfs, and then reading the files. Copying a tarball to btrfs, extracting it to btrfs, and then reading the extracted files. After every operation, I call `sync` and include the sync time. Between every pair of operations I unmount and remount the filesystem to avoid caching. The benchmark files can be found in the upstream zstd source repository under `contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}` [1] [2]. I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. The first compression benchmark is copying 10 copies of the unzipped Silesia corpus [3] into a BtrFS filesystem mounted with `-o compress-force=Method`. The decompression benchmark times how long it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is measured by comparing the output of `df` and `du`. See the benchmark file [1] for details. I benchmarked multiple zstd compression levels, although the patch uses zstd level 1. | Method | Ratio | Compression MB/s | Decompression speed | |---------|-------|------------------|---------------------| | None | 0.99 | 504 | 686 | | lzo | 1.66 | 398 | 442 | | zlib | 2.58 | 65 | 241 | | zstd 1 | 2.57 | 260 | 383 | | zstd 3 | 2.71 | 174 | 408 | | zstd 6 | 2.87 | 70 | 398 | | zstd 9 | 2.92 | 43 | 406 | | zstd 12 | 2.93 | 21 | 408 | | zstd 15 | 3.01 | 11 | 354 | The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it measures the compression ratio, extracts the tar, and deletes the tar. Then it measures the compression ratio again, and `tar`s the extracted files into `/dev/null`. See the benchmark file [2] for details. | Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) | |--------|-----------|---------------|----------|------------|----------| | None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 | | lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 | | zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 | | zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 | [1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh [2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh [3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia [4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz zstd source repository: https://github.com/facebook/zstd Signed-off-by: Nick Terrell <terrelln@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
bc42bda2 |
|
27-Feb-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges [BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
364ecf36 |
|
27-Feb-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e79a3327 |
|
06-Jun-2017 |
Su Yue <suy.fnst@cn.fujitsu.com> |
btrfs: Check name_len with boundary in verify dir_item Originally, verify_dir_item verifies name_len of dir_item with fixed values but not item boundary. If corrupted name_len was not bigger than the fixed value, for example 255, the function will think the dir_item is fine. And then reading beyond boundary will cause crash. Example: 1. Corrupt one dir_item name_len to be 255. 2. Run 'ls -lar /mnt/test/ > /dev/null' dmesg: [ 48.451449] BTRFS info (device vdb1): disk space caching is enabled [ 48.451453] BTRFS info (device vdb1): has skinny extents [ 48.489420] general protection fault: 0000 [#1] SMP [ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq [ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5 [ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 [ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000 [ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs] [ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202 [ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000 [ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368 [ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000 [ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288 [ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20 [ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000 [ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0 [ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 48.490008] Call Trace: [ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs] [ 48.490008] iterate_dir+0x181/0x1b0 [ 48.490008] SyS_getdents+0xa7/0x150 [ 48.490008] ? fillonedir+0x150/0x150 [ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad [ 48.490008] RIP: 0033:0x7fef5032546b [ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e [ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b [ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003 [ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000 [ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38 [ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e [ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98 [ 48.499455] ---[ end trace 321920d8e8339505 ]--- Fix it by adding a parameter @slot and check name_len with item boundary by calling btrfs_is_name_len_valid. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> rev Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19c6dcbf |
|
06-Jun-2017 |
Su Yue <suy.fnst@cn.fujitsu.com> |
btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary Introduce function btrfs_is_name_len_valid. The function compares parameter @name_len with item boundary then returns true if name_len is valid. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7dfb8be1 |
|
16-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Round down values which are written for total_bytes_size We got an internal report about a file system not wanting to mount following 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544 Subtracting the numbers we get a difference of less than a 4kb. Upon closer inspection it became apparent that mkfs actually rounds down the size of the device to a multiple of sector size. However, the same cannot be said for various functions which modify the total size and are called from btrfs_balance as well as when adding a new device. So this patch ensures that values being saved into on-disk data structures are always rounded down to a multiple of sectorsize. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eca152ed |
|
16-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Manually implement device_total_bytes getter/setter The device->total_bytes member needs to always be rounded down to sectorsize so that it corresponds to the value of super->total_bytes. However, there are multiple places where the setter is fed a value which is not rounded which can cause a fs to be unmountable due to the check introduced in 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). This patch implements the getter/setter manually so that in a later patch I can add necessary code to catch offenders. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d0c71b3 |
|
14-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: obsolete and remove mount option alloc_start The mount option alloc_start was used in the past for debugging and stressing the chunk allocator. Not meant to be used by users, so we're not breaking anybody's setup. There was some added complexity handling changes of the value and when it was not same as default. Such code has likely been untested and I think it's better to remove it. This patch kills all use of alloc_start, and by doing that also fixes a bug when alloc_size is set, potentially called from statfs: in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU protection is temporarily dropped so btrfs_account_dev_extents_size can be called and then RCU is locked again! Doing that inside list_for_each_entry_rcu is just asking for trouble, but unlikely to be observed in practice. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fac03c8d |
|
15-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: move fs_info::fs_frozen to the flags We can keep the state among the other fs_info flags, there's no reason why fs_frozen would need to be separate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4b5faeac |
|
27-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use generic slab for for btrfs_transaction Observing the number of slab objects of btrfs_transaction, there's just one active on an almost quiescent filesystem, and the number of objects goes to about ten when sync is in progress. Then the nubmer goes down to 1. This matches the expectations of the transaction lifetime. For such use the separate slab cache is not justified, as we do not reuse objects frequently. For the shortlived transaction, the generic slab (size 512) should be ok. We can optimistically expect that the 512 slabs are not all used (fragmentation) and there are free slots to take when we do the allocation, compared to potentially allocating a whole new page for the separate slab. We'll lose the stats about the object use, which could be added later if we really need them. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
118c701e |
|
22-May-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove __BTRFS_LEAF_DATA_SIZE __BTRFS_LAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the latter subsume the former. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d9ec8c4 |
|
29-May-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface for large blocksizes") refactored btrfs_leaf_data function to take extent_buffer rather than struct btrfs_leaf. However, as it turns out the parameter being passed is never used. Furthermore this function no longer returns the leaf data but rather the offset to it. So rename the function to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_* helpers and turn it into a macro. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ removed () from the macro ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b86826d |
|
17-May-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: cleanup root usage by btrfs_get_alloc_profile There are two places where we don't already know what kind of alloc profile we need before calling btrfs_get_alloc_profile, but we need access to a root everywhere we call it. This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile() and relegates btrfs_system_alloc_profile to a static for use in those two cases. The next patch will eliminate one of those. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c6100a4b |
|
05-May-2017 |
Josef Bacik <josef@toxicpanda.com> |
Btrfs: replace tree->mapping with tree->private_data For extent_io tree's we have carried the address_mapping of the inode around in the io tree in order to pull the inode back out for calling into various tree ops hooks. This works fine when everything that has an extent_io_tree has an inode. But we are going to remove the btree_inode, so we need to change this. Instead just have a generic void * for private data that we can initialize with, and have all the tree ops use that instead. This had a lot of cascading changes but should be relatively straightforward. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor reordering of the callback prototypes ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f29efe29 |
|
11-May-2017 |
Sargun Dhillon <sargun@sargun.me> |
btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE This patch introduces the quota override flag to btrfs_fs_info, and a change to quota limit checking code to temporarily allow for quota to be overridden for processes with CAP_SYS_RESOURCE. It's useful for administrative programs, such as log rotation, that may need to temporarily use more disk space in order to free up a greater amount of overall disk space without yielding more disk space to the rest of userland. Eventually, we may want to add the idea of an operator-specific quota, operator reserved space, or something else to allow for administrative override, but this is perhaps the simplest solution. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com> [ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a5ed45f8 |
|
11-May-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Convert fs_info->free_chunk_space to atomic64_t The ->free_chunk_space variable is used to track the unallocated space and access to it is protected by a spinlock, which is not used for anything else. Make the code a bit self-explanatory by switching the variable to an atomic64_t type and kill the spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70e7af24 |
|
02-Jun-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: fix delalloc accounting leak caused by u32 overflow btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication, which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2). For a nodesize of 16kB, this overflow happens at 16k items. Usually, num_items is a small constant passed to btrfs_start_transaction(), but we also use btrfs_calc_trans_metadata_size() for metadata reservations for extent items in btrfs_delalloc_{reserve,release}_metadata(). In drop_outstanding_extents(), num_items is calculated as inode->reserved_extents - inode->outstanding_extents. The difference between these two counters is usually small, but if many delalloc extents are reserved and then the outstanding extents are merged in btrfs_merge_extent_hook(), the difference can become large enough to overflow in btrfs_calc_trans_metadata_size(). The overflow manifests itself as a leak of a multiple of 4 GB in delalloc_block_rsv and the metadata bytes_may_use counter. This in turn can cause early ENOSPC errors. Additionally, these WARN_ONs in extent-tree.c will be hit when unmounting: WARN_ON(fs_info->delalloc_block_rsv.size > 0); WARN_ON(fs_info->delalloc_block_rsv.reserved > 0); WARN_ON(space_info->bytes_pinned > 0 || space_info->bytes_reserved > 0 || space_info->bytes_may_use > 0); Fix it by casting nodesize to a u64 so that btrfs_calc_trans_metadata_size() does a full 64-bit multiplication. While we're here, do the same in btrfs_calc_trunc_metadata_size(); this can't overflow with any existing uses, but it's better to be safe here than have another hard-to-debug problem later on. Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4e4cbee9 |
|
03-Jun-2017 |
Christoph Hellwig <hch@lst.de> |
block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
9e11ceee |
|
11-Apr-2017 |
Jan Kara <jack@suse.cz> |
btrfs: Convert to separately allocated bdi Allocate struct backing_dev_info separately instead of embedding it inside superblock. This unifies handling of bdi among users. CC: Chris Mason <clm@fb.com> CC: Josef Bacik <jbacik@fb.com> CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
0966a7b1 |
|
13-Apr-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: scrub: Introduce full stripe lock for RAID56 Unlike mirror based profiles, RAID5/6 recovery needs to read out the whole full stripe. And if we don't do proper protection, it can easily cause race condition. Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe() for RAID5/6. Which store a rb_tree of mutexes for full stripes, so scrub callers can use them to lock a full stripe to avoid race. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
171938e5 |
|
28-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: track exclusive filesystem operation in flags There are several operations, usually started from ioctls, that cannot run concurrently. The status is tracked in mutually_exclusive_operation_running as an atomic_t. We can easily track the status as one of the per-filesystem flag bits with same synchronization guarantees. The conversion replaces: * atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...) * atomic_set(..., 0) -> clear_bit(FLAG, ...) Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de47c9d3 |
|
16-Mar-2017 |
Edmund Nadolski <enadolski@suse.com> |
btrfs: replace hardcoded value with SEQ_LAST macro Define the SEQ_LAST macro to replace (u64)-1 in places where said value triggers a special-case ref search behavior. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d48d71aa |
|
02-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove redundant parameter from btree_readahead_hook We can read fs_info from eb. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0700cea7 |
|
03-Mar-2017 |
Elena Reshetova <elena.reshetova@intel.com> |
btrfs: convert btrfs_root.refs from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1e4f4714 |
|
03-Mar-2017 |
Elena Reshetova <elena.reshetova@intel.com> |
btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ce0dcee6 |
|
14-Mar-2017 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: Change qgroup_meta_rsv to 64bit Using an int value is causing qg->reserved to become negative and exclusive -EDQUOT to be reached prematurely. This affects exclusive qgroups only. TEST CASE: DEVICE=/dev/vdb MOUNTPOINT=/mnt SUBVOL=$MOUNTPOINT/tmp umount $SUBVOL umount $MOUNTPOINT mkfs.btrfs -f $DEVICE mount /dev/vdb $MOUNTPOINT btrfs quota enable $MOUNTPOINT btrfs subvol create $SUBVOL umount $MOUNTPOINT mount /dev/vdb $MOUNTPOINT mount -o subvol=tmp $DEVICE $SUBVOL btrfs qgroup limit -e 3G $SUBVOL btrfs quota rescan /mnt -w for i in `seq 1 44000`; do dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1 if [[ $? > 0 ]]; then btrfs qgroup show -pcref $SUBVOL exit 1 fi done Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> [ add reproducer to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
174cd4b1 |
|
02-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
db0a669f |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_add_link take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc4f21b1 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make get_extent_t take btrfs_inode In addition to changing the signature, this patch also switches all the functions which are used as an argument to also take btrfs_inode. Namely those are: btrfs_get_extent and btrfs_get_extent_filemap. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9cdc5124 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
73f2e545 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_orphan_add take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a6d7067 |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_mark_extent_written take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dcdbc059 |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_drop_extent_cache take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6158e1ce |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
691fa059 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: all btrfs_delalloc_release_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f3db423 |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
703b391a |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_orphan_release_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ed7a2a0 |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
04f4f916 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70ddc553 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_is_free_space_inode take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e7611cf |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_insert_dir_item take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
11bac800 |
|
24-Feb-2017 |
Dave Jiang <dave.jiang@intel.com> |
mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
71367b3f |
|
15-Feb-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: use btrfs_debug instead of pr_debug in transaction abort Commit e5d6b12fe14 (Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors) added a pr_debug call to be printed when a transaction is aborted with -EIO instead of WARN. btrfs_debug prints which file system the message is associated with so let's use that instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5e00f193 |
|
15-Feb-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert btrfs_inc_block_group_ro to accept fs_info btrfs_inc_block_group_ro is either passed the extent root or the dev root, but it doesn't do anything with the dev tree. Let's convert to passing an fs_info and using the extent root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b74c03e |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from btrfs_prepare_extent_commit Added but never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7775c818 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from btrfs_subvolume_release_metadata Unused since qgroup refactoring that split data and metadata accounting, the btrfs_qgroup_free helper. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4c3b2dc |
|
30-Jan-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist run_delalloc_nocow has used trans in two places where they don't actually need @trans. For btrfs_lookup_file_extent, we search for file extents without COWing anything, and for btrfs_cross_ref_exist, the only place where we need @trans is deferencing it in order to get running_transaction which we could easily get from the global fs_info. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
310712b2 |
|
18-Jan-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: constify struct btrfs_{,disk_}key wherever possible In a lot of places, it's unclear when it's safe to reuse a struct btrfs_key after it has been passed to a helper function. Constify these arguments wherever possible to make it obvious. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4ec5934e |
|
17-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_unlink_inode take btrfs_inode Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
823bb20a |
|
04-Jan-2017 |
David Sterba <dsterba@suse.com> |
btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE The expression is open-coded in several places, this asks for a wrapper. As we know the MAX_EXTENT fits to u32, we can use the appropirate division helper. This cascades to the result type updates. Compiler is clever enough to use shift instead of integer division, so there's no change in the generated assembly. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a76b5b04 |
|
09-Dec-2016 |
Christoph Hellwig <hch@lst.de> |
fs: try to clone files first in vfs_copy_file_range A clone is a perfectly fine implementation of a file copy, so most file systems just implement the copy that way. Instead of duplicating this logic move it to the VFS. Currently btrfs and XFS implement copies the same way as clones and there is no behavior change for them, cifs only implements clones and grow support for copy_file_range with this patch. NFS implements both, so this will allow copy_file_range to work on servers that only implement CLONE and be lot more efficient on servers that implements CLONE and COPY. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
e5d6b12f |
|
09-Dec-2016 |
Chris Mason <clm@fb.com> |
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors btrfs_transaction_abort() has a WARN() to help us nail down whatever problem lead to the abort. But most of the time, we're aborting for EIO, and the warning just adds noise. Signed-off-by: Chris Mason <clm@fb.com>
|
#
2ff7e61e |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: take an fs_info directly when the root is not used otherwise There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b246afa |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, add fs_info convenience variables In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
27965b6c |
|
16-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da17066c |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: pull node/sector/stripe sizes out of root and into fs_info We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f15376df |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, io_ctl_init The io_ctl->root member was only being used to access root->fs_info. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c28f158e |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: struct reada_control.root -> reada_control.fs_info The root is never used. We substitute extent_root in for the reada_find_extent call, since it's only ever used to obtain the node size. This call site will be changed to use fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6bccf3ab |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: call functions that always use the same root with fs_info instead There are many functions that are always called with the same root argument. Rather than passing the same root every time, we can pass an fs_info pointer instead and have the function get the root pointer itself. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5b4aacef |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: call functions that overwrite their root parameter with fs_info There are 11 functions that accept a root parameter and immediately overwrite it. We can pass those an fs_info pointer instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed0df618 |
|
01-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: store and load values of stripes_min/stripes_max in balance status item The balance status item contains currently known filter values, but the stripes filter was unintentionally not among them. This would mean, that interrupted and automatically restarted balance does not apply the stripe filters. Fixes: dee32d0ac3719ef8d640efaf0884111df444730f CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2230adff |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: delete unused member from superblock __bdev' has never been used since 0b86a832a1f38abec695864ec2eaedc9d2383f1b (2008). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc2e901f |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: reada, sink start parameter to btree_readahead_hook Originally, the eb and start were passed separately in case eb is NULL. Since the readahead has been refactored in 4.6, this is not true anymore and we can get rid of the parameter. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
939659df |
|
07-Nov-2016 |
Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> |
btrfs: add necessary comments about tickets_id Tickets_id's name may result in some misunderstandings, it just indicates the next ticket will be handled and is not stored per ticket. Fixes: ce12965 ("btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress") Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf8cddd3 |
|
27-Oct-2016 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't abuse REQ_OP_* flags for btrfs_map_block btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays it's own flag over the block layer flags. This is just asking for a conflict, so introduce a different map flags enum inside of btrfs instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6675df31 |
|
22-Sep-2016 |
Omar Sandoval <osandov@fb.com> |
Btrfs: catch invalid free space trees There are two separate issues that can lead to corrupted free space trees. 1. The free space tree bitmaps had an endianness issue on big-endian systems which is fixed by an earlier patch in this series. 2. btrfs-progs before v4.7.3 modified filesystems without updating the free space tree. To catch both of these issues at once, we need to force the free space tree to be rebuilt. To do so, add a FREE_SPACE_TREE_VALID compat_ro bit. If the bit isn't set, we know that it was either produced by a broken big-endian kernel or may have been corrupted by btrfs-progs. This also provides us with a way to add rudimentary read-write support for the free space tree to btrfs-progs: it can just clear this bit and have the kernel rebuild the free space tree. Cc: stable@vger.kernel.org # 4.5+ Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fd57fcb |
|
23-Sep-2016 |
Arnd Bergmann <arnd@arndb.de> |
btrfs: fix btrfs_no_printk stub helper The addition of btrfs_no_printk() caused a build failure when CONFIG_PRINTK is disabled: fs/btrfs/send.c: In function 'send_rename': fs/btrfs/ctree.h:3367:2: error: implicit declaration of function 'btrfs_no_printk' [-Werror=implicit-function-declaration] This moves the helper outside of that #ifdef so it is always defined, and changes the existing #ifdef to refer to that helper as well for consistency. Fixes: 47c57058ff2c ("btrfs: btrfs_debug should consume fs_info when DEBUG is not defined") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
851cd173 |
|
23-Sep-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: memset to avoid stale content in btree leaf This is an additional patch to "Btrfs: memset to avoid stale content in btree node block". This uses memset to initialize the unused space in a leaf to avoid potential stale content, which may be incurred by pushing items between sibling leaves. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c01f5f96 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_debug should consume fs_info when DEBUG is not defined We can hit unused variable warnings when btrfs_debug and friends are just aliases for no_printk. This is due to the fs_info not getting consumed by the function call, which can happen if convenenience variables are used. This patch adds a new btrfs_no_printk static inline that consumes the convenience variable and does nothing else. It silences the unused variable warning and has no impact on the generated code: $ size fs/btrfs/extent_io.o* text data bss dec hex filename 44072 152 32 44256 ace0 fs/btrfs/extent_io.o.btrfs_no_printk 44072 152 32 44256 ace0 fs/btrfs/extent_io.o.no_printk Fixes: 27a0dd61a5 (Btrfs: make btrfs_debug match pr_debug handling related to DEBUG) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab8d0fc4 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert pr_* to btrfs_* where possible For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62e85577 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert printk(KERN_* to use pr_* calls This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
afcdd129 |
|
02-Sep-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add a flags field to btrfs_fs_info We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba8b04c1 |
|
19-Jul-2016 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc() parameters for both in-band dedupe and subpage sector size patchset. This should reduce conflict of both patchset and the effort to rebase them. Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
897a41b1 |
|
31-Aug-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: add dynamic debug support We can re-use the dynamic debugging descriptor to make use of the dynamic debugging mechanism but still use our own printk interface. Defining the DEBUG macro works as it did before. When it's defined, all of the messages default to print. We can also enable all debug messages at boot or module-load time using the 'dyndbg' and 'btrfs.dyndbg' options. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f0312210 |
|
15-Sep-2016 |
Miklos Szeredi <mszeredi@redhat.com> |
btrfs: use filemap_check_errors() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Cc: Chris Mason <clm@fb.com>
|
#
ce129655 |
|
01-Sep-2016 |
Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> |
btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress In btrfs_async_reclaim_metadata_space(), we use ticket's address to determine whether asynchronous metadata reclaim work is making progress. ticket = list_first_entry(&space_info->tickets, struct reserve_ticket, list); if (last_ticket == ticket) { flush_state++; } else { last_ticket = ticket; flush_state = FLUSH_DELAYED_ITEMS_NR; if (commit_cycles) commit_cycles--; } But indeed it's wrong, we should not rely on local variable's address to do this check, because addresses may be same. In my test environment, I dd one 168MB file in a 256MB fs, found that for this file, every time wait_reserve_ticket() called, local variable ticket's address is same, For above codes, assume a previous ticket's address is addrA, last_ticket is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and wake up it, then another ticket is added, but with the same address addrA, now last_ticket will be same to current ticket, then current ticket's flush work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR, which may result in some enospc issues(I have seen this in my test machine). Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9e7cc91a |
|
31-Jul-2016 |
Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> |
btrfs: fix fsfreeze hang caused by delayed iputs deal When running fstests generic/068, sometimes we got below deadlock: xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080 ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000 ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001 ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8 Call Trace: [<ffffffff816a9045>] schedule+0x35/0x80 [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140 [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100 [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30 [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] [<ffffffff810d32b5>] percpu_down_read+0x35/0x50 [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40 [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs] [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs] [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] [<ffffffff81230a1a>] evict+0xba/0x1a0 [<ffffffff812316b6>] iput+0x196/0x200 [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs] [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs] [<ffffffff81218040>] freeze_super+0xf0/0x190 [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0 [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140 [<ffffffff81229409>] SyS_ioctl+0x79/0x90 [<ffffffff81003c12>] do_syscall_64+0x62/0x110 [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25 >From this warning, freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again, if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll start_transaction(), which will try to get SB_FREEZE_FS lock again, then deadlock occurs. The root cause is that in btrfs, sync_filesystem(sb) does not make sure all metadata is updated. There still maybe some codes adding delayed iputs, see below sample race window: CPU1 | CPU2 |-> freeze_super() | |-> sync_filesystem(sb); | | |-> cleaner_kthread() | | |-> btrfs_delete_unused_bgs() | | |-> btrfs_remove_chunk() | | |-> btrfs_remove_block_group() | | |-> btrfs_add_delayed_iput() | | |-> sb->s_writers.frozen = SB_FREEZE_FS; | |-> sb_wait_write(sb, SB_FREEZE_FS); | | acquire SB_FREEZE_FS lock. | | | |-> btrfs_freeze() | |-> btrfs_commit_transaction() | |-> btrfs_run_delayed_iputs() | | will handle delayed iputs, | | that means start_transaction() | | will be called, which will try | | to get SB_FREEZE_FS lock. | To fix this issue, introduce a "int fs_frozen" to record internally whether fs has been frozen. If fs has been frozen, we can not handle delayed iputs. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment to btrfs_freeze ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
18513091 |
|
25-Jul-2016 |
Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> |
btrfs: update btrfs_space_info's bytes_may_use timely This patch can fix some false ENOSPC errors, below test script can reproduce one false ENOSPC error: #!/bin/bash dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128 dev=$(losetup --show -f fs.img) mkfs.btrfs -f -M $dev mkdir /tmp/mntpoint mount $dev /tmp/mntpoint cd /tmp/mntpoint xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile Above script will fail for ENOSPC reason, but indeed fs still has free space to satisfy this request. Please see call graph: btrfs_fallocate() |-> btrfs_alloc_data_chunk_ondemand() | bytes_may_use += 64M |-> btrfs_prealloc_file_range() |-> btrfs_reserve_extent() |-> btrfs_add_reserved_bytes() | alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not | change bytes_may_use, and bytes_reserved += 64M. Now | bytes_may_use + bytes_reserved == 128M, which is greater | than btrfs_space_info's total_bytes, false enospc occurs. | Note, the bytes_may_use decrease operation will be done in | end of btrfs_fallocate(), which is too late. Here is another simple case for buffered write: CPU 1 | CPU 2 | |-> cow_file_range() |-> __btrfs_buffered_write() |-> btrfs_reserve_extent() | | | | | | | | | ..... | |-> btrfs_check_data_free_space() | | | | |-> extent_clear_unlock_delalloc() | In CPU 1, btrfs_reserve_extent()->find_free_extent()-> btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease operation will be delayed to be done in extent_clear_unlock_delalloc(). Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's btrfs_check_data_free_space() tries to reserve 100MB data space. If 100MB > data_sinfo->total_bytes - data_sinfo->bytes_used - data_sinfo->bytes_reserved - data_sinfo->bytes_pinned - data_sinfo->bytes_readonly - data_sinfo->bytes_may_use btrfs_check_data_free_space() will try to allcate new data chunk or call btrfs_start_delalloc_roots(), or commit current transaction in order to reserve some free space, obviously a lot of work. But indeed it's not necessary as long as decreasing bytes_may_use timely, we still have free space, decreasing 128M from bytes_may_use. To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes(). For compress path, real extent length may not be equal to file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by file content length. Then compress path can update bytes_may_use correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE. As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as PREALLOC, we also need to update bytes_may_use, but can not pass EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use. Meanwhile __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both sucessful and failed path, btrfs_prealloc_file_range()'s callers does not need to call btrfs_free_reserved_data_space() any more. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d2c609b8 |
|
14-Aug-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: properly track when rescan worker is running The qgroup_flags field is overloaded such that it reflects the on-disk status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is used to indicate that a rescan operation is in progress, but if the file system is unmounted while a rescan is running, the rescan operation is paused. If the file system is then mounted read-only, the flag will still be present but the rescan operation will not have been resumed. When we go to umount, btrfs_qgroup_wait_for_completion will see the flag and interpret it to mean that the rescan worker is still running and will wait for a completion that will never come. This patch uses a separate flag to indicate when the worker is running. The locking and state surrounding the qgroup rescan worker needs a lot of attention beyond this patch but this is enough to avoid a hung umount. Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by; Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
66642832 |
|
10-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_abort_transaction, drop root parameter __btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1db1ff92 |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert nodesize macros to static inlines This patch converts the macros used to calculate various node size limits to static inlines. That way we get type checking for free. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14a1e067 |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: introduce BTRFS_MAX_ITEM_SIZE We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in several places. This introduces a BTRFS_MAX_ITEM_SIZE macro to do the same. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0c83b62e |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: cleanup, remove prototype for btrfs_find_root_ref The function isn't implemented anywhere. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5ee5c9a |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root Now that we have a dummy fs_info associated with each test that uses a root, we don't need the DUMMY_ROOT bit anymore. This lets us make choices without needing an actual root like in e.g. btrfs_find_create_tree_block. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c0260ee |
|
20-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: tests, require fs_info for root This allows the upcoming patchset to push nodesize and sectorsize into fs_info. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cdde224 |
|
09-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05653ef3 |
|
15-Jul-2016 |
David Sterba <dsterba@suse.com> |
btrfs: hide test-only member under ifdef Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f376df2b |
|
25-Mar-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add tracepoints for flush events We want to track when we're triggering flushing from our reservation code and what flushing is being done when we start flushing. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
957780eb |
|
17-May-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: introduce ticketed enospc infrastructure Our enospc flushing sucks. It is born from a time where we were early enospc'ing constantly because multiple threads would race in for the same reservation and randomly starve other ones out. So I came up with this solution to block any other reservations from happening while one guy tried to flush stuff to satisfy his reservation. This gives us pretty good correctness, but completely crap latency. The solution I've come up with is ticketed reservations. Basically we try to make our reservation, and if we can't we put a ticket on a list in order and kick off an async flusher thread. This async flusher thread does the same old flushing we always did, just asynchronously. As space is freed and added back to the space_info it checks and sees if we have any tickets that need satisfying, and adds space to the tickets and wakes up anything we've satisfied. Once the flusher thread stops making progress it wakes up all the current tickets and tells them to take a hike. There is a priority list for things that can't flush, since the async flusher could do anything we need to avoid deadlocks. These guys get priority for having their reservation made, and will still do manual flushing themselves in case the async flusher isn't running. This patch gives us significantly better latencies. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
25d609f8 |
|
25-Mar-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: fix callers of btrfs_block_rsv_migrate So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes. Not only this but it unconditionally changes the size of the block_rsv. This isn't a bug strictly speaking, but it makes truncate block rsv's look funny because every time we migrate bytes over its size grows, even though we only want it to be a specific size. So collapse this into one function that takes an update_size argument and make truncate and evict not update the size for consistency sake. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
31b9655f |
|
11-Apr-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: track transid for delayed ref flushing Using the offwakecputime bpf script I noticed most of our time was spent waiting on the delayed ref throttling. This is what is supposed to happen, but sometimes the transaction can commit and then we're waiting for throttling that doesn't matter anymore. So change this stuff to be a little smarter by tracking the transid we were in when we initiated the throttling. If the transaction we get is different then we can just bail out. This resulted in a 50% speedup in my fs_mark test, and reduced the amount of time spent throttling by 60 seconds over the entire run (which is about 30 minutes). Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
81a75f67 |
|
05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
btrfs: use bio fields for op and flags The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is no need to pass around the rq_flag_bits bits too. btrfs users should should access the bio insead. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
01327610 |
|
19-May-2016 |
Nicholas D Steeves <nsteeves@gmail.com> |
btrfs: fix string and comment grammatical issues and typos Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f78c436c |
|
09-May-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between block group relocation and nocow writes Relocation of a block group waits for all existing tasks flushing dellaloc, starting direct IO writes and any ordered extents before starting the relocation process. However for direct IO writes that end up doing nocow (inode either has the flag nodatacow set or the write is against a prealloc extent) we have a short time window that allows for a race that makes relocation proceed without waiting for the direct IO write to complete first, resulting in data loss after the relocation finishes. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) direct IO write starts against an extent in block group X using nocow mode (inode has the nodatacow flag or the write is for a prealloc extent) btrfs_direct_IO() btrfs_get_blocks_direct() --> can_nocow_extent() returns 1 btrfs_inc_block_group_ro(bg X) --> turns block group into RO mode btrfs_wait_ordered_roots() --> returns and does not know about the DIO write happening at CPU 2 (the task there has not created yet an ordered extent) relocate_block_group(bg X) --> rc->stage == MOVE_DATA_EXTENTS find_next_extent() --> returns extent that the DIO write is going to write to relocate_data_extent() relocate_file_extent_cluster() --> reads the extent from disk into pages belonging to the relocation inode and dirties them --> creates DIO ordered extent btrfs_submit_direct() --> submits bio against a location on disk obtained from an extent map before the relocation started btrfs_wait_ordered_range() --> writes all the pages read before to disk (belonging to the relocation inode) relocation finishes bio completes and wrote new data to the old location of the block group So fix this by tracking the number of nocow writers for a block group and make sure relocation waits for that number to go down to 0 before starting to move the extents. The same race can also happen with buffered writes in nocow mode since the patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes when relocating", because we are no longer flushing all delalloc which served as a synchonization mechanism (due to page locking) and ensured the ordered extents for nocow buffered writes were created before we called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow mode existed before that patch (no pages are locked or used during direct IO) and that fixed only races with direct IO writes that do cow. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
9cfa3e34 |
|
26-Apr-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: don't do unnecessary delalloc flushes when relocating Before we start the actual relocation process of a block group, we do calls to flush delalloc of all inodes and then wait for ordered extents to complete. However we do these flush calls just to make sure we don't race with concurrent tasks that have actually already started to run delalloc and have allocated an extent from the block group we want to relocate, right before we set it to readonly mode, but have not yet created the respective ordered extents. The flush calls make us wait for such concurrent tasks because they end up calling filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() -> btrfs_run_delalloc_work()) which ends up serializing us with those tasks due to attempts to lock the same pages (and the delalloc flush procedure calls the allocator and creates the ordered extents before unlocking the pages). These flushing calls not only make us waste time (cpu, IO) but also reduce the chances of writing larger extents (applications might be writing to contiguous ranges and we flush before they finish dirtying the whole ranges). So make sure we don't flush delalloc and just wait for concurrent tasks that have already started flushing delalloc and have allocated an extent from the block group we are about to relocate. This change also ends up fixing a race with direct IO writes that makes relocation not wait for direct IO ordered extents. This race is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) starts direct IO write, target inode currently has no ordered extents ongoing nor dirty pages (delalloc regions), therefore the root for our inode is not in the list fs_info->ordered_roots btrfs_direct_IO() __blockdev_direct_IO() btrfs_get_blocks_direct() btrfs_lock_extent_direct() locks range in the io tree btrfs_new_extent_direct() btrfs_reserve_extent() --> extent allocated from bg X btrfs_inc_block_group_ro(bg X) btrfs_start_delalloc_roots() __start_delalloc_inodes() --> does nothing, no dealloc ranges in the inode's io tree so the inode's root is not in the list fs_info->delalloc_roots btrfs_wait_ordered_roots() --> does not find the inode's root in the list fs_info->ordered_roots --> ends up not waiting for the direct IO write started by the task at CPU 2 relocate_block_group(rc->stage == MOVE_DATA_EXTENTS) prepare_to_relocate() btrfs_commit_transaction() iterates the extent tree, using its commit root and moves extents into new locations btrfs_add_ordered_extent_dio() --> now a ordered extent is created and added to the list root->ordered_extents and the root added to the list fs_info->ordered_roots --> this is too late and the task at CPU 1 already started the relocation btrfs_commit_transaction() btrfs_finish_ordered_io() btrfs_alloc_reserved_file_extent() --> adds delayed data reference for the extent allocated from bg X relocate_block_group(rc->stage == UPDATE_DATA_PTRS) prepare_to_relocate() btrfs_commit_transaction() --> delayed refs are run, so an extent item for the allocated extent from bg X is added to extent tree --> commit roots are switched, so the next scan in the extent tree will see the extent item sees the extent in the extent tree When this happens the relocation produces the following warning when it finishes: [ 7260.832836] ------------[ cut here ]------------ [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1 [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000 [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000 [ 7260.852998] Call Trace: [ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90 [ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2 [ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c [ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] [ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs] [ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19 [ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs] [ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs] [ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63 [ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44 [ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc [ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34 [ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be [ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71 [ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79 [ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- This is because at the end of the first stage, in relocate_block_group(), we commit the current transaction, which makes delayed refs run, the commit roots are switched and so the second stage will find the extent item that the ordered extent added to the delayed refs. But this extent was not moved (ordered extent completed after first stage finished), so at the end of the relocation our block group item still has a positive used bytes counter, triggering a warning at the end of btrfs_relocate_block_group(). Later on when trying to read the extent contents from disk we hit a BUG_ON() due to the inability to map a block with a logical address that belongs to the block group we relocated and is no longer valid, resulting in the following trace: [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096 [ 7344.887518] ------------[ cut here ]------------ [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1 [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000 [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001 [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000 [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000 [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0 [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000 [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0 [ 7344.888431] Stack: [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950 [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000 [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000 [ 7344.888431] Call Trace: [ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs] [ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs] [ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf [ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs] [ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10 [ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd [ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs] [ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc [ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207 [ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207 [ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154 [ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f [ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1 [ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d [ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2 [ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e [ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0 [ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP <ffff8802046878f0> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
|
#
db671160 |
|
01-Apr-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: uapi/linux/btrfs_tree.h migration, item types and defines The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly to userspace. In order to decode them, full type information is required. Create a new header, btrfs_tree to contain these since most users won't need them. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
33ca9133 |
|
01-Apr-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE ioctl. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
18db9ac6 |
|
01-Apr-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: uapi/linux/btrfs.h migration, move feature flags The compat/compat_ro/incompat feature flags are used by the feature set/get ioctls. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
83288b60 |
|
01-Apr-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: uapi/linux/btrfs.h migration, qgroup limit flags The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d4ae133b |
|
01-Apr-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and BTRFS_IOC_SET_FSLABEL ioctls. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c63c245 |
|
29-Oct-2015 |
Luke Dashjr <luke@dashjr.org> |
btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl 32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr fail. Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c5f4ccb2 |
|
16-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move error handling code together in ctree.h Looks like we added the incompatible defines in between the error handling defines in the file ctree.h. Now group them back. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2351d743 |
|
16-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove unused function btrfs_assert() Apparently looks like ASSERT does the same intended job, as intended btrfs_assert(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34d97007 |
|
16-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_std_error to btrfs_handle_fs_error btrfs_std_error() handles errors, puts FS into readonly mode (as of now). So its good idea to rename it to btrfs_handle_fs_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bb7ab3b9 |
|
04-Mar-2016 |
Adam Buchbinder <adam.buchbinder@gmail.com> |
btrfs: Fix misspellings in comments. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ebb8765b |
|
10-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move btrfs_compression_type to compression.h So that its better organized. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
73beece9 |
|
17-Jul-2015 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix lockdep deadlock warning due to dev_replace Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] *** DEADLOCK *** Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d5131b65 |
|
17-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused argument in btrfs_ioctl_get_supported_features Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c5868f83 |
|
17-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls The control device is accessible when no filesystem is mounted and we may want to query features supported by the module. This is already possible using the sysfs files, this ioctl is for parity and convenience. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7e98a7f |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: change max_inline default to 2048 The current practical default is ~4k on x86_64 (the logic is more complex, simplified for brevity), the inlined files land in the metadata group and thus consume space that could be needed for the real metadata. The inlining brings some usability surprises: 1) total space consumption measured on various filesystems and btrfs with DUP metadata was quite visible because of the duplicated data within metadata 2) inlined data may exhaust the metadata, which are more precious in case the entire device space is allocated to chunks (ie. balance cannot make the space more compact) 3) performance suffers a bit as the inlined blocks are duplicate and stored far away on the device. Proposed fix: set the default to 2048 This fixes namely 1), the total filesysystem space consumption will be on par with other filesystems. Partially fixes 2), more data are pushed to the data block groups. The characteristics of 3) are based on actual small file size distribution. The change is independent of the metadata blockgroup type (though it's most visible with DUP) or system page size as these parameters are not trival to find out, compared to file size. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0138b6fe |
|
26-Jan-2016 |
Byongho Lee <bhlee.kernel@gmail.com> |
btrfs: simplify expression in btrfs_calc_trans_metadata_size() Simplify expression in btrfs_calc_trans_metadata_size(). Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fefd558 |
|
07-Jan-2016 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: reada: limit max works count Reada creates 2 works for each level of tree recursively. In case of a tree having many levels, the number of created works is 2^level_of_tree. Actually we don't need so many works in parallel, this patch limits max works to BTRFS_MAX_MIRRORS * 2. The per-fs works_counter will be also used for btrfs_reada_wait() to check is there are background workers. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02873e43 |
|
31-Dec-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: reada: Use fs_info instead of root in __readahead_hook's argument What __readahead_hook() need exactly is fs_info, no need to convert fs_info to root in caller and convert back in __readahead_hook() Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
96da0919 |
|
18-Jan-2016 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Introduce new mount option to disable tree log replay Introduce a new mount option "nologreplay" to co-operate with "ro" mount option to get real readonly mount, like "norecovery" in ext* and xfs. Since the new parse_options() need to check new flags at remount time, so add a new parameter for parse_options(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8dcddfa0 |
|
18-Jan-2016 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Introduce new mount option usebackuproot to replace recovery Current "recovery" mount option will only try to use backup root. However the word "recovery" is too generic and may be confusing for some users. Here introduce a new and more specific mount option, "usebackuproot" to replace "recovery" mount option. "Recovery" will be kept for compatibility reason, but will be deprecated. Also, since "usebackuproot" will only affect mount behavior and after open_ctree() it has nothing to do with the filesystem, so clear the flag after mount succeeded. This provides the basis for later unified "norecovery" mount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> [ dropped usebackuproot from show_mount, added note about 'recovery' to docs ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
242e2956 |
|
25-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: switch dev stats item to the permanent item key Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50c2d5ab |
|
25-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: introduce key type for persistent permanent items The number of distinct key types is not that big that we could waste one for something new we want to store in the tree. Similar to the temporary items, we'll introduce a new name for an existing key value and use the objectid for further extension. The victim is the BTRFS_DEV_STATS_KEY (248). The device stats are an example of a permanent item. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0bbbccb1 |
|
25-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: introduce key type for persistent temporary items The number of distinct key types is not that big that we could waste one for something new we want to store in the tree. We'll introduce a new name for an existing key value and use the objectid for further extension. The victim is the BTRFS_BALANCE_ITEM_KEY (248). The nature of the balance status item is a good example of the temporary item. It exists from beginning of the balance, keeps the status until it finishes. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9703fefe |
|
21-Jan-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
Btrfs: fallocate: Work with sectorsized blocks While at it, this commit changes btrfs_truncate_page() to truncate sectorsized blocks instead of pages. Hence the function has been renamed to btrfs_truncate_block(). Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2e78c927 |
|
21-Jan-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE units. Fix this by doing reservation/releases in block size units. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0bc19f90 |
|
06-Jan-2016 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: merge functions for wait snapshot creation wait_for_snapshot_creation() is in same group with oher two: btrfs_start_write_no_snapshoting() btrfs_end_write_no_snapshoting() Rename wait_for_snapshot_creation() and move it into same place with other two. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c2d6cb16 |
|
15-Jan-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock running delayed iputs at transaction commit time While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4fb72bf2 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: use smaller type for btrfs_path locks The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see: * overall size of btrfs_path drops down from 136 to 112 (-24 bytes), * better packing in a slab page +6 objects * the whole structure now fits to 2 cachelines * slight decrease in code size: text data bss dec hex filename 938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before 938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after (and the generated assembly does not change much) The main purpose is to decrease the size of the structure without affecting performance. The byte access is usually well behaving accross arches, the locks are not accessed frequently and sometimes just compared to zero. Note for further size reduction attempts: the slots could be made u16 but this might generate worse code on some arches (non-byte and non-int access). Also the range of operations on slots is wider compared to locks and the potential performance drop should be evaluated first. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7853f15b |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: use smaller type for btrfs_path lowest_level The level is 0..7, we can use smaller type. The size of btrfs_path is now 136 bytes from 144, which is +2 objects that fit into a 4k slab. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dccabfad |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: use smaller type for btrfs_path reada The possible values for reada are all positive and bounded, we can later save some bytes by storing it in u8. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4058b54 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup, use enum values for btrfs_path reada Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4d4ab6d6 |
|
19-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: constify static arrays There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee22184b |
|
14-Dec-2015 |
Byongho Lee <bhlee.kernel@gmail.com> |
Btrfs: use linux/sizes.h to represent constants We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9780c497 |
|
18-Oct-2015 |
Alexandru Moise <00moses.alexander00@gmail.com> |
btrfs: switch __btrfs_fs_incompat return type from int to bool Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly returning boolean not int. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b3909f8 |
|
19-Dec-2015 |
Darrick J. Wong <darrick.wong@oracle.com> |
btrfs: use new dedupe data function pointer Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
511711af |
|
30-Dec-2015 |
Chris Mason <clm@fb.com> |
btrfs: don't run delayed references while we are creating the free space tree This is a short term solution to make sure btrfs_run_delayed_refs() doesn't change the extent tree while we are scanning it to create the free space tree. Longer term we need to synchronize scanning the block groups one by one, similar to what happens during a balance. Signed-off-by: Chris Mason <clm@fb.com>
|
#
70f6d82e |
|
29-Sep-2015 |
Omar Sandoval <osandov@fb.com> |
Btrfs: add free space tree mount option Now we can finally hook up everything so we can actually use free space tree. The free space tree is enabled by passing the space_cache=v2 mount option. On the first mount with the this option set, the free space tree will be created and the FREE_SPACE_TREE read-only compat bit will be set. Any time the filesystem is mounted from then on, we must use the free space tree. The clear_cache option will also clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
a5ed9182 |
|
29-Sep-2015 |
Omar Sandoval <osandov@fb.com> |
Btrfs: implement the free space B-tree The free space cache has turned out to be a scalability bottleneck on large, busy filesystems. When the cache for a lot of block groups needs to be written out, we can get extremely long commit times; if this happens in the critical section, things are especially bad because we block new transactions from happening. The main problem with the free space cache is that it has to be written out in its entirety and is managed in an ad hoc fashion. Using a B-tree to store free space fixes this: updates can be done as needed and we get all of the benefits of using a B-tree: checksumming, RAID handling, well-understood behavior. With the free space tree, we get commit times that are about the same as the no cache case with load times slower than the free space cache case but still much faster than the no cache case. Free space is represented with extents until it becomes more space-efficient to use bitmaps, giving us similar space overhead to the free space cache. The operations on the free space tree are: adding and removing free space, handling the creation and deletion of block groups, and loading the free space for a block group. We can also create the free space tree by walking the extent tree and clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
208acb8c |
|
29-Sep-2015 |
Omar Sandoval <osandov@fb.com> |
Btrfs: introduce the free space B-tree on-disk format The on-disk format for the free space tree is straightforward. Each block group is represented in the free space tree by a free space info item that stores accounting information: whether the free space for this block group is stored as bitmaps or extents and how many extents of free space exist for this block group (regardless of which format is being used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys with no corresponding item, and bitmaps instead have the FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an array of bytes. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
73fa48b6 |
|
29-Sep-2015 |
Omar Sandoval <osandov@fb.com> |
Btrfs: refactor caching_thread() We're also going to load the free space tree from caching_thread(), so we should refactor some of the common code. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1abfbcdf |
|
29-Sep-2015 |
Omar Sandoval <osandov@fb.com> |
Btrfs: add helpers for read-only compat bits We're finally going to add one of these for the free space tree, so let's add the same nice helpers that we have for the incompat bits. While we're add it, also add helpers to clear the bits. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
04b38d60 |
|
02-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
vfs: pull btrfs clone API to vfs layer The btrfs clone ioctls are now adopted by other file systems, with NFS and CIFS already having support for them, and XFS being under active development. To avoid growth of various slightly incompatible implementations, add one to the VFS. Note that clones are different from file copies in several ways: - they are atomic vs other writers - they support whole file clones - they support 64-bit legth clones - they do not allow partial success (aka short writes) - clones are expected to be a fast metadata operation Because of that it would be rather cumbersome to try to piggyback them on top of the recent clone_file_range infrastructure. The converse isn't true and the clone_file_range system call could try clone file range as a first attempt to copy, something that further patches will enable. Based on earlier work from Peng Tao. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
30424601 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: remove wait from struct btrfs_delalloc_work The value is 0 and never changes, we can propagate the value, remove wait and the implied dead code. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
651d494a |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: sink parameter wait to btrfs_alloc_delalloc_work There's only one caller and single value, we can propagate it down to the callee and remove the unused parameter. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3db11b2e |
|
10-Nov-2015 |
Zach Brown <zab@redhat.com> |
btrfs: add .copy_file_range file operation This rearranges the existing COPY_RANGE ioctl implementation so that the .copy_file_range file operation can call the core loop that copies file data extent items. The extent copying loop is lifted up into its own function. It retains the core btrfs error checks that should be shared. Signed-off-by: Zach Brown <zab@redhat.com> [Anna Schumaker: Make flags an unsigned int, Check for COPY_FR_REFLINK] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
758f2dfc |
|
19-Nov-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix scrub preventing unused block groups from being deleted Currently scrub can race with the cleaner kthread when the later attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting later the block group - unless the block group becomes used and unused again. The following diagram illustrates that race: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs and removes it from that list scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO sees the block group is already RO and therefore doesn't delete it nor adds it back to unused list So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finished scrubbing it and it hasn't been removed already. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7fd01182 |
|
13-Nov-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix the number of transaction units needed to remove a block group We were using only 1 transaction unit when attempting to delete an unused block group but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree. While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible heigth (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However after adding the orphan item, writepages() can by called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before, this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher heigth and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan. In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough. We had a couple user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example: http://www.spinics.net/lists/linux-btrfs/msg46070.html So fix this by reserving all the necessary units of metadata. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8eab77ff |
|
13-Nov-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: use global reserve when deleting unused block group after ENOSPC It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c62d2555 |
|
06-Nov-2015 |
Michal Hocko <mhocko@suse.com> |
mm, fs: introduce mapping_gfp_constraint() There are many places which use mapping_gfp_mask to restrict a more generic gfp mask which would be used for allocations which are not directly related to the page cache but they are performed in the same context. Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5846a3c2 |
|
26-Oct-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans Between btrfs_allocerved_file_extent() and btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs are run and delayed ref head maybe freed before btrfs_add_delayed_qgroup_reserve(). This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT, and cause transaction to be aborted. This patch will record qgroup reserve space info into delayed_ref_head at btrfs_add_delayed_ref(), to eliminate the race window. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
bc309467 |
|
20-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: extend balance filter usage to take minimum and maximum Similar to the 'limit' filter, we can enhance the 'usage' filter to accept a range. The change is backward compatible, the range is applied only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag. We don't have a usecase yet, the current syntax has been sufficient. The enhancement should provide parity with other range-like filters. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
dee32d0a |
|
28-Sep-2015 |
GabrÃel Arthúr Pétursson <gabriel@system.is> |
btrfs: add balance filter for stripes Balance block groups which have the given number of stripes, defined by a range min..max. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. Signed-off-by: GabrÃel Arthúr Pétursson <gabriel@system.is> [ renamed bargs members, added to the UAPI, wrote the changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
12907fc7 |
|
10-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: extend balance filter limit to take minimum and maximum The 'limit' filter is underdesigned, it should have been a range for [min,max], with some relaxed semantics when one of the bounds is missing. Besides that, using a full u64 for a single value is a waste of bytes. Let's fix both by extending the use of the u64 bytes for the [min,max] range. This can be done in a backward compatible way, the range will be interpreted only if the appropriate flag is set (BTRFS_BALANCE_ARGS_LIMIT_RANGE). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b06c4bf5 |
|
23-Oct-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix regression running delayed references when using qgroups In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
|
#
c759c4e1 |
|
02-Oct-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: don't keep trying to build clusters if we are fragmented If we are extremely fragmented then we won't be able to create a free_cluster. So if this happens set last_ptr->fragmented so that all future allcations will give up trying to create a cluster. When we unpin extents we will unset ->fragmented if we free up a sufficient amount of space in a block group. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4f4db217 |
|
29-Sep-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: keep track of max_extent_size per space_info When we are heavily fragmented we can induce a lot of latency trying to make an allocation happen that is simply not going to happen. Thankfully we keep track of our max_extent_size when going through the allocator, so if we get to the point where we are exiting find_free_extent with ENOSPC then set our space_info->max_extent_size so we can keep future allocations from having to pay this cost. We reset the max_extent_size whenever we release pinned bytes back into this space info so we can redo all the work. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d0bd4560 |
|
23-Sep-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add fragment=* debug mount option In tracking down these weird bitmap problems it was helpful to artificially create an extremely fragmented file system. These mount options let us either fragment data or metadata or both. With these options I could reproduce all sorts of weird latencies and hangs that occur under extreme fragmentation and get them fixed. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
51773bec |
|
08-Oct-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook In clear_bit_hook, qgroup reserved data is already handled quite well, either released by finish_ordered_io or invalidatepage. So calling btrfs_qgroup_free_data() here is completely meaningless, and since btrfs_qgroup_free_data() will lock io_tree, so it can't be called with io_tree lock hold. This patch will add a new function btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease the lockdep warning. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7cf5b976 |
|
08-Sep-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Cleanup old inaccurate facilities Cleanup the old facilities which use old btrfs_qgroup_reserve() function call, replace them with the newer version, and remove the "__" prefix in them. Also, make btrfs_qgroup_reserve/free() functions private, as they are now only used inside qgroup codes. Now, the whole btrfs qgroup is swithed to use the new reserve facilities. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1ada3a62 |
|
08-Sep-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: extent-tree: Add new version of btrfs_delalloc_reserve/release_space Add new version of btrfs_delalloc_reserve_space() and btrfs_delalloc_release_space() functions, which supports accurate qgroup reserve. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4ceff079 |
|
08-Sep-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: extent-tree: Add new version of btrfs_check_data_free_space and btrfs_free_reserved_data_space. Add new functions __btrfs_check_data_free_space() and __btrfs_free_reserved_data_space() to work with new accurate qgroup reserved space framework. The new function will replace old btrfs_check_data_free_space() and btrfs_free_reserved_data_space() respectively, but until all the change is done, let's just use the new name. Also, export internal use function btrfs_alloc_data_chunk_ondemand(), as now qgroup reserve requires precious bytes, some operation can't get the accurate number in advance(like fallocate). But data space info check and data chunk allocate doesn't need to be that accurate, and can be called at the beginning. So export it for later operations. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
55eeaf05 |
|
08-Sep-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Introduce new functions to reserve/free metadata Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free metadata reserved space. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1dd6d7ca |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: introduce ratelimited variants of message printing functions Signed-off-by: David Sterba <dsterba@suse.com>
|
#
24aa6b41 |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: introduce ratelimited _in_rcu variants of message printing functions Signed-off-by: David Sterba <dsterba@suse.com>
|
#
08a84e25 |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: introduce _in_rcu variants of message printing functions Due to the missing variants there are messages that lack the information printed by btrfs_info etc helpers. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4553fef |
|
25-Sep-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: consolidate btrfs_error() to btrfs_std_error() btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6618a59b |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
96f3136e |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4027a20 |
|
09-Jul-2015 |
Byongho Lee <bhlee.kernel@gmail.com> |
Btrfs: remove unused mutex from struct 'btrfs_fs_info' The code using 'ordered_extent_flush_mutex' mutex has removed by below commit. - 8d875f95da43c6a8f18f77869f2ef26e9594fecc btrfs: disable strict file flushes for renames and truncates But the mutex still lives in struct 'btrfs_fs_info'. So, this patch removes the mutex from struct 'btrfs_fs_info' and its initialization code. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
147d256e |
|
06-Aug-2015 |
Zhaolei <zhaolei@cn.fujitsu.com> |
btrfs: Remove unnecessary variants in relocation.c These arguments are not used in functions, remove them for cleanup and make kernel stack happy. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
868f401a |
|
05-Aug-2015 |
Zhaolei <zhaolei@cn.fujitsu.com> |
btrfs: Use ref_cnt for set_block_group_ro() More than one code call set_block_group_ro() and restore rw in fail. Old code use bool bit to save blockgroup's ro state, it can not support parallel case(it is confirmd exist in my debug log). This patch use ref count to store ro state, and rename set_block_group_ro/set_block_group_rw to inc_block_group_ro/dec_block_group_ro. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e33e17ee |
|
15-Jun-2015 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: add missing discards when unpinning extents with -o discard When we clear the dirty bits in btrfs_delete_unused_bgs for extents in the empty block group, it results in btrfs_finish_extent_commit being unable to discard the freed extents. The block group removal patch added an alternate path to forget extents other than btrfs_finish_extent_commit. As a result, any extents that would be freed when the block group is removed aren't discarded. In my test run, with a large copy of mixed sized files followed by removal, it left nearly 2/3 of extents undiscarded. To clean up the block groups, we add the removed block group onto a list that will be discarded after transaction commit. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
67c5e7d4 |
|
10-Jun-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between balance and unused block group deletion We have a race between deleting an unused block group and balancing the same block group that leads to an assertion failure/BUG(), producing the following trace: [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622 [181631.220591] ------------[ cut here ]------------ [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062! [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$ [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1 [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000 [181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs] [181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246 [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000 [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200 [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000 [181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000 [181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0 [181631.224566] Stack: [181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e [181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001 [181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000 [181631.224566] Call Trace: [181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs] [181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs] [181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs] [181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs] [181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs] [181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs] [181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf [181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs] [181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15 [181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc [181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2 [181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2 [181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424 [181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479 (...) The sequence of steps leading to this are: CPU 0 CPU 1 btrfs_balance() btrfs_relocate_chunk() btrfs_relocate_block_group(bg X) btrfs_lookup_block_group(bg X) cleaner_kthread locks fs_info->cleaner_mutex btrfs_delete_unused_bgs() finds bg X, which became unused in the previous transaction checks bg X ->ro == 0, so it proceeds sets bg X ->ro to 1 (btrfs_set_block_group_ro(bg X)) blocks on fs_info->cleaner_mutex btrfs_remove_chunk(bg X) unlocks fs_info->cleaner_mutex acquires fs_info->cleaner_mutex relocate_block_group() --> does nothing, no extents found in the extent tree from bg X unlocks fs_info->cleaner_mutex btrfs_relocate_block_group(bg X) returns btrfs_remove_chunk(bg X) extent map not found --> ASSERT(0) Fix this by using a new mutex to make sure these 2 operations, block group relocation and removal, are serialized. This issue is reproducible by running fstests generic/038 (which stresses chunk allocation and automatic removal of unused block groups) together with the following balance loop: while true; do btrfs balance start -dusage=0 <mountpoint> ; done Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e69bcee3 |
|
16-Apr-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. Goodbye, the old mechanisim. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
20b2e302 |
|
04-Jun-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() lockdep report following warning in test: [25176.843958] ================================= [25176.844519] [ INFO: inconsistent lock state ] [25176.845047] 4.1.0-rc3 #22 Tainted: G W [25176.845591] --------------------------------- [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes: [25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs] [25176.847838] {SOFTIRQ-ON-W} state was registered at: [25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10 [25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0 [25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410 [25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs] [25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs] [25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs] [25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs] [25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs] [25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs] [25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570 [25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80 [25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f [25176.854935] irq event stamp: 51506 [25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0 [25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120 [25176.857746] other info that might help us debug this: [25176.858845] Possible unsafe locking scenario: [25176.859981] CPU0 [25176.860537] ---- [25176.861059] lock(&wr_ctx->wr_lock); [25176.861705] <Interrupt> [25176.862272] lock(&wr_ctx->wr_lock); [25176.862881] *** DEADLOCK *** Reason: Above warning is caused by: Interrupt -> bio_endio() -> ... -> scrub_put_ctx() -> scrub_free_ctx() *1 -> ... -> mutex_lock(&wr_ctx->wr_lock); scrub_put_ctx() is allowed to be called in end_bio interrupt, but in code design, it will never call scrub_free_ctx(sctx) in interrupe context(above *1), because btrfs_scrub_dev() get one additional reference of sctx->refs, which makes scrub_free_ctx() only called withine btrfs_scrub_dev(). Now the code runs out of our wish, because free sequence in scrub_pending_bio_dec() have a gap. Current code: -----------------------------------+----------------------------------- scrub_pending_bio_dec() | btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); | wake_up(&sctx->list_wait); | | scrub_put_ctx() | -> atomic_dec_and_test(&sctx->refs) scrub_put_ctx(sctx); | -> atomic_dec_and_test(&sctx->refs)| -> scrub_free_ctx() | -----------------------------------+----------------------------------- We expected: -----------------------------------+----------------------------------- scrub_pending_bio_dec() | btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); | wake_up(&sctx->list_wait); | scrub_put_ctx(sctx); | -> atomic_dec_and_test(&sctx->refs)| | scrub_put_ctx() | -> atomic_dec_and_test(&sctx->refs) | -> scrub_free_ctx() -----------------------------------+----------------------------------- Fix: Move scrub_pending_bio_dec() to a workqueue, to avoid this function run in interrupt context. Tested by check tracelog in debug. Changelog v1->v2: Use workqueue instead of adjust function call sequence in v1, because v1 will introduce a bug pointed out by: Filipe David Manana <fdmanana@gmail.com> Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4617ea3a |
|
09-Jun-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix necessary chunk tree space calculation when allocating a chunk When allocating a new chunk or removing one we need to update num_devs device items and insert or remove a chunk item in the chunk tree, so in the worst case the space needed in the chunk space_info is: btrfs_calc_trunc_metadata_size(chunk_root, num_devs) + btrfs_calc_trans_metadata_size(chunk_root, 1) That is, in the worst case we need to cow num_devs paths and cow 1 other path that can result in splitting every node and leaf, and each path consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes, which were unnecessary since updating the existing device items does not result in splitting the nodes and leaf since after updating them they remain with the same size. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
39c2d7fa |
|
20-May-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix -ENOSPC on block group removal Unlike when attempting to allocate a new block group, where we check that we have enough space in the system space_info to update the device items and insert a new chunk item in the chunk tree, we were not checking if the system space_info had enough space for updating the device items and deleting the chunk item in the chunk tree. This often lead to -ENOSPC error when attempting to allocate blocks for the chunk tree (during btree node/leaf COW operations) while updating the device items or deleting the chunk item, which resulted in the current transaction being aborted and turning the filesystem into read-only mode. While running fstests generic/038, which stresses allocation of block groups and removal of unused block groups, with a large scratch device (750Gb) this happened often, despite more than enough unallocated space, and resulted in the following trace: [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [68663.600407] BTRFS: Transaction aborted (error -28) (...) [68663.730829] Call Trace: [68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs] [68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs] [68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs] [68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs] [68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c [68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs] [68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs] [68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7 [68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90 [68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.849503] ---[ end trace 798477c6d6dbaad6 ]--- [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left So fix this by verifying that enough space exists in system space_info, and reserving the space in the chunk block reserve, before attempting to delete the block group and allocate a new system chunk if we don't have enough space to perform the necessary updates and delete in the chunk tree. Like for the block group creation case, we don't error our if we fail to allocate a new system chunk, since we might end up not needing it (no node/leaf splits happen during the COW operations and/or we end up not needing to COW any btree nodes or leafs because they were already COWed in the current transaction and their writeback didn't start yet). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4fbcdf66 |
|
20-May-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix -ENOSPC when finishing block group creation While creating a block group, we often end up getting ENOSPC while updating the chunk tree, which leads to a transaction abortion that produces a trace like the following: [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [30670.117777] BTRFS: Transaction aborted (error -28) (...) [30670.163567] Call Trace: [30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs] [30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs] [30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs] [30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15 [30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de [30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62 [30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [30670.187792] ---[ end trace 0373e6b491c4a8cc ]--- This is because we don't do proper space reservation for the chunk block reserve when we have multiple tasks allocating chunks in parallel. So block group creation has 2 phases, and the first phase essentially checks if there is enough space in the system space_info, allocating a new system chunk if there isn't, while the second phase updates the device, extent and chunk trees. However, because the updates to the chunk tree happen in the second phase, if we have N tasks, each with its own transaction handle, allocating new chunks in parallel and if there is only enough space in the system space_info to allocate M chunks, where M < N, none of the tasks ends up allocating a new system chunk in the first phase and N - M tasks will get -ENOSPC when attempting to update the chunk tree in phase 2 if they need to COW any nodes/leafs from the chunk tree. Fix this by doing proper reservation in the chunk block reserve. The issue could be reproduced by running fstests generic/038 in a loop, which eventually triggered the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1f6e4b3f |
|
04-May-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Fix superblock csum type check. Old csum type check is wrong and can't catch csum_type 1(not supported). Fix it to avoid hostile 0 division. Reported-by: Lukas Lueg <lukas.lueg@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c0d19e2b |
|
24-Apr-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: add 'cold' compiler annotations to all error handling functions The annotated functios will be placed into .text.unlikely section. The annotation also hints compiler to move the code out of the hot paths, and may implicitly mark if-statement leading to that block as unlikely. This is a heuristic, the impact on the generated code is not significant. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1a9a8a71 |
|
24-Apr-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: report exact callsite where transaction abort occurs WARN is called from a single location and all bugreports say that's in super.c __btrfs_abort_transaction. This is slightly confusing as we'd rather want to know the exact callsite. Whereas this information is printed in the syslog below the stacktrace, this requires further look and we usually see only the headline from WARNING. Moving the WARN into the macro has to inline some code and increases code by a few kilobytes: text data bss dec hex filename 835481 20305 14120 869906 d4612 btrfs.ko.before 842883 20305 14120 877308 d62fc btrfs.ko.after The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config. The increase is not small and could lead to worse icache use. The code is on error/exit paths that can be recognized by compiler as cold and moved out of the way so the impact is speculated to be low, if measurable at all. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2e7910d6 |
|
09-Mar-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices This patch will provide a framework and help to create attributes from the structure btrfs_fs_devices which are available even before fs_info is created. So by moving the parent kobject super_kobj from fs_info to btrfs_fs_devices, it will help to create attributes from the btrfs_fs_devices as well. Patches on top of this patch now will be able to create the sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices when devices are scanned and registered to the kernel. Just to note, this does not change any of the existing btrfs sysfs external kobject names and its attributes and not even the life cycle of them. Changes are internal only. And to ensure the same, this path has been tested with various device operations and, checking and comparing the sysfs kobjects and attributes with sysfs kobject and attributes with out this patch, and they remain same. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
e09fe2d2 |
|
27-Feb-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created Btrfs will create qgroup on subvolume creation if quota is enabled, but qgroup uses the high bits(currently 16 bits) as level, to build the inheritance. However it is fully possible a subvolume can be created with a subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be considered as level 1 and can't be assigned to other qgroup in level 1. This patch will prevent such things so qgroup inheritance will not be screwed up. The downside is very clear, btrfs subvolume number limit will decrease from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to (u48 max -256(first free objectid)). But we still have near u48(that's 15 digits in dec), so that should not be a huge problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8465ecec |
|
27-Feb-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Check qgroup level in kernel qgroup assign. Although we have qgroup level check in btrfs-progs, it's not enough since other programe may still call ioctl directly not using btrfs-progs. For example, systemd. But it's btrfs-progs to be blame since we don't provide a full-function(like subvolume create things) btrfs library with enough check, and only rely on kernel ioctl. So Add level checks in kernel too. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e2d1f923 |
|
06-Feb-2015 |
Dongsheng Yang <yangds.fnst@cn.fujitsu.com> |
btrfs: qgroup: do a reservation in a higher level. There are two problems in qgroup: a). The PAGE_CACHE is 4K, even when we are writing a data of 1K, qgroup will reserve a 4K size. It will cause the last 3K in a qgroup is not available to user. b). When user is writing a inline data, qgroup will not reserve it, it means this is a window we can exceed the limit of a qgroup. The main idea of this patch is reserving the data size of write_bytes rather than the reserve_bytes. It means qgroup will not care about the data size btrfs will reserve for user, but only care about the data size user is going to write. Then reserve it when user want to write and release it in transaction committed. In this way, qgroup can be released from the complex procedure in btrfs and only do the reserve when user want to write and account when the data is written in commit_transaction(). Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d7c15171 |
|
25-Feb-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Fix NO_SPACE bug caused by delayed-iput Steps to reproduce: while true; do dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%] rm /btrfs_dir/file sync done And we'll see dd failed because btrfs return NO_SPACE. Reason: Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs() in end to free fs space for next write, but sometimes it hadn't done work on time, because btrfs-cleaner thread get delayed-iputs from list before, but do iput() after next write. This is log: [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs() [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs() [ 2569.087554] comm=sync func=btrfs_commit_transaction() end [ 2569.191081] comm=dd begin [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28 [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888 [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512 ... [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496 [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end Fix: Make btrfs_commit_transaction() wait current running btrfs-cleaner's delayed-iputs() done in end. Test: Use script similar to above(more complex), before patch: 7 failed in 100 * 20 loop. after patch: 0 failed in 100 * 20 loop. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
cdfb080e |
|
06-Apr-2015 |
Chris Mason <clm@fb.com> |
Btrfs: fix use after free when close_ctree frees the orphan_rsv Near the end of close_ctree, we're calling btrfs_free_block_rsv to free up the orphan rsv. The problem is this call updates the space_info, which has already been freed. This adds a new __ function that directly calls kfree instead of trying to update the space infos. Signed-off-by: Chris Mason <clm@fb.com>
|
#
1bbc621e |
|
06-Apr-2015 |
Chris Mason <clm@fb.com> |
Btrfs: allow block group cache writeout outside critical section in commit We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is currect, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: Chris Mason <clm@fb.com>
|
#
c9dc4c65 |
|
04-Apr-2015 |
Chris Mason <clm@fb.com> |
Btrfs: two stage dirty block group writeout Block group cache writeout is currently waiting on the pages for each block group cache before moving on to writing the next one. This commit switches things around to send down all the caches and then wait on them in batches. The end result is much faster, since we're keeping the disk pipeline full. Signed-off-by: Chris Mason <clm@fb.com>
|
#
4c6d1d85 |
|
06-Apr-2015 |
Chris Mason <clm@fb.com> |
btrfs: move struct io_ctl into ctree.h and rename it We'll need to put the io_ctl into the block_group cache struct, so name it struct btrfs_io_ctl and move it into ctree.h Signed-off-by: Chris Mason <clm@fb.com>
|
#
28f75a0e |
|
04-Feb-2015 |
Chris Mason <clm@fb.com> |
Btrfs: refill block reserves during truncate When truncate starts, it allocates some space in the block reserves so that we'll have enough to update metadata along the way. For very large files, we can easily go through all of that space as we loop through the extents. This changes truncate to refill the space reservation as it progresses through the file. Signed-off-by: Chris Mason <clm@fb.com>
|
#
6a3891c5 |
|
16-Mar-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add sanity test for outstanding_extents accounting I introduced a regression wrt outstanding_extents accounting. These are tricky areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE at any time. So add sanity tests to cover the various conditions that are tricky in order to make sure we don't introduce regressions in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
dcdf7f6d |
|
02-Mar-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: prepare block group cache before writing Writing the block group cache will modify the extent tree quite a bit because it truncates the old space cache and pre-allocates new stuff. To try and cut down on the churn lets do the setup dance first, then later on hopefully we can avoid looping with newly dirtied roots. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
3284da7b |
|
25-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: use explicit initializer for seq_elem Using {} as initializer for struct seq_elem does not properly initialize the list_head member, but it currently works because it gets set through btrfs_get_tree_mod_seq if 'seq' is 0. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
b7a0365e |
|
11-Nov-2014 |
Daniel Dressler <danieru.dressler@gmail.com> |
Btrfs: ctree: reduce args where only fs_info used This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. This causes programmers to ponder which root can be passed. Since only the fs_info is read affected functions can accept any root, except this is only obvious upon inspection. This patch reduces the specificty of such functions to accept the fs_info directly. This patch does not address the two functions in ctree.c (insert_ptr, and split_item) which only use root for BUG_ONs in ctree.c This patch affects the following functions: 1) fixup_low_keys 2) btrfs_set_item_key_safe Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
dcab6a3b |
|
11-Feb-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: account for large extents with enospc On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d4b450cd |
|
29-Jan-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between transaction commit and empty block group removal Committing a transaction can race with automatic removal of empty block groups (cleaner kthread), leading to a BUG_ON() in the transaction commit code while running btrfs_finish_extent_commit(). The following sequence diagram shows how it can happen: CPU 1 CPU 2 btrfs_commit_transaction() fs_info->running_transaction = NULL btrfs_finish_extent_commit() find_first_extent_bit() -> found range for block group X in fs_info->freed_extents[] btrfs_delete_unused_bgs() -> found block group X Removed block group X's range from fs_info->freed_extents[] btrfs_remove_chunk() btrfs_remove_block_group(bg X) unpin_extent_range(bg X range) btrfs_lookup_block_group(bg X) -> returns NULL -> BUG_ON() The trace that results from the BUG_ON() is: [48665.187808] ------------[ cut here ]------------ [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675! [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000 [48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246 [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8 [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0 [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000 [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000 [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000 [48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0 [48665.197388] Stack: [48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff [48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0 [48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227 [48665.197388] Call Trace: [48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs] [48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs] [48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs] [48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33 [48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs] [48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab [48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab [48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf [48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0 [48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3 [48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7 [48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b [48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP <ffff8801b56a7b88> [48665.272246] ---[ end trace b9c6ab9957521376 ]--- Fix this by ensuring that unpining the block group's range in btrfs_finish_extent_commit() is done in a synchronized fashion with removing the block group's range from freed_extents[] in btrfs_delete_unused_bgs() This race got introduced with the change: Btrfs: remove empty block groups automatically commit 47ab2a6c689913db23ccae38349714edf8365e0a Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
a937b979 |
|
12-Dec-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: kill btrfs_inode_*time helpers They just opencode taking address of the timespec member. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
78f55e5e |
|
12-Jan-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: fix unused members in struct btrfs_root There isn't any real use of following members of struct btrfs_root so delete them. struct kobject root_kobj; struct completion kobj_unregister; Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ffe2d203 |
|
20-Jan-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply So we can check raid56 with: (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) instead of long: (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ce93ec54 |
|
17-Nov-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: track dirty block groups on their own list Currently any time we try to update the block groups on disk we will walk _all_ block groups and check for the ->dirty flag to see if it is set. This function can get called several times during a commit. So if you have several terabytes of data you will be a very sad panda as we will loop through _all_ of the block groups several times, which makes the commit take a while which slows down the rest of the file system operations. This patch introduces a dirty list for the block groups that we get added to when we dirty the block group for the first time. Then we simply update any block groups that have been dirtied since the last time we called btrfs_write_dirty_block_groups. This allows us to clean up how we write the free space cache out so it is much cleaner. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e7070be1 |
|
16-Dec-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: change how we track dirty roots I've been overloading root->dirty_list to keep track of dirty roots and which roots need to have their commit roots switched at transaction commit time. This could cause us to lose an update to the root which could corrupt the file system. To fix this use a state bit to know if the root is dirty, and if it isn't set we go ahead and move the root to the dirty list. This way if we re-dirty the root after adding it to the switch_commit list we make sure to update it. This also makes it so that the extent root is always the last root on the dirty list to try and keep the amount of churn down at this point in the commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
75c68e9f |
|
16-Jan-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race deleting block group from space_info->ro_bgs list When removing a block group we were deleting it from its space_info's ro_bgs list without the correct protection - the space info's spinlock. Fix this by doing the list delete while holding the spinlock of the corresponding space info, which is the correct lock for any operation on that list. This issue was introduced in the 3.19 kernel by the following change: Btrfs: move read only block groups onto their own list V2 commit 633c0aad4c0243a506a3e8590551085ad78af82d I ran into a kernel crash while a task was running statfs, which iterates the space_info->ro_bgs list while holding the space info's spinlock, and another task was deleting it from the same list, without holding that spinlock, as part of the block group remove operation (while running the function btrfs_remove_block_group). This happened often when running the stress test xfstests/generic/038 I recently made. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1edb647b |
|
08-Dec-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove non-sense btrfs_error_discard_extent() function It doesn't do anything special, it just calls btrfs_discard_extent(), so just remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
04216820 |
|
27-Nov-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between fs trimming and block group remove/allocation Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4f69cb98 |
|
26-Nov-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix crash caused by block group removal If we remove a block group (because it became empty), we might have left a caching_ctl structure in fs_info->caching_block_groups that points to the block group and is accessed at transaction commit time. This results in accessing an invalid or incorrect block group. This issue became visible after Josef's patch "Btrfs: remove empty block groups automatically". So if the block group is removed make sure we don't leave a dangling caching_ctl in caching_block_groups. Sample crash trace: [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8 [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060 [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs] [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1 [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000 [58380.443238] RIP: 0010:[<ffffffffa03f6d05>] [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs] [58380.443238] RSP: 0018:ffff88013896fdd8 EFLAGS: 00010283 [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000 [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8 [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60 [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000 [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00 [58380.443238] FS: 0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000 [58380.443238] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0 [58380.443238] Stack: [58380.443238] ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800 [58380.443238] ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000 [58380.443238] ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090 [58380.443238] Call Trace: [58380.443238] [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs] [58380.443238] [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs] [58380.443238] [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs] [58380.443238] [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [58380.443238] [<ffffffff8105966b>] kthread+0xb7/0xbf [58380.443238] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67 [58380.443238] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0 [58380.443238] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4245215d |
|
25-Nov-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 The commit c404e0dc (Btrfs: fix use-after-free in the finishing procedure of the device replace) fixed a use-after-free problem which happened when removing the source device at the end of device replace, but at that time, btrfs didn't support device replace on raid56, so we didn't fix the problem on the raid56 profile. Currently, we implemented device replace for raid56, so we need kick that problem out before we enable that function for raid56. The fix method is very simple, we just increase the bio per-cpu counter before we submit a raid56 io, and decrease the counter when the raid56 io ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
9ea24bbe |
|
29-Oct-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix snapshot inconsistency after a file write followed by truncate If right after starting the snapshot creation ioctl we perform a write against a file followed by a truncate, with both operations increasing the file's size, we can get a snapshot tree that reflects a state of the source subvolume's tree where the file truncation happened but the write operation didn't. This leaves a gap between 2 file extent items of the inode, which makes btrfs' fsck complain about it. For example, if we perform the following file operations: $ mkfs.btrfs -f /dev/vdd $ mount /dev/vdd /mnt $ xfs_io -f \ -c "pwrite -S 0xaa -b 32K 0 32K" \ -c "fsync" \ -c "pwrite -S 0xbb -b 32770 16K 32770" \ -c "truncate 90123" \ /mnt/foobar and the snapshot creation ioctl was just called before the second write, we often can get the following inode items in the snapshot's btree: item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160 inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0 item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20 inode ref index 282 namelen 10 name: foobar item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53 extent data disk byte 1104855040 nr 32768 extent data offset 0 nr 32768 ram 32768 extent compression 0 item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53 extent data disk byte 0 nr 0 extent data offset 0 nr 40960 ram 40960 extent compression 0 There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[ for which there's no file extent item covering it. This is because the file write and file truncate operations happened both right after the snapshot creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the ordered extent that matches the write and, in btrfs_setsize(), we were able to call btrfs_cont_expand() before being able to commit the current transaction in the snapshot creation ioctl. So this made it possibe to insert the hole file extent item in the source subvolume (which represents the region added by the truncate) right before the transaction commit from the snapshot creation ioctl. Btrfs' fsck tool complains about such cases with a message like the following: "root 331 inode 257 errors 100, file extent discount" >From a user perspective, the expectation when a snapshot is created while those file operations are being performed is that the snapshot will have a file that either: 1) is empty 2) only the first write was captured 3) only the 2 writes were captured 4) both writes and the truncation were captured But never capture a state where only the first write and the truncation were captured (since the second write was performed before the truncation). A test case for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b38ef71c |
|
13-Nov-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: ensure ordered extent errors aren't missed on fsync When doing a fsync with a fast path we have a time window where we can miss the fact that writeback of some file data failed, and therefore we endup returning success (0) from fsync when we should return an error. The steps that lead to this are the following: 1) We start all ordered extents by calling filemap_fdatawrite_range(); 2) We do some other work like locking the inode's i_mutex, start a transaction, start a log transaction, etc; 3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the ordered extents from inode's ordered tree into a list; 4) But by the time we do ordered extent collection, some ordered extents we started at step 1) might have already completed with an error, and therefore we didn't found them in the ordered tree and had no idea they finished with an error. This makes our fsync return success (0) to userspace, but has no bad effects on the log like for example insertion of file extent items into the log that point to unwritten extents, because the invalid extent maps were removed before the ordered extent completed (in inode.c:btrfs_finish_ordered_io). So after collecting the ordered extents just check if the inode's i_mapping has any error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever writeback fails for a page of an ordered extent, we call mapping_set_error (done in extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage) that sets one of those error flags in the inode's i_mapping flags. This change also has the side effect of fixing the issue where for fast fsyncs we never checked/cleared the error flags from the inode's i_mapping flags, which means that a full fsync performed after a fast fsync could get such errors that belonged to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which calls filemap_fdatawait_range(), and the later checks for and clears those flags, while for fast fsyncs we never call filemap_fdatawait_range() or anything else that checks for and clears the error flags from the inode's i_mapping. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5f5bc6b1 |
|
09-Nov-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: make xattr replace operations atomic Replacing a xattr consists of doing a lookup for its existing value, delete the current value from the respective leaf, release the search path and then finally insert the new value. This leaves a time window where readers (getxattr, listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs, so this has security implications. This change also fixes 2 other existing issues which were: *) Deleting the old xattr value without verifying first if the new xattr will fit in the existing leaf item (in case multiple xattrs are packed in the same item due to name hash collision); *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't exist but we have have an existing item that packs muliple xattrs with the same name hash as the input xattr. In this case we should return ENOSPC. A test case for xfstests follows soon. Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace implementation. Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
633c0aad |
|
31-Oct-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: move read only block groups onto their own list V2 Our gluster boxes were spending lots of time in statfs because our fs'es are huge. The problem is statfs loops through all of the block groups looking for read only block groups, and when you have several terabytes worth of data that ends up being a lot of block groups. Move the read only block groups onto a read only list and only proces that list in btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
728404da |
|
10-Oct-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: add helper btrfs_fdatawrite_range To avoid duplicating this double filemap_fdatawrite_range() call for inodes with async extents (compressed writes) so often. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d51033d0 |
|
12-Nov-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: introduce pending action: commit In some contexts, like in sysfs handlers, we don't want to trigger a transaction commit. It's a heavy operation, we don't know what external locks may be taken. Instead, make it possible to finish the operation through sync syscall or SYNC_FS ioctl. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
7e1876ac |
|
05-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: switch inode_cache option handling to pending changes The pending mount option(s) now share namespace and bits with the normal options, and the existing one for (inode_cache) is unset unconditionally at each transaction commit. Introduce a separate namespace for pending changes and enhance the descriptions of the intended change to use separate bits for each action. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
572d9ab7 |
|
05-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: add support for processing pending changes There are some actions that modify global filesystem state but cannot be performed at the time of request, but later at the transaction commit time when the filesystem is in a known state. For example enabling new incompat features on-the-fly or issuing transaction commit from unsafe contexts (sysfs handlers). Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
1a4ed8fd |
|
27-Oct-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it. Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0d4cf4e6 |
|
07-Oct-2014 |
Chris Mason <clm@fb.com> |
Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off Commit fccb84c94 moved added some helpers to cleanup our sanity tests, but it looks like both Dave and I always compile with the tests enabled. This fixes things to work when they are turned off too. Signed-off-by: Chris Mason <clm@fb.com>
|
#
f667aef6 |
|
22-Sep-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Make btrfs handle security mount options internally to avoid losing security label. [BUG] Originally when mount btrfs with "-o subvol=" mount option, btrfs will lose all security lable. And if the btrfs fs is mounted somewhere else, due to the lost of security lable, SELinux will refuse to mount since the same super block is being mounted using different security lable. [REPRODUCER] With SELinux enabled: #mkfs -t btrfs /dev/sda5 #mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs #btrfs subvolume create /mnt/btrfs/subvol #mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/test kernel message: SELinux: mount invalid. Same superblock, different security settings for (dev sda5, type btrfs) [REASON] This happens because btrfs will call vfs_kern_mount() and then mount_subtree() to handle subvolume name lookup. First mount will cut off all the security lables and when it comes to the second vfs_kern_mount(), it has no security label now. [FIX] This patch will makes btrfs behavior much more like nfs, which has the type flag FS_BINARY_MOUNTDATA, making btrfs handles the security label internally. So security label will be set in the real mount time and won't lose label when use with "subvol=" mount option. Reported-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
15b636e1 |
|
25-Sep-2014 |
Fabian Frederick <fabf@skynet.be> |
Btrfs: remove redundant btrfs_verify_qgroup_counts declaration. Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS and keep prototype in qgroup.h only Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fccb84c9 |
|
29-Sep-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: move checks for DUMMY_ROOT into a helper Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
7ec20afb |
|
24-Jul-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: new define for the inline extent data start Use a common definition for the inline data start so we don't have to open-code it and introduce bugs like "Btrfs: fix wrong max inline data size limit" fixed. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
95ac567a |
|
08-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: set default max_inline to 8KiB instead of 8MiB 8MiB is way too large and likely set by mistake. This is not a significant issue as in practice the max amount of data added to an inline extent is also limited by the page cache and btree leaf sizes. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
4d75f8a9 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove blocksize from btrfs_alloc_free_block and rename Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free + _tree_block family. The parameter blocksize was set to the metadata block size, directly or indirectly. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
47ab2a6c |
|
18-Sep-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: remove empty block groups automatically One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything. This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting or on mount notice that a block group has a used count of 0 then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8b110e39 |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: implement repair function when direct read fails This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
23ea8e5a |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: load checksum data once when submitting a direct read io The current code would load checksum data for several times when we split a whole direct read io because of the limit of the raid stripe, it would make us search the csum tree for several times. In fact, it just wasted time, and made the contention of the csum tree root be more serious. This patch improves this problem by loading the data at once. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f87c4318 |
|
20-Aug-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove stale define after removing ordered operations Last user removed in commit "btrfs: disable strict file flushes for renames and truncates" (8d875f95da43c6a8f18f77869f2ef26e9594fecc). Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c01a5c07 |
|
16-Jul-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: fix wrong max inline data size limit inline data is stored from offset of @disk_bytenr in struct btrfs_file_extent_item. So substracting total size of struct btrfs_file_extent_item is wrong, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
707e8a07 |
|
04-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: use nodesize everywhere, kill leafsize The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
57cdc8db |
|
04-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup ino cache members of btrfs_root The naming is confusing, generic yet used for a specific cache. Add a prefix 'ino_' or rename appropriately. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e339a6b0 |
|
02-Jul-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: __btrfs_mod_ref should always use no_quota Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't understand how snapshot delete was using it and assumed that we needed the quota operations there. With Mark's work this has turned out to be not the case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the argument and make __btrfs_mod_ref call it's process function with no_quota set always. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e570fd27 |
|
18-Jun-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix broken free space cache after the system crashed When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group xxxx has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group xxx It is because we didn't update the metadata of the allocated space (in extent tree) until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time (commit transaction), those spaces were lost. In fact, only the free space that is used to store the file data had this problem, the others didn't because the metadata of them is updated in the same transaction context. There are many methods which can fix the above problem - track the allocated space, and write it out when we write out the free space cache - account the size of the allocated space that is used to store the file data, if the size is not zero, don't write out the free space cache. The first one is complex and may make the performance drop down. This patch chose the second method, we use a per-block-group variant to account the size of that allocated space. Besides that, we also introduce a per-block-group read-write semaphore to avoid the race between the allocation and the free space cache write out. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7ffbb598 |
|
08-Jun-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: make fsync work after cloning into a file When cloning into a file, we were correctly replacing the extent items in the target range and removing the extent maps. However we weren't replacing the extent maps with new ones that point to the new extents - as a consequence, an incremental fsync (when the inode doesn't have the full sync flag) was a NOOP, since it relies on the existence of extent maps in the modified list of the inode's extent map tree, which was empty. Therefore add new extent maps to reflect the target clone range. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c1895442 |
|
26-May-2014 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: allocate raid type kobjects dynamically We are currently allocating space_info objects in an array when we allocate space_info. When a user does something like: # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt # btrfs balance start -mconvert=single -dconvert=single /mnt -f # btrfs balance start -mconvert=raid1 -dconvert=raid1 / We can end up with memory corruption since the kobject hasn't been reinitialized properly and the name pointer was left set. The rationale behind allocating them statically was to avoid creating a separate kobject container that just contained the raid type. It used the index in the array to determine the index. Ultimately, though, this wastes more memory than it saves in all but the most complex scenarios and introduces kobject lifetime questions. This patch allocates the kobjects dynamically instead. Note that we also remove the kobject_get/put of the parent kobject since kobject_add and kobject_del do that internally. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
a79b7d4b |
|
22-May-2014 |
Chris Mason <clm@fb.com> |
Btrfs: async delayed refs Delayed extent operations are triggered during transaction commits. The goal is to queue up a healthly batch of changes to the extent allocation tree and run through them in bulk. This farms them off to async helper threads. The goal is to have the bulk of the delayed operations being done in the background, but this is also important to limit our stack footprint. Signed-off-by: Chris Mason <clm@fb.com>
|
#
faa2dbf0 |
|
07-May-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add sanity tests for new qgroup accounting code This exercises the various parts of the new qgroup accounting code. We do some basic stuff and do some things with the shared refs to make sure all that code works. I had to add a bunch of infrastructure because I needed to be able to insert items into a fake tree without having to do all the hard work myself, hopefully this will be usefull in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fcebe456 |
|
13-May-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: rework qgroup accounting Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
27cdeb70 |
|
02-Apr-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
21c7e756 |
|
13-May-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: reclaim the reserved metadata space at background Before applying this patch, the task had to reclaim the metadata space by itself if the metadata space was not enough. And When the task started the space reclamation, all the other tasks which wanted to reserve the metadata space were blocked. At some cases, they would be blocked for a long time, it made the performance fluctuate wildly. So we introduce the background metadata space reclamation, when the space is about to be exhausted, we insert a reclaim work into the workqueue, the worker of the workqueue helps us to reclaim the reserved space at the background. By this way, the tasks needn't reclaim the space by themselves at most cases, and even if the tasks have to reclaim the space or are blocked for the space reclamation, they will get enough space more quickly. Here is my test result(Tested by compilebench): Memory: 2GB CPU: 2Cores * 1CPU Partition: 40GB(SSD) Test command: # compilebench -D <mnt> -m Without this patch: intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s) compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s) read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s) delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s) With this patch: intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s) compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s) read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s) delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s) Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
521e0546 |
|
15-Apr-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: protect snapshots from deleting during send The patch "Btrfs: fix protection between send and root deletion" (18f687d538449373c37c) does not actually prevent to delete the snapshot and just takes care during background cleaning, but this seems rather user unfriendly, this patch implements the idea presented in http://www.spinics.net/lists/linux-btrfs/msg30813.html - add an internal root_item flag to denote a dead root - check if the send_in_progress is set and refuse to delete, otherwise set the flag and proceed - check the flag in send similar to the btrfs_root_readonly checks, for all involved roots The root lookup in send via btrfs_read_fs_root_no_name will check if the root is really dead or not. If it is, ENOENT, aborted send. If it's alive, it's protected by send_in_progress, send can continue. CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7d824b6f |
|
07-May-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: balance filter: add limit of processed chunks This started as debugging helper, to watch the effects of converting between raid levels on multiple devices, but could be useful standalone. In my case the usage filter was not finegrained enough and led to converting too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow to work with a specific chunk or even with a chunk on a given device. The limit filter applies last, the value of 0 means no limiting. CC: Ilya Dryomov <idryomov@gmail.com> CC: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9d89ce65 |
|
23-Apr-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: move btrfs_{set,clear}_and_info() to ctree.h Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
36523e95 |
|
07-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: export global block reserve size as space_info Introduce a block group type bit for a global reserve and fill the space info for SPACE_INFO ioctl. This should replace the newly added ioctl (01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part of the global reserve, while the actual usage can be now visible in the 'btrfs fi df' output during ENOSPC stress. The unpatched userspace tools will show the blockgroup as 'unknown'. CC: Jeff Mahoney <jeffm@suse.com> CC: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3f8a18cc |
|
28-Mar-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: hold the commit_root_sem when getting the commit root during send We currently rely too heavily on roots being read-only to save us from just accessing root->commit_root. We can easily balance blocks out from underneath a read only root, so to save us from getting screwed make sure we only access root->commit_root under the commit root sem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9e351cc8 |
|
13-Mar-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: remove transaction from send Lets try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks, Reportedy-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
573bfb72 |
|
05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix possible empty list access when flushing the delalloc inodes We didn't have a lock to protect the access to the delalloc inodes list, that is we might access a empty delalloc inodes list if someone start flushing delalloc inodes because the delalloc inodes were moved into a other list temporarily. Fix it by wrapping the access with a lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
31f3d255 |
|
05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: split the global ordered extents mutex When we create a snapshot, we just need wait the ordered extents in the source fs/file root, but because we use the global mutex to protect this ordered extents list of the source fs/file root to avoid accessing a empty list, if someone got the mutex to access the ordered extents list of the other fs/file root, we had to wait. This patch splits the above global mutex, now every fs/file root has its own mutex to protect its own list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
6c255e67 |
|
05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock We needn't flush all delalloc inodes when we doesn't get s_umount lock, or we would make the tasks wait for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
8257b2dc |
|
05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume If the snapshot creation happened after the nocow write but before the dirty data flush, we would fail to flush the dirty data because of no space. So we must keep track of when those nocow write operations start and when they end, if there are nocow writers, the snapshot creators must wait. In order to implement this function, I introduce btrfs_{start, end}_nocow_write(), which is similar to mnt_{want,drop}_write(). These two functions are only used for nocow file write operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
d458b054 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the "_struct" suffix in btrfs_workequeue Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
a046e9c8 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the old btrfs_worker. Since all the btrfs_worker is replaced with the newly created btrfs_workqueue, the old codes can be easily remove. Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
0339ef2f |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue. Replace the fs_info->scrub_* with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
fc97fab0 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue. Replace the fs_info->qgroup_rescan_worker with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
5b3bc44e |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue. Replace the fs_info->delayed_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
dc6e3209 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue. Replace the fs_info->fixup_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
736cfa15 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue. Replace the fs_info->readahead_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
e66f0bb1 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue. Replace the fs_info->cache_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
d05a33ac |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue. Replace the fs_info->rmw_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
fccb5d86 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue. Replace the fs_info->endio_* workqueues with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
a44903ab |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->flush_workers with btrfs_workqueue. Replace the fs_info->submit_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
a8c93d4e |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->submit_workers with btrfs_workqueue. Much like the fs_info->workers, replace the fs_info->submit_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
afe3d242 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue Much like the fs_info->workers, replace the fs_info->delalloc_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
5cdc7ad3 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->workers with btrfs_workqueue. Use the newly created btrfs_workqueue_struct to replace the original fs_info->workers Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
d1433deb |
|
20-Feb-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: just wait or commit our own log sub-transaction We might commit the log sub-transaction which didn't contain the metadata we logged. It was because we didn't record the log transid and just select the current log sub-transaction to commit, but the right one might be committed by the other task already. Actually, we needn't do anything and it is safe that we go back directly in this case. This patch improves the log sync by the above idea. We record the transid of the log sub-transaction in which we log the metadata, and the transid of the log sub-transaction we have committed. If the committed transid is >= the transid we record when logging the metadata, we just go back. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
8b050d35 |
|
20-Feb-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix skipped error handle when log sync failed It is possible that many tasks sync the log tree at the same time, but only one task can do the sync work, the others will wait for it. But those wait tasks didn't get the result of the log sync, and returned 0 when they ended the wait. It caused those tasks skipped the error handle, and the serious problem was they told the users the file sync succeeded but in fact they failed. This patch fixes this problem by introducing a log context structure, we insert it into the a global list. When the sync fails, we will set the error number of every log context in the list, then the waiting tasks get the error number of the log context and handle the error if need. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
bb14a59b |
|
20-Feb-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use signed integer instead of unsigned long integer for log transid The log trans id is initialized to be 0 every time we create a log tree, and the log tree need be re-created after a new transaction is started, it means the log trans id is unlikely to be a huge number, so we can use signed integer instead of unsigned long integer to save a bit space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
c404e0dc |
|
30-Jan-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix use-after-free in the finishing procedure of the device replace During device replace test, we hit a null pointer deference (It was very easy to reproduce it by running xfstests' btrfs/011 on the devices with the virtio scsi driver). There were two bugs that caused this problem: - We might allocate new chunks on the replaced device after we updated the mapping tree. And we forgot to replace the source device in those mapping of the new chunks. - We might get the mapping information which including the source device before the mapping information update. And then submit the bio which was based on that mapping information after we freed the source device. For the first bug, we can fix it by doing mapping tree update and source device remove in the same context of the chunk mutex. The chunk mutex is used to protect the allocable device list, the above method can avoid the new chunk allocation, and after we remove the source device, all the new chunks will be allocated on the new device. So it can fix the first bug. For the second bug, we need make sure all flighting bios are finished and no new bios are produced during we are removing the source device. To fix this problem, we introduced a global @bio_counter, we not only inc/dec @bio_counter outsize of map_blocks, but also inc it before submitting bio and dec @bio_counter when ending bios. Since Raid56 is a little different and device replace dosen't support raid56 yet, it is not addressed in the patch and I add comments to make sure we will fix it in the future. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
514ac8ad |
|
03-Jan-2014 |
Chris Mason <clm@fb.com> |
Btrfs: don't use ram_bytes for uncompressed inline items If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect the new size. The fixe uses the size directly from the item header when reading uncompressed inlines, and also fixes truncate to update the size as it goes. Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org
|
#
26b47ff6 |
|
15-Jan-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss It is better that the position of the lock is close to the data which is protected by it, because they may be in the same cache line, we will load less cache lines when we access them. So we rearrange the members' position of btrfs_space_info structure to make the lock be closer to the its data. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3818aea2 |
|
12-Jan-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Add noinode_cache mount option Add noinode_cache mount option for btrfs. Since inode map cache involves all the btrfs_find_free_ino/return_ino things and if just trigger the mount_opt, an inode number get from inode map cache will not returned to inode map cache. To keep the find and return inode both in the same behavior, a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea. CHANGE_INODE_CACHE is set/cleared in remounting, and the original INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a success transaction. Since find/return inode is all done between btrfs_start_transaction and btrfs_commit_transaction, this will keep consistent behavior. Also noinode_cache mount option will not stop the caching_kthread. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ade2e0b3 |
|
12-Jan-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: fix to search previous metadata extent item since skinny metadata There is a bug that using btrfs_previous_item() to search metadata extent item. This is because in btrfs_previous_item(), we need type match, however, since skinny metada was introduced by josef, we may mix this two types. So just use btrfs_previous_item() is not working right. To keep btrfs_previous_item() like normal tree search, i introduce another function btrfs_previous_extent_item(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0a2b2a84 |
|
23-Jan-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: throttle delayed refs better On one of our gluster clusters we noticed some pretty big lag spikes. This turned out to be because our transaction commit was taking like 3 minutes to complete. This is because we have like 30 gigs of metadata, so our global reserve would end up being the max which is like 512 mb. So our throttling code would allow a ridiculous amount of delayed refs to build up and then they'd all get run at transaction commit time, and for a cold mounted file system that could take up to 3 minutes to run. So fix the throttling to be based on both the size of the global reserve and how long it takes us to run delayed refs. This patch tracks the time it takes to run delayed refs and then only allows 1 seconds worth of outstanding delayed refs at a time. This way it will auto-tune itself from cold cache up to when everything is in memory and it no longer has to go to disk. This makes our transaction commits take much less time to run. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
63541927 |
|
07-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: add support for inode properties This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs." Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree). This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property. The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature. Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to: Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took. Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. The unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it. Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space. Test 4 - same as test 3 but with 10 properties per file. Results (in seconds, and averages of 5 runs each), for different N numbers of files follow. * Without properties (test 1) file creation time ls -lha time 10 000 files 3.49 0.76 100 000 files 47.19 8.37 1 000 000 files 518.51 107.06 * With 1 property (compression property set to lzo - test 2) file creation time ls -lha time 10 000 files 3.63 0.93 100 000 files 48.56 9.74 1 000 000 files 537.72 125.11 * With 4 properties (test 3) file creation time ls -lha time 10 000 files 3.94 1.20 100 000 files 52.14 11.48 1 000 000 files 572.70 142.13 * With 10 properties (test 4) file creation time ls -lha time 10 000 files 4.61 1.35 100 000 files 58.86 13.83 1 000 000 files 656.01 177.61 The increased latencies with properties are essencialy because of: *) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir intex items (delayed-inode.c), which could help reduce the file creation latency; *) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type. Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same. Test script: $ cat test.pl #!/usr/bin/perl -w use strict; use Time::HiRes qw(time); use constant NUM_FILES => 10_000; use constant FILE_SIZES => (64 * 1024); use constant DEV => '/dev/sdb4'; use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev'; use constant TEST_DIR => (MNT_POINT . '/testdir'); system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!"; # following line for testing without properties #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!"; # following 2 lines for testing with properties system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!"; system("mkdir", TEST_DIR) == 0 or die "mkdir failed!"; my ($t1, $t2); $t1 = time(); for (my $i = 1; $i <= NUM_FILES; $i++) { my $p = TEST_DIR . '/file_' . $i; open(my $f, '>', $p) or die "Error opening file!"; $f->autoflush(1); for (my $j = 0; $j < FILE_SIZES; $j += 4096) { print $f ('A' x 4096) or die "Error writing to file!"; } close($f); } $t2 = time(); print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; $t1 = time(); system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!"; $t2 = time(); print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1acae57b |
|
07-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: faster file extent item replace operations When writing to a file we drop existing file extent items that cover the write range and then add a new file extent item that represents that write range. Before this change we were doing a tree lookup to remove the file extent items, and then after we did another tree lookup to insert the new file extent item. Most of the time all the file extent items we need to drop are located within a single leaf - this is the leaf where our new file extent item ends up at. Therefore, in this common case just combine these 2 operations into a single one. By avoiding the second btree navigation for insertion of the new file extent item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf COW operations, CPU time on btree node/leaf key binary searches, etc. Besides for file writes, this is an operation that happens for file fsync's as well. However log btrees are much less likely to big as big as regular fs btrees, therefore the impact of this change is smaller. The following benchmark was performed against an SSD drive and a HDD drive, both for random and sequential writes: sysbench --test=fileio --file-num=4096 --file-total-size=8G \ --file-test-mode=[rndwr|seqwr] --num-threads=512 \ --file-block-size=8192 \ --max-requests=1000000 \ --file-fsync-freq=0 --file-io-mode=sync [prepare|run] All results below are averages of 10 runs of the respective test. ** SSD sequential writes Before this change: 225.88 Mb/sec After this change: 277.26 Mb/sec ** SSD random writes Before this change: 49.91 Mb/sec After this change: 56.39 Mb/sec ** HDD sequential writes Before this change: 68.53 Mb/sec After this change: 69.87 Mb/sec ** HDD random writes Before this change: 13.04 Mb/sec After this change: 14.39 Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
efe120a0 |
|
20-Dec-2013 |
Frank Holton <fholton@gmail.com> |
Btrfs: convert printk to btrfs_ and fix BTRFS prefix Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2c686537 |
|
16-Dec-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: Check read-only status of roots during send All the subvolues that are involved in send must be read-only during the whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the status to read-write and the result of send stream is undefined if the data change unexpectedly. Fix that by adding a refcount for all involved roots and verify that there's no send in progress during SUBVOL_SETFLAGS ioctl call that does read-only -> read-write transition. We need refcounts because there are no restrictions on number of send parallel operations currently run on a single subvolume, be it source, parent or one of the multiple clone sources. Kernel is silent when the RO checks fail and returns EPERM. The same set of checks is done already in userspace before send starts. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e223cfcd |
|
13-Dec-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct It's not used anywhere, so just drop it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f28491e0 |
|
16-Dec-2013 |
Josef Bacik <jbacik@fb.com> |
Btrfs: move the extent buffer radix tree into the fs_info I need to create a fake tree to test qgroups and I don't want to have to setup a fake btree_inode. The fact is we only use the radix tree for the fs_info, so everybody else who allocates an extent_io_tree is just wasting the space anyway. This patch moves the radix tree and its lock into btrfs_fs_info so there is less stuff I have to fake to do qgroup sanity tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
27a0dd61 |
|
12-Nov-2013 |
Frank Holton <fholton@gmail.com> |
Btrfs: make btrfs_debug match pr_debug handling related to DEBUG The kernel macro pr_debug is defined as a empty statement when DEBUG is not defined. Make btrfs_debug match pr_debug to avoid spamming the kernel log with debug messages Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
33b98f22 |
|
13-Nov-2013 |
Sergei Trofimovich <slyfox@gentoo.org> |
btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index' Found by uselex.rb: > btrfs_get_inode_ref_index: [R]: exported from: fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: David Stebra <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e33d5c3d |
|
04-Nov-2013 |
Kelley Nielsen <kelleynnn@gmail.com> |
btrfs: bootstrap generic btrfs_find_item interface There are many btrfs functions that manually search the tree for an item. They all reimplement the same mechanism and differ in the conditions that they use to find the item. __inode_info() is one such example. Zach Brown proposed creating a new interface to take the place of these functions. This patch is the first step to creating the interface. A new function, btrfs_find_item, has been added to ctree.c and prototyped in ctree.h. It is identical to __inode_info, except that the order of the parameters has been rearranged to more closely those of similar functions elsewhere in the code (now, root and path come first, then the objectid, offset and type, and the key to be filled in last). __inode_info's callers have been set to call this new function instead, and __inode_info itself has been removed. Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
29e5be24 |
|
01-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: publish device membership in sysfs Now that we have the infrastructure for per-super attributes, we can publish device membership in /sys/fs/btrfs/<fsid>/devices. The information is published as symlinks to the block devices. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6ab0a202 |
|
01-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: publish allocation data in sysfs While trying to debug ENOSPC issues, it's helpful to understand what the kernel's view of the available space is. We export this information via ioctl, but sysfs files are more easily used. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5ac1d209 |
|
01-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: publish per-super attributes in sysfs This patch adds per-super attributes to sysfs. It doesn't publish any attributes yet, but does the proper lifetime handling as well as the basic infrastructure to add new attributes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2eaa055f |
|
15-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: add ioctls to query/change feature bits online There are some feature bits that require no offline setup and can be enabled online. I've only reviewed extended irefs, but there will probably be more. We introduce three new ioctls: - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features. - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs basis, as well as querying for which features are changeable with mounted. - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis. We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that allow us to define which features are safe to change at runtime. The failure modes for BTRFS_IOC_SET_FEATURES are as follows: - Enabling a completely unsupported feature: warns and returns -ENOTSUPP - Enabling a feature that can only be done offline: warns and returns -EPERM Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e20d6c5b |
|
13-Nov-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix check-integrity to look at the referenced data properly We were looking at file_extent_num_bytes unconditionally when looking at referenced data bytes, but this isn't correct for compression. Fix this by checking the compression of the file extent we are and setting num_bytes to disk_num_bytes in the case of compression so that we are marking the proper bytes as referenced. This fixes check_int_data freaking out when running btrfs/004. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
16e7549f |
|
21-Oct-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: incompatible format change to remove hole extents Btrfs has always had these filler extent data items for holes in inodes. This has made somethings very easy, like logging hole punches and sending hole punches. However for large holey files these extent data items are pure overhead. So add an incompatible feature to no longer add hole extents to reduce the amount of metadata used by these sort of files. This has a few changes for logging and send obviously since they will need to detect holes and log/send the holes if there are any. I've tested this thoroughly with xfstests and it doesn't cause any issues with and without the incompat format set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
996a710d |
|
20-Dec-2013 |
Christoph Hellwig <hch@infradead.org> |
btrfs: use generic posix ACL infrastructure Also don't bother to set up a .get_acl method for symlinks as we do not support access control (ACLs or even mode bits) for symlinks in Linux. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9650e05c |
|
12-Nov-2013 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: remove dead codes from ctree.h These two functions are only stated but undefined. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
54563d41 |
|
01-Sep-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: get rid of fdentry() 3 of 4 callers actually want file_inode()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
91aef86f |
|
04-Nov-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: rename btrfs_start_all_delalloc_inodes rename the function -- btrfs_start_all_delalloc_inodes(), and make its name be compatible to btrfs_wait_ordered_roots(), since they are always used at the same place. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9b011adf |
|
25-Oct-2013 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: remove scrub_super_lock holding in btrfs_sync_log() Originally, we introduced scrub_super_lock to synchronize tree log code with scrubbing super. However we can replace scrub_super_lock with device_list_mutex, because writing super will hold this mutex, this will reduce an extra lock holding when writing supers in sync log code. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
aaedb55b |
|
11-Oct-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add tests for btrfs_get_extent I'm going to be removing hole extents in the near future so I wanted to make a sanity test for btrfs_get_extent to make sure I don't break anything in the meantime. This patch just puts btrfs_get_extent through its paces by giving it a completely unreasonable mapping to look at and make sure it is giving us back maps that make sense. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
294e30fe |
|
08-Oct-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add tests for find_lock_delalloc_range So both Liu and I made huge messes of find_lock_delalloc_range trying to fix stuff, me first by fixing extent size, then him by fixing something I broke and then me again telling him to fix it a different way. So this is obviously a candidate for some testing. This patch adds a pseudo fs so we can allocate fake inodes for tests that need an inode or pages. Then it addes a bunch of tests to make sure find_lock_delalloc_range is acting the way it is supposed to. With this patch and all of our previous patches to find_lock_delalloc_range I am sure it is working as expected now. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6174d3cb |
|
01-Oct-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: remove unused max_key arg from btrfs_search_forward It is not used for anything. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0a4e5586 |
|
24-Sep-2013 |
Ross Kirk <ross.kirk@gmail.com> |
btrfs: remove unused parameter from btrfs_header_fsid Remove unused parameter, 'eb'. Unused since introduction in 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee Updated to be rebased against current upstream and correct diff supplied this time! Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
06ea65a3 |
|
19-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add a sanity test for btrfs_split_item While looking at somebodys corruption I became completely convinced that btrfs_split_item was broken, so I wrote this test to verify that it was working as it was supposed to. Thankfully it appears to be working as intended, so just add this test to make sure nobody breaks it in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
dd3cc16b |
|
16-Sep-2013 |
Ross Kirk <ross.kirk@gmail.com> |
btrfs: drop unused parameter from btrfs_item_nr Remove unused eb parameter from btrfs_item_nr Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
363e4d35 |
|
17-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: remove space_info->reservation_progress This isn't used for anything anymore, just remove it. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c4fbb430 |
|
17-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix worst case calculator for space usage Forever ago I made the worst case calculator say that we could potentially split into 3 blocks for every level on the way down, which isn't right. If we split we're only going to get two new blocks, the one we originally cow'ed and the new one we're going to split. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
83d4cfd4 |
|
30-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fixup error handling in btrfs_reloc_cow If we failed to actually allocate the correct size of the extent to relocate we will end up in an infinite loop because we won't return an error, we'll just move on to the next extent. So fix this up by returning an error, and then fix all the callers to return an error up the stack rather than BUG_ON()'ing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d8f98039 |
|
24-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix memory leak of uuid_root in free_fs_info Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
2e17c7c6 |
|
26-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add support for asserts One of the complaints we get a lot is how many BUG_ON()'s we have. So to help with this I'm introducing a kconfig option to enable/disable a new ASSERT() mechanism much like what XFS does. This will allow us developers to still get our nice panics but allow users/distros to compile them out. With this we can go through and convert any BUG_ON()'s that we have to catch actual programming mistakes to the new ASSERT() and then fix everybody else to return errors. This will also allow developers to leave sanity checks in their new code to make sure we don't trip over problems while testing stuff and vetting new features. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b308bc2f |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
fba6aa75 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_header_fsid() return unsigned long Internally, btrfs_header_fsid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
231e88f4 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1473b24e |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_device_fsid() return unsigned long All callers of btrfs_device_fsid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
410ba3a2 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_device_uuid() return unsigned long All callers of btrfs_device_uuid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6e71c47a |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make BTRFS_DEV_REPLACE_DEVID an unsigned long long constant The internal btrfs device id is a u64, hence make the constant BTRFS_DEV_REPLACE_DEVID "unsigned long long" as well, so we no longer need a cast to print it. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f420ee1e |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add mount option to force UUID tree checking This should never be needed, but since all functions are there to check and rebuild the UUID tree, a mount option is added that allows to force this check and rebuild procedure. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
70f80175 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: check UUID tree during mount if required If the filesystem was mounted with an old kernel that was not aware of the UUID tree, this is detected by looking at the uuid_tree_generation field of the superblock (similar to how the free space cache is doing it). If a mismatch is detected at mount time, a thread is started that does two things: 1. Iterate through the UUID tree, check each entry, delete those entries that are not valid anymore (i.e., the subvol does not exist anymore or the value changed). 2. Iterate through the root tree, for each found subvolume, add the UUID tree entries for the subvolume (if they are not already there). This mechanism is also used to handle and repair errors that happened during the initial creation and filling of the tree. The update of the uuid_tree_generation field (which indicates that the state of the UUID tree is up to date) is blocked until all create and repair operations are successfully completed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
26432799 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: introduce uuid-tree-gen field In order to be able to detect the case that a filesystem is mounted with an old kernel, add a uuid-tree-gen field like the free space cache is doing it. It is part of the super block and written with each commit. Old kernels do not know this field and don't update it. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
803b2f54 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: fill UUID tree initially When the UUID tree is initially created, a task is spawned that walks through the root tree. For each found subvolume root_item, the uuid and received_uuid entries in the UUID tree are added. This is such a quick operation so that in case somebody wants to unmount the filesystem while the task is still running, the unmount is delayed until the UUID tree building task is finished. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
dd5f9615 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: maintain subvolume items in the UUID tree When a new subvolume or snapshot is created, a new UUID item is added to the UUID tree. Such items are removed when the subvolume is deleted. The ioctl to set the received subvolume UUID is also touched and will now also add this received UUID into the UUID tree together with the ID of the subvolume. The latter is also done when read-only snapshots are created which inherit all the send/receive information from the parent subvolume. User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and read in the UUID tree. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f7a81ea4 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: create UUID tree if required This tree is not created by mkfs.btrfs. Therefore when a filesystem is mounted writable and the UUID tree does not exist, this tree is created if required. The tree is also added to the fs_info structure and initialized, but this commit does not yet read or write UUID tree elements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
07b30a49 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: introduce a tree for items that map UUIDs to something Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means, that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started. It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool. Therefore kernel code is added with this commit that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve data that is assigned to this UUID, like which subvolume ID is related to this UUID. This commit adds a new tree to hold UUID-to-data mapping items. The key of the items is the full UUID plus the key type BTRFS_UUID_KEY. Multiple data blocks can be stored for a given UUID, a type/length/ value scheme is used. Now follows the lengthy justification, why a new tree was added instead of using the existing root tree: The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge. To use any other existing tree did not look proper. To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That's the justification why a completely new tree was introduced. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
171170c1 |
|
14-Aug-2013 |
Sergei Trofimovich <slyfox@gentoo.org> |
btrfs: mark some local function as 'static' Cc: Josef Bacik <jbacik@fusionio.com> Cc: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
35a3621b |
|
14-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: get rid of sparse warnings make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__ I tried to filter out the warnings for which patches have already been sent to the mailing list, pending for inclusion in btrfs-next. All these changes should be obviously safe. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ba5e8f2e |
|
16-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix send issues related to inode number reuse If you are sending a snapshot and specifying a parent snapshot we will walk the trees and figure out where they differ and send the differences only. The way we check for differences are if the leaves aren't the same and if the keys are not the same within the leaves. So if neither leaf is the same (ie the leaf has been cow'ed from the parent snapshot) we walk each item in the send root and check it against the parent root. If the items match exactly then we don't do anything. This doesn't quite work for inode refs, since they will just have the name and the parent objectid. If you move the file from a directory and then remove that directory and re-create a directory with the same inode number as the old directory and then move that file back into that directory we will assume that nothing changed and you will get errors when you try to receive. In order to fix this we need to do extra checking to see if the inode ref really is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the items match. Then if the key type is an inode ref we can do some extra checking, otherwise we just keep processing. The extra checking is to look up the generation of the directory in the parent volume and compare it to the generation of the send volume. If they match then they are the same directory and we are good to go. If they don't we have to add them to the changed refs list. This means we have to track the generation of the ref we're trying to lookup when we iterate all the refs for a particular inode. So in the case of looking for new refs we have to get the generation from the parent volume, and in the case of looking for deleted refs we have to get the generation from the send volume to compare with. There was also the issue of using a ulist to keep track of the directories we needed to check. Because we can get a deleted ref and a new ref for the same inode number the ulist won't work since it indexes based on the value. So instead just dup any directory ref we find and add it to a local list, and then process that list as normal and do away with using a ulist for this altogether. Before we would fail all of the tests in the far-progs that related to moving directories (test group 32). With this patch we now pass these tests, and all of the tests in the far-progs send testing suite. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
00361589 |
|
14-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: avoid starting a transaction in the write path I noticed while looking at a deadlock that we are always starting a transaction in cow_file_range(). This isn't really needed since we only need a transaction if we are doing an inline extent, or if the allocator needs to allocate a chunk. So push down all the transaction start stuff to be closer to where we actually need a transaction in all of these cases. This will hopefully reduce our write latency when we are committing often. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9ffba8cd |
|
14-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix heavy delalloc related deadlock I added a patch where we started taking the ordered operations mutex when we waited on ordered extents. We need this because we splice the list and process it, so if a flusher came in during this scenario it would think the list was empty and we'd usually get an early ENOSPC. The problem with this is that this lock is used in transaction committing. So we end up with something like this Transaction commit -> wait on writers Delalloc flusher -> run_ordered_operations (holds mutex) ->wait for filemap-flush to do its thing flush task -> cow_file_range ->wait on btrfs_join_transaction because we're commiting some other task -> commit_transaction because we notice trans->transaction->flush is set -> run_ordered_operations (hang on mutex) We need to disentangle the ordered operations flushing from the delalloc flushing, since they are separate things. This solves the deadlock issue I was seeing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
8b87dc17 |
|
01-Aug-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: add mount option to set commit interval I'ts hardcoded to 30 seconds which is fine for most users. Higher values defer data being synced to permanent storage with obvious consequences when the system crashes. The upper bound is not forced, but a warning is printed if it's more than 300 seconds (5 minutes). Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
36cce922 |
|
05-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: handle errors when doing slow caching Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait forever if a drive failed. This is because we just bail out if there is an error while trying to cache a block group, we don't update anybody who may be waiting. So this introduces a new enum for the cache state in case of error and makes everybody bail out if we have an error. Alex tested and verified this patch fixed his problem. This fixes bz 59431. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
facc8a22 |
|
25-Jul-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't cache the csum value into the extent state tree Before applying this patch, we cached the csum value into the extent state tree when reading some data from the disk, this operation increased the lock contention of the state tree. Now, we just store the csum value into the bio structure or other unshared structure, so we can reduce the lock contention. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3cae210f |
|
15-Jul-2013 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert Some codes still use the cpu_to_lexx instead of the BTRFS_SETGET_STACK_FUNCS declared in ctree.h. Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec and other structures. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ee3441b4 |
|
09-Jul-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: fall back to global reservation when removing subvolumes I recently did some ENOSPC testing that involved filling the disk while create and removing snapshots in a loop. During the test cycle, I ran into an ENOSPC when trying to remove a snapshot, leaving the fs stuck in ENOSPC even after a umount/mount cycle. This patch allow subvolume removal to fall back onto the global block reservation in order to succeed when it would have failed otherwise. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
7ee9e440 |
|
21-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: check if we can nocow if we don't have data space We always just try and reserve data space when we write, but if we are out of space but have prealloc'ed extents we should still successfully write. This patch will try and see if we can write to prealloc'ed space and if we can go ahead and allow the write to continue. With this patch we now pass xfstests generic/274. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b150a4f1 |
|
19-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: use a percpu to keep track of possibly pinned bytes There are all of these checks in the ENOSPC code to see if committing the transaction would free up enough space to make the allocation. This is because early on we just committed the transaction and hoped and prayed, which resulted in cases where it took _forever_ to get an ENOSPC when we really were out of space. So we check space_info->bytes_pinned, except this isn't completely true because it doesn't account for space we may free but are stuck in delayed refs. So tests like xfstests 226 would fail because we wouldn't commit the transaction to free up the data space. So instead add a percpu counter that will be a little fuzzier, it will add bytes as soon as we try to free up the space, and remove any space it doesn't actually free up when we get around to doing the actual free. We then 0 out this counter every transaction period so we have a better idea of how much space we will actually free up by committing this transaction. With this patch we now pass xfstests 226. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
1be41b78 |
|
12-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix transaction throttling for delayed refs Dave has this fs_mark script that can make btrfs abort with sufficient amount of ram. This is because with more ram we can keep more dirty metadata in cache which in a round about way makes for many more pending delayed refs. What happens is we end up not throttling the transaction enough so when we go to commit the transaction when we've completely filled the file system we'll abort() because we use all of the space in the global reserve and we still have delayed refs to run. To fix this we need to make the delayed ref flushing and the transaction throttling dependant upon the number of delayed refs that we have instead of how much reserved space is left in the global reserve. With this patch we not only stop aborting transactions but we also get a smoother run speed with fs_mark and it makes us about 10% faster. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
8c2a1a30 |
|
06-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: exclude logged extents before replying when we are mixed With non-mixed block groups we replay the logs before we're allowed to do any writes, so we get away with not pinning/removing the data extents until right when we replay them. However with mixed block groups we allocate out of the same pool, so we could easily allocate a metadata block that was logged in our tree log. To deal with this we just need to notice that we have mixed block groups and do the normal excluding/removal dance during the pin stage of the log replay and that way we don't allocate metadata blocks from areas we have logged data extents. With this patch we now pass xfstests generic/311 with mixed block groups turned on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b382a324 |
|
28-May-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix qgroup rescan resume on mount When called during mount, we cannot start the rescan worker thread until open_ctree is done. This commit restuctures the qgroup rescan internals to enable a clean deferral of the rescan resume operation. First of all, the struct qgroup_rescan is removed, saving us a malloc and some initialization synchronizations problems. Its only element (the worker struct) now lives within fs_info just as the rest of the rescan code. Then setting up a rescan worker is split into several reusable stages. Currently we have three different rescan startup scenarios: (A) rescan ioctl (B) rescan resume by mount (C) rescan by quota enable Each case needs its own combination of the four following steps: (1) set the progress [A, C: zero; B: state of umount] (2) commit the transaction [A] (3) set the counters [A, C: zero; B: state of umount] (4) start worker [A, B, C] qgroup_rescan_init does step (1). There's no extra function added to commit a transaction, we've got that already. qgroup_rescan_zero_tracking does step (3). Step (4) is nothing more than a call to the generic btrfs_queue_worker. We also get rid of a double check for the rescan progress during btrfs_qgroup_account_ref, which is no longer required due to having step 2 from the list above. As a side effect, this commit prepares to move the rescan start code from btrfs_run_qgroups (which is run during commit) to a less time critical section. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
d52be818 |
|
29-May-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: simplify unlink reservations Dave pointed out a problem where if you filled up a file system as much as possible you couldn't remove any files. The whole unlink reservation thing is convoluted because it tries to guess if it's going to add space to unlink something or not, and has all these odd uncommented cases where it simply does not try. So to fix this I've added a way to conditionally steal from the global reserve if we can't make our normal reservation. If we have more than half the space in the global reserve free we will go ahead and steal from the global reserve. With this patch Dave's reproducer now works and I can rm all the files on the file system. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
4a9d8bde |
|
16-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: make the state of the transaction more readable We used 3 variants to track the state of the transaction, it was complex and wasted the memory space. Besides that, it was hard to understand that which types of the transaction handles should be blocked in each transaction state, so the developers often made mistakes. This patch improved the above problem. In this patch, we define 6 states for the transaction, enum btrfs_trans_state { TRANS_STATE_RUNNING = 0, TRANS_STATE_BLOCKED = 1, TRANS_STATE_COMMIT_START = 2, TRANS_STATE_COMMIT_DOING = 3, TRANS_STATE_UNBLOCKED = 4, TRANS_STATE_COMPLETED = 5, TRANS_STATE_MAX = 6, } and just use 1 variant to track those state. In order to make the blocked handle types for each state more clear, we introduce a array: unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { [TRANS_STATE_RUNNING] = 0U, [TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE | __TRANS_START), [TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE | __TRANS_START | __TRANS_ATTACH), [TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE | __TRANS_START | __TRANS_ATTACH | __TRANS_JOIN), [TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE | __TRANS_START | __TRANS_ATTACH | __TRANS_JOIN | __TRANS_JOIN_NOLOCK), [TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE | __TRANS_START | __TRANS_ATTACH | __TRANS_JOIN | __TRANS_JOIN_NOLOCK), } it is very intuitionistic. Besides that, because we remove ->in_commit in transaction structure, so the lock ->commit_lock which was used to protect it is unnecessary, remove ->commit_lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
199c2a9c |
|
15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce per-subvolume ordered extent list The reason we introduce per-subvolume ordered extent list is the same as the per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
eb73c1b7 |
|
15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce per-subvolume delalloc inode list When we create a snapshot, we need flush all delalloc inodes in the fs, just flushing the inodes in the source tree is OK. So we introduce per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b0feb9d9 |
|
15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce grab/put functions for the root of the fs/file tree The grab/put funtions will be used in the next patch, which need grab the root object and ensure it is not freed. We use reference counter instead of the srcu lock is to aovid blocking the memory reclaim task, which invokes synchronize_srcu(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
cb517eab |
|
15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: cleanup the similar code of the fs root read There are several functions whose code is similar, such as btrfs_find_last_root() btrfs_read_fs_root_no_radix() Besides that, some functions are invoked twice, it is unnecessary, for example, we are sure that all roots which is found in btrfs_find_orphan_roots() have their orphan items, so it is unnecessary to check the orphan item again. So cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
babbf170 |
|
14-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: make the snap/subv deletion end more early when the fs is R/O The snapshot/subvolume deletion might spend lots of time, it would make the remount task wait for a long time. This patch improve this problem, we will break the deletion if the fs is remounted to be R/O. It will make the users happy. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
1c89cdd1 |
|
11-May-2013 |
Andreas Philipp <philipp.andreas@gmail.com> |
Minor format cleanup. Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and BTRFS_BLOCK_GROUP_RAID6. Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
57254b6e |
|
06-May-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add ioctl to wait for qgroup rescan completion btrfs_qgroup_wait_for_completion waits until the currently running qgroup operation completes. It returns immediately when no rescan process is in progress. This is useful to automate things around the rescan process (e.g. testing). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
1e8f9158 |
|
06-May-2013 |
Wang Shilong <wangsl-fnst@cn.fujitsu.com> |
Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time when we want to walk qgroup tree. By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free() once. This reduce some sys time to allocate memory, see the measurements below fsstress -p 4 -n 10000 -d $dir With this patch: real 0m50.153s user 0m0.081s sys 0m6.294s real 0m51.113s user 0m0.092s sys 0m6.220s real 0m52.610s user 0m0.096s sys 0m6.125s avg 6.213 ----------------------------------------------------- Without the patch: real 0m54.825s user 0m0.061s sys 0m10.665s real 1m6.401s user 0m0.089s sys 0m11.218s real 1m13.768s user 0m0.087s sys 0m10.665s avg 10.849 we can see the sys time reduce ~43%. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b1c79e09 |
|
09-May-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: handle running extent ops with skinny metadata Chris hit a bug where we weren't finding extent records when running extent ops. This is because we use the delayed_ref_head when running the extent op, which means we can't use the ->type checks to see if we are metadata. We also lose the level of the metadata we are working on. So to fix this we can just check the ->is_data section of the extent_op, and we can store the level of the buffer we were modifying in the extent_op. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
60b62978 |
|
30-Apr-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: annotate quota tree for lockdep Quota tree has been missing from lockdep annotations, though no warning has been seen in the wild. There's currently one entry that does not belong there, BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy & paste mistake, the id is defined among tree ids. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
1104a885 |
|
06-Mar-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: enhance superblock checks The superblock checksum is not verified upon mount. <awkward silence> Add that check and also reorder existing checks to a more logical order. Current mkfs.btrfs does not calculate the correct checksum of super_block and thus a freshly created filesytem will fail to mount when this patch is applied. First transaction commit calculates correct superblock checksum and saves it to disk. Reproducer: $ mfks.btrfs /dev/sda $ mount /dev/sda /mnt $ btrfs scrub start /mnt $ sleep 5 $ btrfs scrub status /mnt ... super:2 ... Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b6919a58 |
|
29-Apr-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: fix misleading variable name for flags The variable was named 'data' in btrfs_reserve_extent and that's the only function that actually uses it to let btrfs_get_alloc_profile know what profile we want. Then it's passed down as u64 flags. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
48a3b636 |
|
25-Apr-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: make static code static & remove dead code Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2f232036 |
|
25-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: rescan for qgroups If qgroup tracking is out of sync, a rescan operation can be started. It iterates the complete extent tree and recalculates all qgroup tracking data. This is an expensive operation and should not be used unless required. A filesystem under rescan can still be umounted. The rescan continues on the next mount. Status information is provided with a separate ioctl while a rescan operation is in progress. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
fc36ed7e |
|
24-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log Sequence numbers for delayed refs have been introduced in the first version of the qgroup patch set. To solve the problem of find_all_roots on a busy file system, the tree mod log was introduced. The sequence numbers for that were simply shared between those two users. However, at one point in qgroup's quota accounting, there's a statement accessing the previous sequence number, that's still just doing (seq - 1) just as it would have to in the very first version. To satisfy that requirement, this patch makes the sequence number counter 64 bit and splits it into a major part (used for qgroup sequence number counting) and a minor part (incremented for each tree modification in the log). This enables us to go exactly one major step backwards, as required for qgroups, while still incrementing the sequence counter for tree mod log insertions to keep track of their order. Keeping them in a single variable means there's no need to change all the code dealing with comparisons of two sequence numbers. The sequence number is reset to 0 on commit (not new in this patch), which ensures we won't overflow the two 32 bit counters. Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs from the tree mod log code may happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5fbf83c1 |
|
19-Apr-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: delete unused parameter to btrfs_read_root_item() Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
4b90c680 |
|
15-Apr-2013 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: remove unused argument of btrfs_extend_item() Argument 'trans' is not used in btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
afe5fea7 |
|
15-Apr-2013 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: cleanup of function where fixup_low_keys() is called If argument 'trans' is unnecessary in the function where fixup_low_keys() is called, 'trans' is deleted. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
ceda0864 |
|
11-Apr-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use a lock to protect incompat/compat flag of the super block The following case will make the incompat/compat flag of the super block be recovered. Task1 |Task2 flags = btrfs_super_incompat_flags(); | |flags = btrfs_super_incompat_flags(); flags |= new_flag1; | |flags |= new_flag2; btrfs_set_super_incompat_flags(flags); | |btrfs_set_super_incompat_flags(flags); the new_flag1 is recovered. In order to avoid this problem, we introduce a lock named super_lock into the btrfs_fs_info structure. If we want to update incompat/compat flags of the super block, we must hold it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
f2f6ed3d |
|
07-Apr-2013 |
Wang Shilong <wangsl-fnst@cn.fujitsu.com> |
Btrfs: introduce a mutex lock for btrfs quota operations The original code has one spin_lock 'qgroup_lock' to protect quota configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY, it will be added to Btree firstly, and then update configurations in memory,however, a race condition may happen between these operations. For example: ->add_qgroup_info_item() ->add_qgroup_rb() For the above case, del_qgroup_info_item() may happen just before add_qgroup_rb(). What's worse, when we want to add a qgroup relation: ->add_qgroup_relation_item() ->add_qgroup_relations() We don't have any checks whether 'src' and 'dst' exist before add_qgroup_relation_item(), a race condition can also happen for the above case. To avoid race condition and have all the necessary checks, we introduce a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations protected by the mutex lock. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
09a2a8f9 |
|
05-Apr-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix bad extent logging A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
c2cf52eb |
|
19-Mar-2013 |
Simon Kirby <sim@hostway.ca> |
Btrfs: Include the device in most error printk()s With more than one btrfs volume mounted, it can be very difficult to find out which volume is hitting an error. btrfs_error() will print this, but it is currently rigged as more of a fatal error handler, while many of the printk()s are currently for debugging and yet-unhandled cases. This patch just changes the functions where the device information is already available. Some cases remain where the root or fs_info is not passed to the function emitting the error. This may introduce some confusion with volumes backed by multiple devices emitting errors referring to the primary device in the set instead of the one on which the error occurred. Use btrfs_printk(fs_info, format, ...) rather than writing the device string every time, and introduce macro wrappers ala XFS for brevity. Since the function already cannot be used for continuations, print a newline as part of the btrfs_printk() message rather than at each caller. Signed-off-by: Simon Kirby <sim@hostway.ca> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
3173a18f |
|
07-Mar-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add a incompatible format change for smaller metadata extent refs We currently store the first key of the tree block inside the reference for the tree block in the extent tree. This takes up quite a bit of space. Make a new key type for metadata which holds the level as the offset and completely removes storing the btrfs_tree_block_info inside the extent ref. This reduces the size from 51 bytes to 33 bytes per extent reference for each tree block. In practice this results in a 30-35% decrease in the size of our extent tree, which means we COW less and can keep more of the extent tree in memory which makes our heavy metadata operations go much faster. This is not an automatic format change, you must enable it at mkfs time or with btrfstune. This patch deals with having metadata stored as either the old format or the new format so it is easy to convert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
08748810 |
|
12-Mar-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: clean up transaction abort messages The transaction abort stacktrace is printed only once per module lifetime, but we'd like to see it each time it happens per mounted filesystem. Introduce a fs_state flag that records it. Tweak the messages around abort: * add error number to the first abort * print the exact negative errno from btrfs_decode_error * clean up btrfs_decode_error and callers * no dots at the end of the messages Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
d5c12070 |
|
28-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong reserved space in qgroup during snap/subv creation There are two problems in the space reservation of the snapshot/ subvolume creation. - don't reserve the space for the root item insertion - the space which is reserved in the qgroup is different with the free space reservation. we need reserve free space for 7 items, but in qgroup reservation, we need reserve space only for 3 items. So we implement new metadata reservation functions for the snapshot/subvolume creation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
dc81cdc5 |
|
20-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix remount vs autodefrag If we remount the fs to close the auto defragment or make the fs R/O, we should stop the auto defragment. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
cdb4c574 |
|
19-Feb-2013 |
Zach Brown <zab@redhat.com> |
btrfs: define BTRFS_MAGIC as a u64 value super.magic is an le64 but it's treated as an unterminated string when compared against BTRFS_MAGIC which is defined as a string. Instead define BTRFS_MAGIC as a normal hex value and use endian helpers to compare it to the super's magic. I tested this by mounting an fs made before the change and made sure that it didn't introduce sparse errors. This matches a similar cleanup that is pending in btrfs-progs. David Sterba pointed out that we should fix the kernel side as well :). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
569e0f35 |
|
13-Feb-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: place ordered operations on a per transaction list Miao made the ordered operations stuff run async, which introduced a deadlock where we could get somebody (sync) racing in and committing the transaction while a commit was already happening. The new committer would try and flush ordered operations which would hang waiting for the commit to finish because it is done asynchronously and no longer inherits the callers trans handle. To fix this we need to make the ordered operations list a per transaction list. We can get new inodes added to the ordered operation list by truncating them and then having another process writing to them, so this makes it so that anybody trying to add an ordered operation _must_ start a transaction in order to add itself to the list, which will keep new inodes from getting added to the ordered operations list after we start committing. This should fix the deadlock and also keeps us from doing a lot more work than we need to during commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
210549eb |
|
09-Feb-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: add cancellation points to defrag The defrag operation can take very long, we want to have a way how to cancel it. The code checks for a pending signal at safe points in the defrag loops and returns EAGAIN. This means a user can press ^C after running 'btrfs fi defrag', woks for both defrag modes, files and root. Returning from the command was instant in my light tests, but may take longer depending on the aging factor of the filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5d80366e |
|
07-Feb-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: steal from global reserve if we are cleaning up orphans Sometimes xfstest 83 will fail to remount the scratch device because we've gotten ourselves so full that we cannot cleanup the orphan items. In this case check to see if we're doing the orphan cleanup and if we are allow us to steal our reservation from the global block rsv. With this patch I've not been able to reproduce the failed mount problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
de78b51a |
|
31-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: remove cache only arguments from defrag path The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
aa43a17c |
|
30-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: handle null fs_info in btrfs_panic() At least backref_tree_panic() can apparently pass in a null fs_info, so handle that in __btrfs_panic to get the message out on the console. The btrfs_panic macro also uses fs_info, but that's largely pointless; it's testing to see if BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set. But if it *were* set, __btrfs_panic() would have, well, paniced and we wouldn't be here, testing it! So just BUG() at this point. And since we only use fs_info once now, just use it directly. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
87533c47 |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use bit operation for ->fs_state There is no lock to protect fs_info->fs_state, it will introduce some problems, such as the value may be covered by the other task when several tasks modify it. For example: Task0 - CPU0 Task1 - CPU1 mov %fs_state rax or $0x1 rax mov %fs_state rax or $0x2 rax mov rax %fs_state mov rax %fs_state The expected value is 3, but in fact, it is 2. Though this problem doesn't happen now (because there is only one flag currently), the code is error prone, if we add other flags, the above problem will happen to a certainty. Now we use bit operation for it to fix the above problem. In this way, we can make the code more robust and be easy to add new flags. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
de98ced9 |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, it may introduce some problem, such as the wrong profile information, so we add a seqlock to protect them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
963d678b |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use percpu counter for fs_info->delalloc_bytes fs_info->delalloc_bytes is accessed very frequently, so use percpu counter instead of the u64 variant for it to reduce the lock contention. This patch also fixed the problem that we access the variant without the lock protection.At worst, we would not flush the delalloc inodes, and just return ENOSPC error when we still have some free space in the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
e2d84521 |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use percpu counter for dirty metadata count ->dirty_metadata_bytes is accessed very frequently, so use percpu counter instead of the u64 variant to reduce the contention of the lock. This patch also fixed the problem that we access it without lock protection in __btrfs_btree_balance_dirty(), which may cause we skip the dirty pages flush. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
c018daec |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: protect fs_info->alloc_start fs_info->alloc_start is a 64bits variant, can be accessed by multi-task, but it is not protected strictly, it can be changed while we are accessing it. On 32bit machine, we will get wrong value because we access it by two instructions.(In fact, it is also possible that the same problem happens on the 64bit machine, because the compiler may split the 64bit operation into two 32bit operation.) For example: Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning, then we remount and set ->alloc_start to 0x0000 0100 0000 0000. Task0 Task1 load high 32 bits set high 32 bits set low 32 bits load low 32 bits Task1 will get 0. This patch fixes this problem by using two locks to protect it fs_info->chunk_mutex sb->s_umount On the read side, we just need get one of these two locks, and on the write side, we must lock all of them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
8c6a3ee6 |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: add a comment for fs_info->max_inline Though ->max_inline is a 64bit variant, and may be accessed by multi-task, but it is just suggestive number, so we needn't add anything to protect fs_info->max_inline, just add a comment to explain wny we don't use a lock to protect it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
55e301fd |
|
28-Jan-2013 |
Filipe Brandenburger <filbranden@google.com> |
Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h The header file will then be installed under /usr/include/linux so that userspace applications can refer to Btrfs ioctls by name and use the same structs used internally in the kernel. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
e6ec716f |
|
16-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: make raid attr array more readable The current code of raid attr arry is hard to understand and it is easy to introduce some problem if we modify the array. So I changed it and made it more readable. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
a1897fdd |
|
27-Dec-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: record first logical byte in memory This'd save us a rbtree search which may become expensive in large filesystem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
dcfac415 |
|
27-Dec-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2ab28f32 |
|
12-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: wait on ordered extents at the last possible moment Since we don't actually copy the extent information from the source tree in the fast case we don't need to wait for ordered io to be completed in order to fsync, we just need to wait for the io to be completed. So when we're logging our file just attach all of the ordered extents to the log, and then when the log syncs just wait for IO_DONE on the ordered extents and then write the super. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
4ae10b3a |
|
31-Jan-2013 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: Add a stripe cache to raid56 The stripe cache allows us to avoid extra read/modify/write cycles by caching the pages we read off the disk. Pages are cached when: * They are read in during a read/modify/write cycle * They are written during a read/modify/write cycle * They are involved in a parity rebuild Pages are not cached if we're doing a full stripe write. We're assuming that a full stripe write won't be followed by another partial stripe write any time soon. This provides a substantial boost in performance for workloads that synchronously modify adjacent offsets in the file, and for the parity rebuild use case in general. The size of the stripe cache isn't tunable (yet) and is set at 1024 entries. Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct Without the stripe cache -- 2.1MB/s With the stripe cache 21MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
53b381b3 |
|
29-Jan-2013 |
David Woodhouse <David.Woodhouse@intel.com> |
Btrfs: RAID5 and RAID6 This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
64a16701 |
|
15-Jul-2009 |
David Woodhouse <David.Woodhouse@intel.com> |
Btrfs: add rw argument to merge_bio_hook() We'll want to merge writes so they can fill a full RAID[56] stripe, but not necessarily reads. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9c52057c |
|
17-Dec-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: fix hash overflow handling The handling for directory crc hash overflows was fairly obscure, split_leaf returns EOVERFLOW when we try to extend the item and that is supposed to bubble up to userland. For a while it did so, but along the way we added better handling of errors and forced the FS readonly if we hit IO errors during the directory insertion. Along the way, we started testing only for EEXIST and the EOVERFLOW case was dropped. The end result is that we may force the FS readonly if we catch a directory hash bucket overflow. This fixes a few problem spots. First I add tests for EOVERFLOW in the places where we can safely just return the error up the chain. btrfs_rename is harder though, because it tries to insert the new directory item only after it has already unlinked anything the rename was going to overwrite. Rather than adding very complex logic, I added a helper to test for the hash overflow case early while it is still safe to bail out. Snapshot and subvolume creation had a similar problem, so they are using the new helper now too. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Pascal Junod <pascal@junod.info>
|
#
31e50229 |
|
21-Nov-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: put raid properties into global table Raid properties can be shared among raid calculation code, we can put them into a global table to keep it simple. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ad914559 |
|
15-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: don't memset new tokens Our token logic depends on token->kaddr being set, and if it is not it sets everything properly as needed. So instead of memsetting just set token->kaddr to NULL. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
70c8a91c |
|
11-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: log changed inodes based on the extent map tree We don't really need to copy extents from the source tree since we have all of the information already available to us in the extent_map tree. So instead just write the extents straight to the log tree and don't bother to copy the extent items from the source tree. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d6393786 |
|
12-Dec-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add path->really_keep_locks You'd think path->keep_locks would keep all the locks wouldn't you? You'd be wrong. It only keeps them if the slot is pointing to the last item in the node. This is for use with btrfs_next_leaf, which needs this sort of thing. But the horrible horrible things I'm going to do to the tree log means I really need everything held from root to leaf so I can add and delete items in the same search. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5f3ab90a |
|
07-Dec-2012 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename root_times_lock to root_item_lock Originally root_times_lock was introduced as part of send/receive code however newly developed patch to label the subvol reused the same lock, so renaming it for a meaningful name. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
26176e7c |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: restructure btrfs_run_defrag_inodes() This patch restructure btrfs_run_defrag_inodes() and make the code of the auto defragment more readable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9247f317 |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use slabs for auto defrag allocation The auto defrag allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. And besides that, it can do check for leaked objects when the module is removed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
72d7aefc |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace This change of the define is effective in all modes, it is required and used only in the case when a device replace procedure is running. The reason is that during an active device replace procedure, the target device of the copy operation is a mirror for the filesystem data as well that can be used to read data in order to repair read errors on other disks. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
29a8d9a0 |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block() Before this commit, btrfs_map_block() was called with REQ_WRITE in order to retrieve the list of mirrors for a disk block. This needs to be changed for the device replace procedure since it makes a difference whether you are asking for read mirrors or for locations to write to. GET_READ_MIRRORS is introduced as a new interface to call btrfs_map_block(). In the current commit, the functionality is not yet changed, only the interface for GET_READ_MIRRORS is introduced and all the places that should use this new interface are adapted. The reason that REQ_WRITE cannot be abused anymore to retrieve a list of read mirrors is that during a running dev replace operation all write requests to the live filesystem are duplicated to also write to the target drive. Keep in mind that the target disk is only partially a valid copy of the source disk while the operation is ongoing. All writes go to the target disk, but not all reads would return valid data on the target disk. Therefore it is not possible anymore to abuse a REQ_WRITE interface to find valid mirrors for a REQ_READ. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e93c89c1 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add new sources for device replace code This adds a new file to the sources together with the header file and the changes to ioctl.h and ctree.h that are required by the new C source file. Additionally, 4 new functions are added to volume.c that deal with device creation and destruction. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ff023aac |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add code to scrub to copy read data to another disk The device replace procedure makes use of the scrub code. The scrub code is the most efficient code to read the allocated data of a disk, i.e. it reads sequentially in order to avoid disk head movements, it skips unallocated blocks, it uses read ahead mechanisms, and it contains all the code to detect and repair defects. This commit adds code to scrub to allow the scrub code to copy read data to another disk. One goal is to be able to perform as fast as possible. Therefore the write requests are collected until huge bios are built, and the write process is decoupled from the read process with some kind of flow control, of course, in order to limit the allocated memory. The best performance on spinning disks could by reached when the head movements are avoided as much as possible. Therefore a single worker is used to interface the read process with the write process. The regular scrub operation works as fast as before, it is not negatively influenced and actually it is more or less unchanged. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
63a212ab |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: disallow some operations on the device replace target device This patch adds some code to disallow operations on the device that is used as the target for the device replace operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5ac00add |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: disallow mutually exclusive admin operations from user mode Btrfs admin operations that are manually started from user mode and that cannot be executed at the same time return -EINPROGRESS. A common way to enter and leave this locked section is introduced since it used to be specific to the balance operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
a2bff640 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: introduce a btrfs_dev_replace_item type Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e922e087 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: enhance btrfs structures for device replace support Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
aa1b8cd4 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: pass fs_info instead of root A small number of functions that are used in a device replace procedure when the operation is resumed at mount time are unable to pass the same root pointer that would be used in the regular (ioctl) context. And since the root pointer is not required, only the fs_info is, the root pointer argument is replaced with the fs_info pointer argument. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
315a9850 |
|
01-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong file extent length There are two types of the file extent - inline extent and regular extent, When we log file extents, we didn't take inline extent into account, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d1423248 |
|
31-Oct-2012 |
Masanari Iida <standby24x7@gmail.com> |
Btrfs: Fix typo in fs/btrfs Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
8ccf6f19 |
|
25-Oct-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: make delalloc inodes be flushed by multi-task This patch introduce a new worker pool named "flush_workers", and if we want to force all the inode with pending delalloc to the disks, we can queue those inodes into the work queue of the worker pool, in this way, those inodes will be flushed by multi-task. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
08e007d2 |
|
16-Oct-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: improve the noflush reservation In some places(such as: evicting inode), we just can not flush the reserved space of delalloc, flushing the delayed directory index and delayed inode is OK, but we don't try to flush those things and just go back when there is no enough space to be reserved. This patch fixes this problem. We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL. If we can in the transaction, we should not flush anything, or the deadlock would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used, and we will flush all things. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
926ccfef |
|
31-Oct-2012 |
Masanari Iida <standby24x7@gmail.com> |
Btrfs: Fix printk and variable name Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
be6aef60 |
|
22-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot On a really full file system I was getting ENOSPC back from btrfs_update_inode when trying to update the parent inode when creating a snapshot. Just use the fallback method so we can update the inode and not have to worry about having a delayed ref. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5b6602e7 |
|
23-Oct-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: determine level of old roots In btrfs_find_all_roots' termination condition, we compare the level of the old buffer we got from btrfs_search_old_slot to the level of the current root node. We'd better compare it to the level of the rewinded root node. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
5af3e8cc |
|
01-Aug-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: make filesystem read-only when submitting barrier fails So far the return code of barrier_all_devices() is ignored, which means that errors are ignored. The result can be a corrupt filesystem which is not consistent. This commit adds code to evaluate the return code of barrier_all_devices(). The normal btrfs_error() mechanism is used to switch the filesystem into read-only mode when errors are detected. In order to decide whether barrier_all_devices() should return error or success, the number of disks that are allowed to fail the barrier submission is calculated. This calculation accounts for the worst RAID level of metadata, system and data. If single, dup or RAID0 is in use, a single disk error is already considered to be fatal. Otherwise a single disk error is tolerated. The calculation of the number of disks that are tolerated to fail the barrier operation is performed when the filesystem gets mounted, when a balance operation is started and finished, and when devices are added or removed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
f186373f |
|
08-Aug-2012 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: extended inode refs This patch adds basic support for extended inode refs. This includes support for link and unlink of the refs, which basically gets us support for rename as well. Inode creation does not need changing - extended refs are only added after the ref array is full. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
005d6427 |
|
18-Sep-2012 |
David Sterba <dsterba@suse.cz> |
btrfs: move transaction aborts to the point of failure Call btrfs_abort_transaction as early as possible when an error condition is detected, that way the line number reported is useful and we're not clueless anymore which error path led to the abort. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
2e90cf85 |
|
14-Sep-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: cleanup fs_info->hashers fs_info->hashers is now an obsolete one. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
ea658bad |
|
11-Sep-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: delay block group item insertion So we have lots of places where we try to preallocate chunks in order to make sure we have enough space as we make our allocations. This has historically meant that we're constantly tweaking when we should allocate a new chunk, and historically we have gotten this horribly wrong so we way over allocate either metadata or data. To try and keep this from happening we are going to make it so that the block group item insertion is done out of band at the end of a transaction. This will allow us to create chunks even if we are trying to make an allocation for the extent tree. With this patch my enospc tests run faster (didn't expect this) and more efficiently use the disk space (this is what I wanted). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
0647d6bd |
|
07-Sep-2012 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: cleanup for unused ref cache stuff As ref cache has been removed from btrfs, there is no user on its lock and its check. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
2ecb7923 |
|
06-Sep-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected ->log_batch We forget to protect ->log_batch when syncing a file, this patch fix this problem by atomic operation. And ->log_batch is used to check if there are parallel sync operations or not, so it is unnecessary to reset it to 0 after the sync operation of the current log tree complete. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
66d8f3dd |
|
06-Sep-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: add a new "type" field into the block reservation structure Sometimes we need choose the method of the reservation according to the type of the block reservation, such as the reservation for the delayed inode update. Now we identify the type just by comparing the address of the reservation variants, it is very ugly if it is a temporary one because we need compare it with all the common reservation variants. So we add a new "type" field to keep the type the reservation variants. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
7014cdb4 |
|
30-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: btrfs_drop_extent_cache should never fail I noticed this when I was doing the fsync stuff, we allocate split extents if we drop an extent range that is in the middle of an existing extent. This BUG()'s if we fail to allocate memory, but the fact is this is just a cache, we will just regenerate the cache if we need it, the important part is that we free the range we are given. This can be done without allocations, so if we fail to allocate splits just skip the splitting stage and free our em and look for more extents to drop. This also makes btrfs_drop_extent_cache a void since nobody was checking the return value anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2aaa6655 |
|
29-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add hole punching This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2671485d |
|
28-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: remove unused hint byte argument for btrfs_drop_extents I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway so lets just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
ca7e70f5 |
|
27-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: do not needlessly restart the transaction for enospc We will stop and restart a transaction every time we move to a different leaf when truncating a file. This is for enospc reasons, but really we could probably get away with doing this a little better by actually working until we hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do truncates which will fail as soon as the block rsv runs out of space, and then at that point we can stop and restart the transaction and refill the block rsv and carry on. This will make rm'ing of a file with lots of extents a bit faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5dc562c5 |
|
17-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
527a1361 |
|
05-Sep-2012 |
Wang Sheng-Hui <shhuiw@gmail.com> |
btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID It should be storing, not sotring. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
1fa11e26 |
|
06-Aug-2012 |
Arne Jansen <sensille@gmx.net> |
Btrfs: fix deadlock in wait_for_more_refs Commit a168650c introduced a waiting mechanism to prevent busy waiting in btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations, where a tree_mod_seq is held while waiting for the io to complete, while the end_io calls btrfs_run_delayed_refs. This whole mechanism is unnecessary. If not enough runnable refs are available to satisfy count, just return as count is more like a guideline than a strict requirement. In case we have to run all refs, commit transaction makes sure that no other threads are working in the transaction anymore, so we just assert here that no refs are blocked. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c329861d |
|
03-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: don't allocate a seperate csums array for direct reads We've been allocating a big array for csums instead of storing them in the io_tree like we do for buffered reads because previously we were locking the entire range, so we didn't have an extent state for each sector of the range. But now that we do the range locking as we map the buffers we can limit the mapping lenght to sectorsize and use the private part of the io_tree for our csums. This allows us to avoid an extra memory allocation for direct reads which could incur latency. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
533574c6 |
|
30-Jul-2012 |
Joe Perches <joe@perches.com> |
btrfs: use printk_get_level and printk_skip_level, add __printf, fix fallout Use the generic printk_get_level() to search a message for a kern_level. Add __printf to verify format and arguments. Fix a few messages that had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks to shrink the object size a bit when not using printk. [akpm@linux-foundation.org: whitespace tweak] Signed-off-by: Joe Perches <joe@perches.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7069830a |
|
05-Jun-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: add btrfs_compare_trees function This function is used to find the differences between two trees. The tree compare skips whole subtrees if it detects shared tree blocks and thus is pretty fast. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
|
#
8ea05e3a |
|
25-Jul-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: introduce subvol uuids and times This patch introduces uuids for subvolumes. Each subvolume has it's own uuid. In case it was snapshotted, it also contains parent_uuid. In case it was received, it also contains received_uuid. It also introduces subvolume ctime/otime/stime/rtime. The first two are comparable to the times found in inodes. otime is the origin/creation time and ctime is the change time. stime/rtime are only valid on received subvolumes. stime is the time of the subvolume when it was sent. rtime is the time of the subvolume when it was received. Additionally to the times, we have a transid for each time. They are updated at the same place as the times. btrfs receive uses stransid and rtransid to find out if a received subvolume changed in the meantime. If an older kernel mounts a filesystem with the extented fields, all fields become invalid. The next mount with a new kernel will detect this and reset the fields. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
|
#
2b0ce2c2 |
|
24-Jul-2012 |
Mitch Harder <mitch.harder@sabayonlinux.org> |
Btrfs: Check INCOMPAT flags on remount and add helper function In support of the recently added capability to remount with lzo compression, provide a helper function to check the compression INCOMPAT flags when remounting with lzo compression, and set the flags if necessary. Also, implement the new helper function when defragmenting with explicit lzo compression and when setting the default subvolume. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e6793769 |
|
13-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: add helper for tree enumeration Often no exact match is wanted but just the next lower or higher item. There's a lot of duplicated code throughout btrfs to deal with the corner cases. This patch adds a helper function that can facilitate searching. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
18077bb4 |
|
09-Jul-2012 |
Li Zefan <lizefan@huawei.com> |
Btrfs: rewrite BTRFS_SETGET_FUNCS BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and btrfs_foo() functions, which read and write specific fields in the extent buffer. The total number of set/get functions is ~200, but in fact we only need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64. It results in redunction of ~37K bytes. text data bss dec hex filename 629661 12489 216 642366 9cd3e fs/btrfs/btrfs.o.orig 592637 12489 216 605342 93c9e fs/btrfs/btrfs.o Signed-off-by: Li Zefan <lizefan@huawei.com>
|
#
b4d7c3c9 |
|
09-Jul-2012 |
Li Zefan <lizefan@huawei.com> |
Btrfs: kill free_space pointer from inode structure Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type, which means every inode has the same BTRFS_I(inode)->free_space pointer. This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits). Signed-off-by: Li Zefan <lizefan@huawei.com>
|
#
bcef60f2 |
|
13-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: quota tree support and startup Init the quota tree along with the others on open_ctree and close_ctree. Add the quota tree to the list of well known trees in btrfs_read_fs_root_no_name. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
bed92eae |
|
28-Jun-2012 |
Arne Jansen <sensille@gmx.net> |
Btrfs: qgroup implementation and prototypes Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
416ac51d |
|
12-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: qgroup state and initialization Add state to fs_info. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
2f38b3e1 |
|
13-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: add helper for tree enumeration Often no exact match is wanted but just the next lower or higher item. There's a lot of duplicated code throughout btrfs to deal with the corner cases. This patch adds a helper function that can facilitate searching. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
630dc772 |
|
13-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: qgroup on-disk format Not all features are in use by the current version and thus may change in the future. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
097b8a7c |
|
21-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: join tree mod log code with the code holding back delayed refs We've got two mechanisms both required for reliable backref resolving (tree mod log and holding back delayed refs). You cannot make use of one without the other. So instead of requiring the user of this mechanism to setup both correctly, we join them into a single interface. Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list as we did before, which was of no value. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
1c8f52a5 |
|
19-Jun-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: introduce btrfs_next_old_item We introduce btrfs_next_old_item that uses btrfs_next_old_leaf instead of btrfs_next_leaf. btrfs_next_item is also changed to simply call btrfs_next_old_item with time_seq being 0. Signed-off-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3d7806ec |
|
11-Jun-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add btrfs_next_old_leaf To make sense of the tree mod log, the backref walker not only needs btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was calling btrfs_search_slot. This obviously didn't give the correct result. This commit adds btrfs_next_old_leaf, a drop-in replacement for btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot with this time_seq parameter. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
e41f941a |
|
26-Mar-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: move over to use ->update_time Btrfs had been doing it's own file_update_time so we could catch ENOSPC properly, so just update our btrfs_update_time to work with the new stuff and then we'll be fancy later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
95a06077 |
|
29-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: use delayed ref sequence numbers for all fs-tree updates The sequence number for delayed refs is needed to postpone certain delayed refs for a very short period while walking backrefs. Before the tree modification log, we thought we'd only have to hold back those references that don't have a counter operation. While now we've the tree mod log, we're rewinding fs tree blocks to a defined consistent state. We cannot know in advance for which tree block we'll be doing rewind operations later. Therefore, we must postpone all the delayed refs for fs-tree blocks, even those having a counter operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
3d136a11 |
|
03-Feb-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: set ioprio of scrub readahead to idle Reduce ioprio class of scrub readahead threads to idle priority. This setting is fixed. This priority has shown the best performance during all measurements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
733f4fbb |
|
25-May-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: read device stats on mount, write modified ones during commit The device statistics are written into the device tree with each transaction commit. Only modified statistics are written. When a filesystem is mounted, the device statistics for each involved device are read from the device tree and used to initialize the counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
8a35d95f |
|
23-May-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
5d9e75c4 |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add btrfs_search_old_slot The tree modification log together with the current state of the tree gives a consistent, old version of the tree. btrfs_search_old_slot is used to search through this old version and return old (dummy!) extent buffers. Naturally, this function cannot do any tree modifications. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
bd989ba3 |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add tree modification log functions The tree mod log will log modifications made fs-tree nodes. Most modifications are done by autobalance of the tree. Such changes are recorded as long as a block entry exists. When released, the log is cleaned. With the tree modification log, it's possible to reconstruct a consistent old state of the tree. This is required to do backref walking on a busy file system. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
f29021b2 |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add tree mod log to fs_info Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
64947ec0 |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: move struct seq_list to ctree.h Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
5581a51a |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: don't set for_cow parameter for tree block functions Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
25cd999e |
|
30-Mar-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: fix that check_int_data mount option was ignored The bitfield member mount_opt was too small by one bit to hold the mount option that enabled to include data extents in the integrity checker. Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR option was added (git rebase silently merges so that the increase of the size of the bitfield member is lost), the bit limit was removed entirely. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
6ed3cf2c |
|
13-Apr-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: btrfs_root_readonly() broken on big-endian ->root_flags is __le64 and all accesses to it go through the helpers that do proper conversions. Except for btrfs_root_readonly(), which checks bit 0 as in host-endian... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
94598ba8 |
|
27-Mar-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: introduce common define for max number of mirrors Readahead already has a define for the max number of mirrors. Scrub needs such a define now, the rest of the code will need something like this soon. Therefore the define was added to ctree.h and removed from the readahead code. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0c460c0d |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: move alloc_profile_is_valid() to volumes.c Header file is not a good place to define functions. This also moves a call to alloc_profile_is_valid() down the stack and removes a redundant check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
e8920a64 |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: make profile_is_valid() check more strict "0" is a valid value for an on-disk chunk profile, but it is not a valid extended profile. (We have a separate bit for single chunks in extended case) Also rename it to alloc_profile_is_valid() for clarity. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
899c81ea |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add wrappers for working with alloc profiles Add functions to abstract the conversion between chunk and extended allocation profile formats and switch everybody to use them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
cfed81a0 |
|
03-Mar-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add the ability to cache a pointer into the eb This cuts down on the CPU time used by map_private_extent_buffer Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
727011e0 |
|
06-Aug-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allow metadata blocks larger than the page size A few years ago the btrfs code to support blocks lager than the page size was disabled to fix a few corner cases in the page cache handling. This fixes the code to properly support large metadata blocks again. Since current kernels will crash early and often with larger metadata blocks, this adds an incompat bit so that older kernels can't mount it. This also does away with different blocksizes for nodes and leaves. You get a single block size for all tree blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
81c9ad23 |
|
18-Jan-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: remove search_start and search_end from find_free_extent and callers We have been passing nothing but (u64)-1 to find_free_extent for search_end in all of the callers, so it's completely useless, and we've always been passing 0 in as search_start, so just remove them as function arguments and move search_start into find_free_extent. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
49b25e05 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: enhance transaction abort infrastructure Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
4da35113 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: add varargs to btrfs_error btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
2c536799 |
|
03-Oct-2011 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_drop_snapshot should return int Commit cb1b69f4 (Btrfs: forced readonly when btrfs_drop_snapshot() fails) made btrfs_drop_snapshot return void because there were no callers checking the return value. That is the wrong order to handle error propogation since the caller will have no idea that an error has occured and continue on as if nothing went wrong. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
143bede5 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: return void in functions without error conditions Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
b45a9d8b |
|
03-Oct-2011 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_update_root error push-up btrfs_update_root BUG's when it can't alloc a path, yet it can recover from a search error. This patch returns -ENOMEM instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
8c342930 |
|
03-Oct-2011 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: Add btrfs_panic() As part of the effort to eliminate BUG_ON as an error handling technique, we need to determine which errors are actual logic errors, which are on-disk corruption, and which are normal runtime errors e.g. -ENOMEM. Annotating these error cases is helpful to understand and report them. This patch adds a btrfs_panic() routine that will either panic or BUG depending on the new -ofatal_errors={panic,bug} mount option. Since there are still so many BUG_ONs, it defaults to BUG for now but I expect that to change once the error handling effort has made significant progress. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
c08782da |
|
26-Jan-2012 |
David Sterba <dsterba@suse.cz> |
btrfs: fix structs where bitfields and spinlock/atomic share 8B word On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has disaterous effect. https://lkml.org/lkml/2012/2/1/220 Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
a7e99c69 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for canceling restriper Implement an ioctl for canceling restriper. Currently we wait until relocation of the current block group is finished, in future this can be done by triggering a commit. Balance item is deleted and no memory about the interrupted balance is kept. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
837d5b6e |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for pausing restriper Implement an ioctl for pausing restriper. This pauses the relocation, but balance is still considered to be "in progress": balance item is not deleted, other volume operations cannot be started, etc. If paused in the middle of profile changing operation we will continue making allocations with the target profile. Add a hook to close_ctree() to pause restriper and free its data structures on unmount. (It's safe to unmount when restriper is in "paused" state, we will resume with the same parameters on the next mount) Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
9555c6c1 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add skip_balance mount option Since restriper kthread starts involuntarily on mount and can suck cpu and memory bandwidth add a mount option to forcefully skip it. The restriper in that case hangs around in paused state and can be resumed from userspace when it's convenient. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
0940ebf6 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: save balance parameters to disk Introduce a new btree objectid for storing balance item. The reason is to be able to resume restriper after a crash with the same parameters. Balance item has a very high objectid and goes into tree of tree roots. The key for the new item is as follows: [ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ] Older kernels simply ignore it so it's safe to mount with an older kernel and then go back to the newer one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
70922617 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: do not reduce profile in do_chunk_alloc() Every caller of do_chunk_alloc() feeds it the reduced allocation profile, so stop trying to reduce it one more time. Instead check the validity of the passed profile. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
c9e9f97b |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add basic restriper infrastructure Add basic restriper infrastructure: extended balancing ioctl and all related ioctl data structures, add data structure for tracking restriper's state to fs_info, etc. The semantics of the old balancing ioctl are fully preserved. Explicitly disallow any volume operations when balance is in progress. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
a46d11a8 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for avail_{data,metadata,system}_alloc_bits fields, which gather info about available allocation profiles in the FS. When chunk is created or read from disk, its profile is OR'ed with the corresponding avail_alloc_bits field. Since SINGLE is denoted by 0 in the on-disk format, currently there is no way to tell when such chunks become avaialble. Restriper needs that information, so add a separate bit for SINGLE profile. This bit is going to be in-memory only, it should never be written out to disk, so it's not a disk format change. However to avoid remappings in future, reserve corresponding on-disk bit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
52ba6929 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: introduce masks for chunk type and profile Chunk's type and profile are encoded in u64 flags field. Introduce masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_* constants, it should be ULL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
6fef8df1 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: get rid of *_alloc_profile fields {data,metadata,system}_alloc_profile fields have been unused for a long time now. Get rid of them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
815745cf |
|
17-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: let ->s_fs_info point to fs_info, not root... the latter can be obtained from the former (by looking as ->tree_root) just as cheaply as we currently are doing the other way round. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
66d7e7f0 |
|
12-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: mark delayed refs as for cow Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value from every call site. The for_cow parameter will later on be used to determine if a ref will change anything with respect to qgroups. Delayed refs coming from relocation are always counted as for_cow, as they don't change subvol quota. Also pass in the fs_info for later use. btrfs_find_all_roots() will use this as an optimization, as changes that are for_cow will not change anything with respect to which root points to a certain leaf. Thus, we don't need to add the current sequence number to those delayed refs. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
c7d22a3c |
|
22-Nov-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: added helper btrfs_next_item() btrfs_next_item() makes the btrfs path point to the next item, crossing leaf boundaries if needed. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
21adbd5c |
|
09-Nov-2011 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: integrate integrity check module into btrfs This is the last part of the patch series. It modifies the btrfs code to use the integrity check module if configured to do so with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set, the only effective change is that code is added that handles the mount option to activate the integrity check. If the mount option is set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code complains in the log and the mount fails with EINVAL. Add the mount option to activate the usage of the integrity check code. Add invocation of btrfs integrity check code init and cleanup function on mount and umount, respectively. Add hook to call btrfs integrity check code version of submit_bh/submit_bio. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
22c44fe6 |
|
30-Nov-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: deal with enospc from dirtying inodes properly Now that we're properly keeping track of delayed inode space we've been getting a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is because a bunch of people call mark_inode_dirty, which is void so we can't return ENOSPC. This needs to be fixed in a few areas 1) file_update_time - this updates the mtime and such when writing to a file, which will call mark_inode_dirty. So copy file_update_time into btrfs so we can call btrfs_dirty_inode directly and return an error if we get one appropriately. 2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't setting ->setattr for symlinks, even though we should have been. This catches one of the cases where we were getting errors in mark_inode_dirty. 3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly instead of mark_inode_dirty. This lets us return errors properly for truncate and chown/anything related to setattr. 4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and print an error if we have one. The only remaining user we can't control for this is touch_atime(), but we don't really want to keep people from walking down the tree if we don't have space to save the atime update, so just complain but don't worry about it. With this patch xfstests 83 complains a handful of times instead of hundreds of times. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
aa38a711 |
|
18-Nov-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix deadlock on metadata reservation when evicting a inode When I ran the xfstests, I found the test tasks was blocked on meta-data reservation. By debugging, I found the reason of this bug: start transaction | v reserve meta-data space | v flush delay allocation -> iput inode -> evict inode ^ | | v wait for delay allocation flush <- reserve meta-data space And besides that, the flush on evicting inode will block the thread, which is reclaiming the memory, and make oom happen easily. Fix this bug by skipping the flush step when evicting inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
291c7d2f |
|
14-Nov-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: wait on caching if we're loading the free space cache We've been hitting panics when running xfstest 13 in a loop for long periods of time. And actually this problem has always existed so we've been hitting these things randomly for a while. Basically what happens is we get a thread coming into the allocator and reading the space cache off of disk and adding the entries to the free space cache as we go. Then we get another thread that comes in and tries to allocate from that block group. Since block_group->cached != BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because if we're doing the old slow way of caching we don't want to hold people up and wait for everything to finish. The problem with this is we could end up discarding the space cache at some arbitrary point in the future, which means we could very well end up allocating space that is either bad, or when the real caching happens it could end up thinking the space isn't in use when it really is and cause all sorts of other problems. The solution is to add a new flag to indicate we are loading the free space cache from disk, and always try to cache the block group if cache->cached != BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else who tries to allocate from the block group will have to wait until it's finished to make sure it completes successfully. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
f1ebcc74 |
|
14-Nov-2011 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush The btrfs snapshotting code requires that once a root has been snapshotted, we don't change it during a commit. But there are two cases to lead to tree corruptions: 1) multi-thread snapshots can commit serveral snapshots in a transaction, and this may change the src root when processing the following pending snapshots, which lead to the former snapshots corruptions; 2) the free inode cache was changing the roots when it root the cache, which lead to corruptions. This fixes things by making sure we force COW the block after we create a snapshot during commiting a transaction, then any changes to the roots will result in COW, and we get all the fs roots and snapshot roots to be consistent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c06a0e12 |
|
04-Nov-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix delayed insertion reservation We all keep getting those stupid warnings from use_block_rsv when running stress.sh, and it's because the delayed insertion stuff is being stupid. It's not the delayed insertion stuffs fault, it's all just stupid. When marking an inode dirty for oh say updating the time on it, we just do a btrfs_join_transaction, which doesn't reserve any space. This is stupid because we're going to have to have space reserve to make this change, but we do it because it's fast because chances are we're going to call it over and over again and it doesn't matter. Well thanks to the delayed insertion stuff this is mostly the case, so we do actually need to make this reservation. So if trans->bytes_reserved is 0 then try to do a normal reservation. If not return ENOSPC which will make the btrfs_dirty_inode start a proper transaction which will let it do the whole ENOSPC dance and reserve enough space for the delayed insertion to steal the reservation from the transaction. The other stupid thing we do is not reserve space for the inode when writing to the thing. Usually this is ok since we have to update the time so we'd have already done all this work before we get to the endio stuff, so it doesn't matter. But this is stupid because we could write the data after the transaction commits where we changed the mtime of the inode so we have to cow all the way down to the inode anyway. This used to be masked by the delalloc reservation stuff, but because we delay the update it doesn't get masked in this case. So again the delayed insertion stuff bites us in the ass. So if our trans->block_rsv is delalloc, just steal the reservation from the delalloc reserve. Hopefully this won't bite us in the ass, but I've said that before. With this patch stress.sh no longer spits out those stupid warnings (famous last words). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6d668dda |
|
03-Nov-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: make a delayed_block_rsv for the delayed item insertion I've been hitting warnings in use_block_rsv when running the delayed insertion stuff. It's because we will readjust global block rsv based on what is in use, which means we could end up discarding reservations that are for the delayed insertion stuff. So instead create a seperate block rsv for the delayed insertion stuff. This will also make it easier to debug problems with the delayed insertion reservations since we will know that only the delayed insertion code touches this block_rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
af31f5e5 |
|
03-Nov-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a log of past tree roots This takes some of the free space in the btrfs super block to record information about most of the roots in the last four commits. It also adds a -o recovery to use the root history log when we're not able to read the tree of tree roots, the extent tree root, the device tree root or the csum root. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6c41761f |
|
13-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: separate superblock items out of fs_info fs_info has now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of it's dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
e688b725 |
|
31-Oct-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix extent pinning bugs in the tree log The tree log had two important bugs that could cause corruptions after a crash. Sometimes we were allowing tree log blocks to be reused after the tree log was committed but before the transaction commit was done. This allowed a future metadata write to overwrite the tree log data. It is fixed by adding a new variant of freeing reserved extents that always pins them. Credit goes to Stefan Behrens and Arne Jansen for many many hours spent tracking this bug down. During tree log replay, we do a pass through the tree log and pin all the extents we find. This makes sure the replay code won't go in and use any of those blocks for new allocations during replay. The problem is the free space cache isn't honoring these pinned extents. So the allocator can end up handing them out, leading to all kinds of problems during replay. The fix here is to force any free space cache to load while we pin the extents, and then to make sure we remove the pinned extents from the free space rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com> Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
36ba022a |
|
17-Oct-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions Currently btrfs_block_rsv_check does 2 things, it will either refill a block reserve like in the truncate or refill case, or it will check to see if there is enough space in the global reserve and possibly refill it. However because of overcommit we could be well overcommitting ourselves just to try and refill the global reserve, when really we should just be committing the transaction. So breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill will try to reserve more metadata if it can and btrfs_block_rsv_check will not, it will only tell you if the factor of the total space is still reserved. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
5b0e95bf |
|
06-Oct-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: inline checksums into the disk free space cache Yeah yeah I know this is how we used to do it and then I changed it, but damnit I'm changing it back. The fact is that writing out checksums will modify metadata, which could cause us to dirty a block group we've already written out, so we have to truncate it and all of it's checksums and re-write it which will write new checksums which could dirty a blockg roup that has already been written and you see where I'm going with this? This can cause unmount or really anything that depends on a transaction to commit to take it's sweet damned time to happen. So go back to the way it was, only this time we're specifically setting NODATACOW because we can't go through the COW pathway anyway and we're doing our own built-in cow'ing by truncating the free space cache. The other new thing is once we truncate the old cache and preallocate the new space, we don't need to do that song and dance at all for the rest of the transaction, we can just overwrite the existing space with the new cache if the block group changes for whatever reason, and the NODATACOW will let us do this fine. So keep track of which transaction we last cleared our cache in and if we cleared it in this transaction just say we're all setup and carry on. This survives xfstests and stress.sh. The inode cache will continue to use the normal csum infrastructure since it only gets written once and there will be no more modifications to the fs tree in a transaction commit. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
2bf64758 |
|
26-Sep-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: allow us to overcommit our enospc reservations One of the things that kills us is the fact that our ENOSPC reservations are horribly over the top in most normal cases. There isn't too much that can be done about this because when we are completely full we really need them to work like this so we don't under reserve. However if there is plenty of unallocated chunks on the disk we can use that to gauge how much we can overcommit. So this patch adds chunk free space accounting so we always know how much unallocated space we have. Then if we fail to make a reservation within our allocated space, check to see if we can overcommit. In the normal flushing case (like with delalloc metadata reservations) we'll take the free space and divide it by 2 if our metadata profile is setup for DUP or any of those, and then divide it by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing case (we really need this reservation now!) we only limit ourselves to half of the free space. This makes this fio test [torrent] filename=torrent-test rw=randwrite size=4g ioengine=sync directory=/mnt/btrfs-test go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB file system. This doesn't seem to break my other enospc tests, but could really use some more testing as this is a super scary change. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
3b16a4e3 |
|
21-Sep-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: use the inode's mapping mask for allocating pages Johannes pointed out we were allocating only kernel pages for doing writes, which is kind of a big deal if you are on 32bit and have more than a gig of ram. So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we don't re-enter. Thanks, Reported-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
4a92b1b8 |
|
29-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: stop passing a trans handle all around the reservation code The only thing that we need to have a trans handle for is in reserve_metadata_bytes and thats to know how much flushing we can do. So instead of passing it around, just check current->journal_info for a trans_handle so we know if we can commit a transaction to try and free up space or not. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
482e6dc5 |
|
19-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check If you run xfstest 224 it you will get lots of messages about not being able to delete inodes and that they will be cleaned up next mount. This is because btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to flush, so if there was not enough space, it simply failed. But in truncate and evict case we could easily flush space to try and get enough space to do our work, so make btrfs_block_rsv_check take a flush argument to pass down to reserve_metadata_bytes. Now xfstests 224 runs fine without all those complaints. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
07127184 |
|
19-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: reduce the amount of space needed for truncates With btrfs_truncate_inode_items we always return if we have to go to another leaf, which makes us do our reservation again. This means we will only ever modify one leaf at a time, so we only need 1 items worth of slack space. Also, since we are deleting we will not be creating nodes as we go down, if anything we'll be free'ing them as we merge them together, so make a different calculation for truncate which will only have the worst case useage of COW'ing the entire path down to the leaf. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
5e962c78 |
|
08-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill btrfs_truncate_reserve_metadata Since we've optimized the truncate path, we no longer require this function. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
dabdb640 |
|
07-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill unused parts of block_rsv The priority and refill_used flags are not used anymore, and neither is the usage counter, so just remove them from btrfs_block_rsv. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
37be25bc |
|
05-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill the durable block rsv stuff This is confusing code and isn't used by anything anymore, so delete it. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
dba68306 |
|
04-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill the orphan space calculation for snapshots This patch kills off the calculation for the amount of space needed for the orphan operations during a snapshot. The thing is we only do snapshots on commit, so any space that is in the block_rsv->freed[] isn't going to be in the new snapshot anyway, so there isn't any reason to require that space to be reserved for the snapshot to occur. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
fb25e914 |
|
26-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: use bytes_may_use for all ENOSPC reservations We have been using bytes_reserved for metadata reservations, which is wrong since we use that to keep track of outstanding reservations from the allocator. This resulted in us doing a lot of silly things to make sure we don't allocate a bunch of metadata chunks since we never had a real view of how much space was actually in use by metadata. This passes Arne's enospc test and xfstests as well as my own enospc tests. Hopefully this will get us moving in the right direction. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
7414a03f |
|
23-May-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: initial readahead code and prototypes This is the implementation for the generic read ahead framework. To trigger a readahead, btrfs_reada_add must be called. It will start a read ahead for the given range [start, end) on tree root. The returned handle can either be used to wait on the readahead to finish (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach). The read ahead works as follows: On btrfs_reada_add, the root of the tree is inserted into a radix_tree. reada_start_machine will then search for extents to prefetch and trigger some reads. When a read finishes for a node, all contained node/leaf pointers that lie in the given range will also be enqueued. The reads will be triggered in sequential order, thus giving a big win over a naive enumeration. It will also make use of multi-device layouts. Each disk will have its on read pointer and all disks will by utilized in parallel. Also will no two disks read both sides of a mirror simultaneously, as this would waste seeking capacity. Instead both disks will read different parts of the filesystem. Any number of readaheads can be started in parallel. The read order will be determined globally, i.e. 2 parallel readaheads will normally finish faster than the 2 started one after another. Changes v2: - protect root->node by transaction instead of node_lock - fix missed branches: The readahead had a too simple check to determine if a branch from a node should be checked or not. It now also records the upper bound of each node to see if the requested RA range lies within. - use KERN_CONT to debug output, to avoid line breaks - defer reada_start_machine to worker to avoid deadlock Changes v3: - protect root->node by rcu Changes v5: - changed EIO-semantics of reada_tree_block_flagged - remove spin_lock from reada_control and make elems an atomic_t - remove unused read_total from reada_control - kill reada_key_cmp, use btrfs_comp_cpu_keys instead - use kref-style release functions where possible - return struct reada_control * instead of void * from btrfs_reada_add Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
90519d66 |
|
23-May-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: state information for readahead Add state information for readahead to btrfs_fs_info and btrfs_device Changes v2: - don't wait in radix_trees - add own set of workers for readahead Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
c97c2916 |
|
03-Aug-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: use plain page_address() in header fields setget functions We've stopped using highmem for extent buffers. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cb1b69f4 |
|
09-Aug-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: forced readonly when btrfs_drop_snapshot() fails The filesystem turns readonly instead of returning the error to the caller when detected error in btrfs_drop_snapshot(). and, because the caller doesn't check the error, the function type is changed to 'void'. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9b89d95a |
|
13-Jul-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: make acl functions really no-op if acl is not enabled So there's no overhead for something we don't use. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bf5f32ec |
|
14-Jul-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: make btrfs_set_root_node void This is fairly trivial - btrfs_set_root_node() - always returns zero so we can just make it void. All callers ignore the return code now anyway. I also made sure to check that none of the functions that btrfs_set_root_node() calls returns an error that we might have needed to catch and pass back. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b6973aa6 |
|
19-Jul-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix readahead in file defrag We passed the wrong value to btrfs_force_ra(). Fix this by changing the argument of btrfs_force_ra() from last_index to nr_page. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bd681513 |
|
16-Jul-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: switch the btrfs tree locks to reader/writer The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9e0baf60 |
|
15-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix enospc problems with delalloc So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bab39bf9 |
|
30-Jun-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: use a worker thread to do caching A user reported a deadlock when copying a bunch of files. This is because they were low on memory and kthreadd got hung up trying to migrate pages for an allocation when starting the caching kthread. The page was locked by the person starting the caching kthread. To fix this we just need to use the async thread stuff so that the threads are already created and we don't have to worry about deadlocks. Thanks, Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
4e34e719 |
|
23-Jul-2011 |
Christoph Hellwig <hch@lst.de> |
fs: take the ACL checks to common code Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
02c24a82 |
|
16-Jul-2011 |
Josef Bacik <josef@redhat.com> |
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b2675157 |
|
18-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: implement our own ->llseek In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek. Basically for the normal SEEK_*'s we will just defer to the generic helper, and for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest hole or data. Currently this helper doesn't check for delalloc bytes for prealloc space, so for now treat prealloc as data until that is fixed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0ee5dc67 |
|
07-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: kill magical embedded struct superblock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7e40145e |
|
20-Jun-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
->permission() sanitizing: don't pass flags to ->check_acl() not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fdb5effd |
|
07-Jun-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: serialize flushers in reserve_metadata_bytes We keep having problems with early enospc, and that's because our method of making space is inherently racy. The problem is we can have one guy trying to make space for himself, and in the meantime people come in and steal his reservation. In order to stop this we make a waitqueue and put anybody who comes into reserve_metadata_bytes on that waitqueue if somebody is trying to make more space. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
b5009945 |
|
07-Jun-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: do transaction space reservation before joining the transaction We have to do weird things when handling enospc in the transaction joining code. Because we've already joined the transaction we cannot commit the transaction within the reservation code since it will deadlock, so we have to return EAGAIN and then make sure we don't retry too many times. Instead of doing this, just do the reservation the normal way before we join the transaction, that way we can do whatever we want to try and reclaim space, and then if it fails we know for sure we are out of space and we can return ENOSPC. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
0942caa3 |
|
28-Jun-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: add missing options displayed in mount output There are three missed mount options settable by user which are not currently displayed in mount output. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
46e4edbf |
|
23-Jun-2011 |
Jesper Juhl <jj@chaosbits.net> |
Remove unneeded version.h includes from fs/ It was pointed out by 'make versioncheck' that some includes of linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and fs/omfs/file.c). This patch removes them. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9fe6a50f |
|
16-Jun-2011 |
Maarten Lankhorst <m.b.lankhorst@gmail.com> |
btrfs: Remove unused sysfs code Removes code no longer used. The sysfs file itself is kept, because the btrfs developers expressed interest in putting new entries to sysfs. Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7585717f |
|
13-Jun-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix relocation races The recent commit to get rid of our trans_mutex introduced some races with block group relocation. The problem is that relocation needs to do some record keeping about each root, and it was relying on the transaction mutex to coordinate things in subtle ways. This fix adds a mutex just for the relocation code and makes sure it doesn't have a big impact on normal operations. The race is really fixed in btrfs_record_root_in_trans, which is where we step back and wait for the relocation code to finish accounting setup. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7841cb28 |
|
31-May-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: add helper for fs_info->closing wrap checking of filesystem 'closing' flag and fix a few missing memory barriers. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
4b9465cb |
|
03-Jun-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add mount -o inode_cache This makes the inode map cache default to off until we fix the overflow problem when the free space crcs don't fit inside a single page. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
aa385729 |
|
27-May-2011 |
Christoph Hellwig <hch@infradead.org> |
fs: pass exact type of data dirties to ->dirty_inode Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or anything else, so that the filesystem can track internally if it needs to push out a transaction for fdatasync or not. This is just the prototype change with no user for it yet. I plan to push large XFS changes for the next merge window, and getting this trivial infrastructure in this window would help a lot to avoid tree interdependencies. Also remove incorrect comments that ->dirty_inode can't block. That has been changed a long time ago, and many implementations rely on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4cb5300b |
|
24-May-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add mount -o auto_defrag This will detect small random writes into files and queue the up for an auto defrag process. It isn't well suited to database workloads yet, but works for smaller files such as rpm, sqlite or bdb databases. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0956c798 |
|
17-May-2011 |
Andi Kleen <ak@linux.intel.com> |
BTRFS: Remove unused node_lock 240f62c8756 replaced the node_lock with rcu_read_lock, but forgot to remove the actual lock in the data structure. Remove it here. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d82a6f1d |
|
11-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill BTRFS_I(inode)->block_group Originally this was going to be used as a way to give hints to the allocator, but frankly we can get much better hints elsewhere and it's not even used at all for anything usefull. In addition to be completely useless, when we initialize an inode we try and find a freeish block group to set as the inodes block group, and with a completely full 40gb fs this takes _forever_, so I imagine with say 1tb fs this is just unbearable. So just axe the thing altoghether, we don't need it and it saves us 8 bytes in the inode and saves us 500 microseconds per inode lookup in my testcase. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
fcb80c2a |
|
03-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix how we do space reservation for truncate The ceph guys keep running into problems where we have space reserved in our orphan block rsv when freeing it up. This is because they tend to do snapshots alot, so their truncates tend to use a bunch of space, so when we go to do things like update the inode we have to steal reservation space in order to make the reservation happen. This happens because truncate can use as much space as it freaking feels like, but we still have to hold space for removing the orphan item and updating the inode, which will definitely always happen. So in order to fix this we need to split all of the reservation stuf up. So with this patch we have 1) The orphan block reserve which only holds the space for deleting our orphan item when everything is over. 2) The truncate block reserve which gets allocated and used specifically for the space that the truncate will use on a per truncate basis. 3) The transaction will always have 1 item's worth of data reserved so we can update the inode normally. Hopefully this will make the ceph problem go away. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
a4abeea4 |
|
11-Apr-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill trans_mutex We use trans_mutex for lots of things, here's a basic list 1) To serialize trans_handles joining the currently running transaction 2) To make sure that no new trans handles are started while we are committing 3) To protect the dead_roots list and the transaction lists Really the serializing trans_handles joining is not too hard, and can really get bogged down in acquiring a reference to the transaction. So replace the trans_mutex with a trans_lock spinlock and use it to do the following 1) Protect fs_info->running_transaction. All trans handles have to do is check this, and then take a reference of the transaction and keep on going. 2) Protect the fs_info->trans_list. This doesn't get used too much, basically it just holds the current transactions, which will usually just be the currently committing transaction and the currently running transaction at most. 3) Protect the dead roots list. This is only ever processed by splicing the list so this is relatively simple. 4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using the trans_mutex before, so this is a pretty straightforward change. 5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over the entirety of the commit we need to have a way to block new people from creating a new transaction while we're doing our work. So we set no_trans_join and in join_transaction we test to see if that is set, and if it is we do a wait_on_commit. 6) Make the transaction use count atomic so we don't need to take locks to modify it when we're dropping references. 7) Add a commit_lock to the transaction to make sure multiple people trying to commit the same transaction don't race and commit at the same time. 8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl trans. I have tested this with xfstests, but obviously it is a pretty hairy change so lots of testing is greatly appreciated. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
16cdcec7 |
|
22-Apr-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: implement delayed inode items operation Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dave@jikos.cz> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4ea02885 |
|
12-May-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: use unsigned type for single bit bitfield Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
8628764e |
|
23-Mar-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: add readonly flag setting the readonly flag prevents writes in case an error is detected Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
475f6387 |
|
11-Mar-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
btrfs: new ioctls for scrub adds ioctls necessary to start and cancel scrubs, to get current progress and to get info about devices to be scrubbed. Note that the scrub is done per-device and that the ioctl only returns after the scrub for this devices is finished or has been canceled. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
a2de733c |
|
08-Mar-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: scrub This adds an initial implementation for scrub. It works quite straightforward. The usermode issues an ioctl for each device in the fs. For each device, it enumerates the allocated device chunks. For each chunk, the contained extents are enumerated and the data checksums fetched. The extents are read sequentially and the checksums verified. If an error occurs (checksum or EIO), a good copy is searched for. If one is found, the bad copy will be rewritten. All enumerations happen from the commit roots. During a transaction commit, the scrubs get paused and afterwards continue from the new roots. This commit is based on the series originally posted to linux-btrfs with some improvements that resulted from comments from David Sterba, Ilya Dryomov and Jan Schmidt. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
f2a97a9d |
|
04-May-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: remove all unused functions Remove static and global declarations and/or definitions. Reduces size of btrfs.ko by ~3.4kB. text data bss dec hex filename 402081 7464 200 409745 64091 btrfs.ko.base 398620 7144 200 405964 631cc btrfs.ko.remove-all Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
621496f4 |
|
03-May-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: remove unused function prototypes function prototypes without a body Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
b3b4aa74 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: drop unused parameter from btrfs_release_path parameter tree root it's not used since commit 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer interface for large blocksizes") Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
306e16ce |
|
19-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: rename variables clashing with global function names reported by gcc -Wshadow: page_index, page_offset, new_inode, dev_name Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
e9c54999 |
|
27-Apr-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Revert wrong fixes for common misspellings These changes were incorrectly fixed by codespell. They were now manually corrected. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
#
82d5902d |
|
19-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Support reading/writing on disk free ino cache This is similar to block group caching. We dedicate a special inode in fs tree to save free ino cache. At the very first time we create/delete a file after mount, the free ino cache will be loaded from disk into memory. When the fs tree is commited, the cache will be written back to disk. To keep compatibility, we check the root generation against the generation of the special inode when loading the cache, so the loading will fail if the btrfs filesystem was mounted in an older kernel before. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
581bb050 |
|
19-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Cache free inode numbers in memory Currently btrfs stores the highest objectid of the fs tree, and it always returns (highest+1) inode number when we create a file, so inode numbers won't be reclaimed when we delete files, so we'll run out of inode numbers as we keep create/delete files in 32bits machines. This fixes it, and it works similarly to how we cache free space in block cgroups. We start a kernel thread to read the file tree. By scanning inode items, we know which chunks of inode numbers are free, and we cache them in an rb-tree. Because we are searching the commit root, we have to carefully handle the cross-transaction case. The rb-tree is a hybrid extent+bitmap tree, so if we have too many small chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram of extents, and a bitmap will be used if we exceed this threshold. The extents threshold is adjusted in runtime. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
34d52cb6 |
|
28-Mar-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Make free space cache code generic So we can re-use the code to cache free inode numbers. The change is quite straightforward. Two new structures are introduced. - struct btrfs_free_space_ctl We move those variables that are used for caching free space from struct btrfs_block_group_cache to this new struct. - struct btrfs_free_space_op We do block group specific work (e.g. calculation of extents threshold) through functions registered in this struct. And then we can remove references to struct btrfs_block_group_cache. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
6d74119f |
|
11-Apr-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: avoid taking the chunk_mutex in do_chunk_alloc Everytime we try to allocate disk space we try and see if we can pre-emptively allocate a chunk, but in the common case we don't allocate anything, so there is no sense in taking the chunk_mutex at all. So instead if we are allocating a chunk, mark it in the space_info so we don't get two people trying to allocate at the same time. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>
|
#
be1a12a0 |
|
06-Apr-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: deal with the case that we run out of space in the cache Currently we don't handle running out of space in the cache, so to fix this we keep track of how far in the cache we are. Then we only dirty the pages if we successfully modify all of them, otherwise if we have an error or run out of space we can just drop them and not worry about the vm writing them out. Thanks, Tested-by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
08fe4db1 |
|
27-Mar-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Fix uninitialized root flags for subvolumes root_item->flags and root_item->byte_limit are not initialized when a subvolume is created. This bug is not revealed until we added readonly snapshot support - now you mount a btrfs filesystem and you may find the subvolumes in it are readonly. To work around this problem, we steal a bit from root_item->inode_item->flags, and use it to indicate if those fields have been properly initialized. When we read a tree root from disk, we check if the bit is set, and if not we'll set the flag and initialize the two fields of the root item. Reported-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Andreas Philipp <philipp.andreas@gmail.com> cc: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c59021f8 |
|
06-Mar-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: fix OOPS of empty filesystem after balance btrfs will remove unused block groups after balance. When a empty filesystem is balanced, the block group with tag "DATA" may be dropped, and after umount and mount again, it will not find "DATA" space_info and lead to OOPS. So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS. Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f7039b1d |
|
24-Mar-2011 |
Li Dongyang <lidongyang@novell.com> |
Btrfs: add btrfs_trim_fs() to handle FITRIM We take an free extent out from allocator, trim it, then put it back, but before we trim the block group, we should make sure the block group is cached, so plus a little change to make cache_block_group() run without a transaction. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5378e607 |
|
24-Mar-2011 |
Li Dongyang <lidongyang@novell.com> |
Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes Callers of btrfs_discard_extent() should check if we are mounted with -o discard, as we want to make fitrim to work even the fs is not mounted with -o discard. Also we should use REQ_DISCARD to map the free extent to get a full mapping, last we only return errors if 1. the error is not a EOPNOTSUPP 2. no device supports discard Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b4d00d56 |
|
24-Mar-2011 |
Li Dongyang <lidongyang@novell.com> |
Btrfs: make update_reserved_bytes() public Make the function public as we should update the reserved extents calculations after taking out an extent for trimming. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
75e7cb7f |
|
22-Mar-2011 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: Per file/directory controls for COW and compression Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. According to Chris's comment, there should be just one true compression method(probably LZO) stored in the super. However, before this, we would wait for that one method is stable enough to be adopted into the super. So I list it as a long term goal, and just store it in ram today. After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to control file and directory's datacow and compression attribute. NOTE: - The compression type is selected by such rules: If we mount btrfs with compress options, ie, zlib/lzo, the type is it. Otherwise, we'll use the default compress type (zlib today). v1->v2: - rebase to the latest btrfs. v2->v3: - fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW will be screwed by inheritance from parent directory. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1abe9b8a |
|
24-Mar-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: add initial tracepoint support for btrfs Tracepoints can provide insight into why btrfs hits bugs and be greatly helpful for debugging, e.g dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0 dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0 btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0) btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0) btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8 flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0) flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0) flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0) Here is what I have added: 1) ordere_extent: btrfs_ordered_extent_add btrfs_ordered_extent_remove btrfs_ordered_extent_start btrfs_ordered_extent_put These provide critical information to understand how ordered_extents are updated. 2) extent_map: btrfs_get_extent extent_map is used in both read and write cases, and it is useful for tracking how btrfs specific IO is running. 3) writepage: __extent_writepage btrfs_writepage_end_io_hook Pages are cirtical resourses and produce a lot of corner cases during writeback, so it is valuable to know how page is written to disk. 4) inode: btrfs_inode_new btrfs_inode_request btrfs_inode_evict These can show where and when a inode is created, when a inode is evicted. 5) sync: btrfs_sync_file btrfs_sync_fs These show sync arguments. 6) transaction: btrfs_transaction_commit In transaction based filesystem, it will be useful to know the generation and who does commit. 7) back reference and cow: btrfs_delayed_tree_ref btrfs_delayed_data_ref btrfs_delayed_ref_head btrfs_cow_block Btrfs natively supports back references, these tracepoints are helpful on understanding btrfs's COW mechanism. 8) chunk: btrfs_chunk_alloc btrfs_chunk_free Chunk is a link between physical offset and logical offset, and stands for space infomation in btrfs, and these are helpful on tracing space things. 9) reserved_extent: btrfs_reserved_extent_alloc btrfs_reserved_extent_free These can show how btrfs uses its space. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4e69b598 |
|
21-Mar-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: cleanup how we setup free space clusters This patch makes the free space cluster refilling code a little easier to understand, and fixes some things with the bitmap part of it. Currently we either want to refill a cluster with 1) All normal extent entries (those without bitmaps) 2) A bitmap entry with enough space The current code has this ugly jump around logic that will first try and fill up the cluster with extent entries and then if it can't do that it will try and find a bitmap to use. So instead split this out into two functions, one that tries to find only normal entries, and one that tries to find bitmaps. This also fixes a suboptimal thing we would do with bitmaps. If we used a bitmap we would just tell the cluster that we were pointing at a bitmap and it would do the tree search in the block group for that entry every time we tried to make an allocation. Instead of doing that now we just add it to the clusters group. I tested this with my ENOSPC tests and xfstests and it survived. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
22a94d44 |
|
16-Mar-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: add checks to verify dir items are correct We need to make sure the dir items we get are valid dir items. So any time we try and read one check it with verify_dir_item, which will do various sanity checks to make sure it looks sane. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
66b4ffd1 |
|
31-Jan-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: handle errors in btrfs_orphan_cleanup If we cannot truncate an inode for some reason we will never delete the orphan item associated with that inode, which means that we will loop forever in btrfs_orphan_cleanup. Instead of doing this just return error so we fail to mount. It sucks, but hey it's better than hanging. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
a41ad394 |
|
31-Jan-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: convert to the new truncate sequence ->truncate() is going away, instead all of the work needs to be done in ->setattr(). So this converts us over to do this. It's fairly straightforward, just get rid of our .truncate inode operation and call btrfs_truncate() directly from btrfs_setsize. This works out better for us since truncate can technically return ENOSPC, and before we had no way of letting anybody know. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
dc89e982 |
|
28-Jan-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: use a slab for the free space entries Since we alloc/free free space entries a whole lot, lets use a slab to keep track of them. This makes some of my tests slightly faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
36e39c40 |
|
12-Mar-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: break out of shrink_delalloc earlier Josef had changed shrink_delalloc to exit after three shrink attempts, which wasn't quite enough because new writers could race in and steal free space. But it also fixed deadlocks and stalls as we tried to recover delalloc reservations. The code was tweaked to loop 1024 times, and would reset the counter any time a small amount of progress was made. This was too drastic, and with a lot of writers we can end up stuck in shrink_delalloc forever. The shrink_delalloc loop is fairly complex because the caller is looping too, and the caller will go ahead and force a transaction commit to make sure we reclaim space. This reworks things to exit shrink_delalloc when we've forced some writeback and the delalloc reservations have gone down. This means the writeback has not just started but has also finished at least some of the metadata changes required to reclaim delalloc space. If we've got this wrong, we're returning ENOSPC too early, which is a big improvement over the current behavior of hanging the machine. Test 224 in xfstests hammers on this nicely, and with 1000 writers trying to fill a 1GB drive we get our first ENOSPC at 93% full. The other writers are able to continue until we get 100%. This is a worst case test for btrfs because the 1000 writers are doing small IO, and the small FS size means we don't have a lot of room for metadata chunks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c87f08ca |
|
16-Feb-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allow balance to explicitly allocate chunks as it relocates Btrfs device shrinking and balancing ends up reallocating all the blocks in order to allow COW to move them to new destinations. It is somewhat awkward in terms of ENOSPC because most of the enospc code is built around the idea that some operation on a reference counted tree triggers allocations in the non-reference counted trees. This commit changes the balancing code to deal with enospc by trying to allocate a new chunk. If that allocation succeeds, we go ahead and retry whatever failed due to enospc. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
91435650 |
|
16-Feb-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: put ENOSPC debugging under a mount option ENOSPC in btrfs is getting to the point where the extra debugging isn't required. I've put it under mount -o enospc_debug just in case someone is having difficult problems. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
acce952b |
|
06-Jan-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: forced readonly mounts on errors This patch comes from "Forced readonly mounts on errors" ideas. As we know, this is the first step in being more fault tolerant of disk corruptions instead of just using BUG() statements. The major content: - add a framework for generating errors that should result in filesystems going readonly. - keep FS state in disk super block. - make sure that all of resource will be freed and released at umount time. - make sure that fter FS is forced readonly on error, there will be no more disk change before FS is corrected. For this, we should stop write operation. After this patch is applied, the conversion from BUG() to such a framework can happen incrementally. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f8b18087 |
|
12-Jan-2011 |
Stefan Schmidt <stefan@datenfreihafen.org> |
fs/btrfs: Fix build of ctree Fix the build failure in some configurations: CC [M] fs/btrfs/ctree.o In file included from fs/btrfs/ctree.c:21:0: fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type make[2]: *** [fs/btrfs/ctree.o] Error 1 make[1]: *** [fs/btrfs] Error 2 make: *** [fs] Error 2 caused by commit 57cc7215b708 ("headers: kobject.h redux") We need to include kobject.h here. Reported-by: Jeff Garzik <jeff@garzik.org> Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6d07bcec |
|
05-Jan-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: fix wrong free space information of btrfs When we store data by raid profile in btrfs with two or more different size disks, df command shows there is some free space in the filesystem, but the user can not write any data in fact, df command shows the wrong free space information of btrfs. # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10 # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 28.00KB devid 1 size 5.01GB used 2.03GB path /dev/sda9 devid 2 size 10.00GB used 2.01GB path /dev/sda10 # btrfs device scan /dev/sda9 /dev/sda10 # mount /dev/sda9 /mnt # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999 (fill the filesystem) # sync # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 3.99GB devid 1 size 5.01GB used 5.01GB path /dev/sda9 devid 2 size 10.00GB used 4.99GB path /dev/sda10 It is because btrfs cannot allocate chunks when one of the pairing disks has no space, the free space on the other disks can not be used for ever, and should be subtracted from the total space, but btrfs doesn't subtract this space from the total. It is strange to the user. This patch fixes it by calcing the free space that can be used to allocate chunks. Implementation: 1. get all the devices free space, and align them by stripe length. 2. sort the devices by the free space. 3. check the free space of the devices, 3.1. if it is not zero, and then check the number of the devices that has more free space than this device, if the number of the devices is beyond the min stripe number, the free space can be used, and add into total free space. if the number of the devices is below the min stripe number, we can not use the free space, the check ends. 3.2. if the free space is zero, check the next devices, goto 3.1 This implementation is just likely fake chunk allocation. After appling this patch, df can show correct space information: # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 0 100% /mnt Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f580eb09 |
|
12-Jan-2011 |
Stefan Schmidt <stefan@datenfreihafen.org> |
fs/btrfs: Fix build of ctree CC [M] fs/btrfs/ctree.o In file included from fs/btrfs/ctree.c:21:0: fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type make[2]: *** [fs/btrfs/ctree.o] Error 1 make[1]: *** [fs/btrfs] Error 2 make: *** [fs] Error 2 We need to include kobject.h here. Reported-by: Jeff Garzik <jeff@garzik.org> Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b74c79e9 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: provide rcu-walk aware permission i_ops Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
b83cc969 |
|
20-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Add readonly snapshots support Usage: Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call ioctl(BTRFS_I0CTL_SNAP_CREATE_V2). Implementation: - Set readonly bit of btrfs_root_item->flags. - Add readonly checks in btrfs_permission (inode_permission), btrfs_setattr, btrfs_set/remove_xattr and some ioctls. Changelog for v3: - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags. - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
a6fa6fae |
|
25-Oct-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Add lzo compression support Lzo is a much faster compression algorithm than gzib, so would allow more users to enable transparent compression, and some users can choose from compression ratio and speed for different applications Usage: # mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt or # mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt "-o compress" without argument is still allowed for compatability. Compatibility: If we mount a filesystem with lzo compression, it will not be able be mounted in old kernels. One reason is, otherwise btrfs will directly dump compressed data, which sits in inline extent, to user. Performance: The test copied a linux source tarball (~400M) from an ext4 partition to the btrfs partition, and then extracted it. (time in second) lzo zlib nocompress copy: 10.6 21.7 14.9 extract: 70.1 94.4 66.6 (data size in MB) lzo zlib nocompress copy: 185.87 108.69 394.49 extract: 193.80 132.36 381.21 Changelog: v1 -> v2: - Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig. - Add incompability flag. - Fix error handling in compress code. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
261507a0 |
|
16-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Allow to add new compression algorithm Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
0410c94a |
|
19-Nov-2010 |
Mariusz Kozlowski <mk@lab.zgora.pl> |
btrfs: make 1-bit signed fileds unsigned Fixes these sparse warnings: fs/btrfs/ctree.h:811:17: error: dubious one-bit signed bitfield fs/btrfs/ctree.h:812:20: error: dubious one-bit signed bitfield fs/btrfs/ctree.h:813:19: error: dubious one-bit signed bitfield Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4260f7c7 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed Add a mount option user_subvol_rm_allowed that allows users to delete a (potentially non-empty!) subvol when they would otherwise we allowed to do an rmdir(2). We duplicate the may_delete() checks from the core VFS code to implement identical security checks (minus the directory size check). We additionally require that the user has write+exec permission on the subvol root inode. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bb9c12c9 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: async transaction commit Add support for an async transaction commit that is ordered such that any subsequent operations will join the following transaction, but does not wait until the current commit is fully on disk. This avoids much of the latency associated with the btrfs_commit_transaction for callers concerned with serialization and not safety. The wait_for_unblock flag controls whether we wait for the 'middle' portion of commit_transaction to complete, which is necessary if the caller expects some of the modifications contained in the commit to be available (this is the case for subvol/snapshot creation). Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
88c2ba3b |
|
21-Sep-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: Add a clear_cache mount option If something goes wrong with the free space cache we need a way to make sure it's not loaded on mount and that it's cleared for everybody. When you pass the clear_cache option it will make it so all block groups are setup to be cleared, which keeps them from being loaded and then they will be truncated when the transaction is committed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
67377734 |
|
16-Sep-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: add support for mixed data+metadata block groups There are just a few things that need to be fixed in the kernel to support mixed data+metadata block groups. Mostly we just need to make sure that if we are using mixed block groups that we continue to allocate mixed block groups as we need them. Also we need to make sure __find_space_info will find our space info if we search for DATA or METADATA only. Tested this with xfstests and it works nicely. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
0cb59c99 |
|
01-Jul-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: write out free space cache This is a simple bit, just dump the free space cache out to our preallocated inode when we're writing out dirty block groups. There are a bunch of changes in inode.c in order to account for special cases. Mostly when we're doing the writeout we're holding trans_mutex, so we need to use the nolock transacation functions. Also we can't do asynchronous completions since the async thread could be blocked on already completed IO waiting for the transaction lock. This has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC tests. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
0af3d00b |
|
21-Jun-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: create special free space cache inode In order to save free space cache, we need an inode to hold the data, and we need a special item to point at the right inode for the right block group. So first, create a special item that will point to the right inode, and the number of extent entries we will have and the number of bitmaps we will have. We truncate and pre-allocate space everytime to make sure it's uptodate. This feature will be turned on as soon as you mount with -o space_cache, however it is safe to boot into old kernels, they will just generate the cache the old fashion way. When you boot back into a newer kernel we will notice that we modified and not the cache and automatically discard the cache. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
8bb8ab2e |
|
15-Oct-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: rework how we reserve metadata bytes With multi-threaded writes we were getting ENOSPC early because somebody would come in, start flushing delalloc because they couldn't make their reservation, and in the meantime other threads would come in and use the space that was getting freed up, so when the original thread went to check to see if they had space they didn't and they'd return ENOSPC. So instead if we have some free space but not enough for our reservation, take the reservation and then start doing the flushing. The only time we don't take reservations is when we've already overcommitted our space, that way we don't have people who come late to the party way overcommitting ourselves. This also moves all of the retrying and flushing code into reserve_metdata_bytes so it's all uniform. This keeps my fs_mark test from returning -ENOSPC as soon as it starts and actually lets me fill up the disk. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
0019f10d |
|
15-Oct-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: re-work delalloc flushing Currently we try and flush delalloc, but we only do that in a sort of weak way, which works fine in most cases but if we're under heavy pressure we need to be able to wait for flushing to happen. Also instead of checking the bytes reserved in the block_rsv, check the space info since it is more accurate. The sync option will be used in a future patch. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
89a55897 |
|
14-Oct-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix df regression The new ENOSPC stuff breaks out the raid types which breaks the way we were reporting df to the system. This fixes it back so that Available is the total space available to data and used is the actual bytes used by the filesystem. This means that Available is Total - data used - all of the metadata space. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
45321ac5 |
|
07-Jun-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
Make ->drop_inode() just return whether inode needs to be dropped ... and let iput_final() do the actual eviction or retention Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
bd555975 |
|
07-Jun-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
convert btrfs to ->evict_inode() NB: do we want btrfs_wait_ordered_range() on eviction of inodes with positive i_nlink on subvolume with zero root_refs? If not, btrfs_evict_inode() can be simplified by unconditionally bailing out in case of i_nlink > 0 in the very beginning... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7ea80859 |
|
26-May-2010 |
Christoph Hellwig <hch@lst.de> |
drop unused dentry argument to ->fsync Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4b46fce2 |
|
23-May-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: add basic DIO read/write support This provides basic DIO support for reading and writing. It does not do the work to recover from mismatching checksums, that will come later. A few design changes have been made from Jim's code (sorry Jim!) 1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO code in order to account for all of BTRFS's oddities, but thanks to that work it seems like the best bet is to just ignore compression and such and just opt to fallback on buffered IO. 2) Fallback on buffered IO for compressed or inline extents. Jim's code did it's own buffering to make dio with compressed extents work. Now we just fallback onto normal buffered IO. 3) Use ordered extents for the writes so that all of the lock_extent() lookup_ordered() type checks continue to work. 4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with DIO writes. I've tested this with fsx and everything works great. This patch depends on my dio and filemap.c patches to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3fd0a558 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Metadata ENOSPC handling for balance This patch adds metadata ENOSPC handling for the balance code. It is consisted by following major changes: 1. Avoid COW tree leave in the phrase of merging tree. 2. Handle interaction with snapshot creation. 3. make the backref cache can live across transactions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
efa56464 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Pre-allocate space for data relocation Pre-allocate space for data relocation. This can detect ENOPSC condition caused by fragmentation of free space. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d68fc57b |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Metadata reservation for orphan inodes reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8929ecfa |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Introduce global metadata reservation Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ca1f7ce |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Update metadata reservation for delayed allocation Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a22285a6 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Integrate metadata reservation with start_transaction Besides simplify the code, this change makes sure all metadata reservation for normal metadata operations are released after committing transaction. Changes since V1: Add code that check if unlink and rmdir will free space. Add ENOSPC handling for clone ioctl. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f0486c68 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Introduce contexts for metadata reservation Introducing metadata reseravtion contexts has two major advantages. First, it makes metadata reseravtion more traceable. Second, it can reclaim freed space and re-add them to the itself after transaction committed. Besides add btrfs_block_rsv structure and related helper functions, This patch contains following changes: Move code that decides if freed tree block should be pinned into btrfs_free_tree_block(). Make space accounting more accurate, mainly for handling read only block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5da9d01b |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Shrink delay allocated space in a synchronized Shrink delayed allocation space in a synchronized manner is more controllable than flushing all delay allocated space in an async thread. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
424499db |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Kill allocate_wait in space_info We already have fs_info->chunk_mutex to avoid concurrent chunk creation. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b742bb82 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Link block groups of different raid types The size of reserved space is stored in space_info. If block groups of different raid types are linked to separate space_info, changing allocation profile will corrupt reserved space accounting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
287a0ab9 |
|
19-Mar-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill max_extent mount option As Yan pointed out, theres not much reason for all this complicated math to account for file extents being split up into max_extent chunks, since they are likely to all end up in the same leaf anyway. Since there isn't much reason to use max_extent, just remove the option altogether so we have one less thing we need to test. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
91748467 |
|
28-Feb-2010 |
Akinobu Mita <akinobu.mita@gmail.com> |
btrfs: use memparse Use memparse() instead of its own private implementation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2ac55d41 |
|
03-Feb-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: cache the extent state everywhere we possibly can V2 This patch just goes through and fixes everybody that does lock_extent() blah unlock_extent() to use lock_extent_bits() blah unlock_extent_cached() and pass around a extent_state so we only have to do the searches once per function. This gives me about a 3 mb/s boots on my random write test. I have not converted some things, like the relocation and ioctl's, since they aren't heavily used and the relocation stuff is in the middle of being re-written. I also changed the clear_extent_bit() to only unset the cached state if we are clearing EXTENT_LOCKED and related stuff, so we can do things like this lock_extent_bits() clear delalloc bits unlock_extent_cached() without losing our cached state. I tested this thoroughly and turned on LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1e701a32 |
|
11-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add new defrag-range ioctl. The btrfs defrag ioctl was limited to doing the entire file. This commit adds a new interface that can defrag a specific range inside the file. It can also force compression on the file, allowing you to selectively compress individual files after they were created, even when mount -o compress isn't turned on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6ef5ed0d |
|
11-Dec-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: add ioctl and incompat flag to set the default mount subvol This patch needs to go along with my previous patch. This lets us set the default dir item's location to whatever root we want to use as our default mounting subvol. With this we don't have to use mount -o subvol=<tree id> anymore to mount a different subvol, we can just set the new one and it will just magically work. I've done some moderate testing with this, mostly just switching the default mount around, mounting subvols and the default mount at the same time and such, everything seems to work. Thanks, Older kernels would generally be able to still mount the filesystem with the default subvolume set, but it would result in a different volume being mounted, which could be an even more unpleasant suprise for users. So if you set your default subvolume, you can't go back to older kernels. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
73f73415 |
|
04-Dec-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: change how we mount subvolumes This work is in preperation for being able to set a different root as the default mounting root. There is currently a problem with how we mount subvolumes. We cannot currently mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the default subvolume. So say you take a snapshot of the default subvolume and call it snap1, and then take a snapshot of snap1 and call it snap2, so now you have / /snap1 /snap1/snap2 as your available volumes. Currently you can only mount / and /snap1, you cannot mount /snap1/snap2. To fix this problem instead of passing subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is the tree id that gets spit out via the subvolume listing you get from the subvolume listing patches (btrfs filesystem list). This allows us to mount /, /snap1 and /snap1/snap2 as the root volume. In addition to the above, we also now read the default dir item in the tree root to get the root key that it points to. For now this just points at what has always been the default subvolme, but later on I plan to change it to point at whatever root you want to be the new default root, so you can just set the default mount and not have to mount with -o subvolid=<treeid>. I tested this out with the above scenario and it worked perfectly. Thanks, mount -o subvol operates inside the selected subvolid. For example: mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt /mnt will have the snap1 directory for the subvolume with id 256. mount -o subvol=snap /dev/xxx /mnt /mnt will be the snap directory of whatever the default subvolume is. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
12534832 |
|
17-Dec-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: make set/get functions for the super compat_ro flags use compat_ro Our set/get functions for compat_ro_flags actually look at compat_flags. This will mess any attempt to use compat flags up. The fix is obvious. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a9185b41 |
|
05-Mar-2010 |
Christoph Hellwig <hch@lst.de> |
pass writeback_control to ->write_inode This gives the filesystem more information about the writeback that is happening. Trond requested this for the NFS unstable write handling, and other filesystems might benefit from this too by beeing able to distinguish between the different callers in more detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a555f810 |
|
28-Jan-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount -o compress-force The default btrfs mount -o compress mode will quickly back off compressing a file if it notices that compression does not reduce the size of the data being written. This can save considerable CPU because all future writes to the file go through uncompressed. But some files are both very large and have mixed data stored in them. In that case, we want to add the ability to always try compressing data before writing it. This commit adds mount -o compress-force. A later commit will add a new inode flag that does the same thing. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
86b9f2ec |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Fix per root used space accounting The bytes_used field in root item was originally planned to trace the amount of used data and tree blocks. But it never worked right since we can't trace freeing of data accurately. This patch changes it to only trace the amount of tree blocks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
24bbcf04 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Add delayed iput iput() can trigger new transactions if we are dropping the final reference, so calling it in btrfs_commit_transaction may end up deadlock. This patch adds delayed iput to avoid the issue. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f34f57a3 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Pass transaction handle to security and ACL initialization functions Pass transaction handle down to security and ACL initialization functions, so we can avoid starting nested transactions Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c71bf099 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Avoid orphan inodes cleanup while replaying log We do log replay in a single transaction, so it's not good to do unbound operations. This patch cleans up orphan inodes cleanup after replaying the log. It also avoids doing other unbound operations such as truncating a file during replaying log. These unbound operations are postponed to the orphan inode cleanup stage. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
920bbbfb |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Rewrite btrfs_drop_extents Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can avoid calling lock_extent within transaction. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ad48fd75 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Add btrfs_duplicate_item btrfs_duplicate_item duplicates item with new key, guaranteeing the source item and the new items are in the same tree leaf and contiguous. It allows us to split file extent in place, without using lock_extent to prevent bookend extent race. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e244a0ae |
|
14-Oct-2009 |
Christoph Hellwig <hch@lst.de> |
Btrfs: add -o discard option Enable discard by default is not a good idea given the the trim speed of SSD prototypes we've seen, and the carecteristics for many high-end arrays. Turn of discards by default and require the -o discard option to enable them on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0eda294d |
|
13-Oct-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix btrfs acl #ifdef checks The btrfs acl code was #ifdefing for a define that didn't exist. This correctly matches it to the values used by the Kconfig file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
257c62e1 |
|
13-Oct-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: avoid tree log commit when there are no changes rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
82d339d9 |
|
09-Oct-2009 |
Alexey Dobriyan <adobriyan@gmail.com> |
Btrfs: constify dentry_operations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ff782e0a |
|
08-Oct-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: optimize fsync for the single writer case This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie for new people to join the logging transaction if there is only ever 1 writer. This helps a little bit with latency where we have something like RPM where it will fdatasync every file it writes, and so waiting the 1 jiffie for every fdatasync really starts to add up. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e3ccfa98 |
|
07-Oct-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: async delalloc flushing under space pressure This patch moves the delalloc flushing that occurs when we are under space pressure off to a async thread pool. This helps since we only free up metadata space when we actually insert the extent item, which means it takes quite a while for space to be free'ed up if we wait on all ordered extents. However, if space is freed up due to inline extents being inserted, we can wake people who are waiting up early, and they can finish their work. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
32c00aff |
|
08-Oct-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: release delalloc reservations on extent item insertion This patch fixes an issue with the delalloc metadata space reservation code. The problem is we used to free the reservation as soon as we allocated the delalloc region. The problem with this is if we are not inserting an inline extent, we don't actually insert the extent item until after the ordered extent is written out. This patch does 3 things, 1) It moves the reservation clearing stuff into the ordered code, so when we remove the ordered extent we remove the reservation. 2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear delalloc bits in the cases where we want to clear the metadata reservation when we clear the delalloc extent, in the case that we do an inline extent or we invalidate the page. 3) It adds another waitqueue to the space info so that when we start a fs wide delalloc flush, anybody else who also hits that area will simply wait for the flush to finish and then try to make their allocation. This has been tested thoroughly to make sure we did not regress on performance. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
61d92c32 |
|
02-Oct-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix deadlock on async thread startup The btrfs async worker threads are used for a wide variety of things, including processing bio end_io functions. This means that when the endio threads aren't running, the rest of the FS isn't able to do the final processing required to clear PageWriteback. The endio threads also try to exit as they become idle and start more as the work piles up. The problem is that starting more threads means kthreadd may need to allocate ram, and that allocation may wait until the global number of writeback pages on the system is below a certain limit. The result of that throttling is that end IO threads wait on kthreadd, who is waiting on IO to end, which will never happen. This commit fixes the deadlock by handing off thread startup to a dedicated thread. It also fixes a bug where the on-demand thread creation was creating far too many threads because it didn't take into account threads being started by other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
828c0950 |
|
01-Oct-2009 |
Alexey Dobriyan <adobriyan@gmail.com> |
const: constify remaining file_operations [akpm@linux-foundation.org: fix KVM] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3baf0bed |
|
29-Sep-2009 |
Chris Ball <cjb@laptop.org> |
Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're currently not using it and are testing CONFIG_FS_POSIX_ACL instead. CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs". Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9ed74f2d |
|
11-Sep-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: proper -ENOSPC handling At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1b2da372 |
|
11-Sep-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: account for space used by the super mirrors As we get closer to proper -ENOSPC handling in btrfs, we need more accurate space accounting for the space info's. Currently we exclude the free space for the super mirrors, but the space they take up isn't accounted for in any of the counters. This patch introduces bytes_super, which keeps track of the amount of bytes used for a super mirror in the block group cache and space info. This makes sure that our free space caclucations will be completely accurate. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ba1bf481 |
|
11-Sep-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: make balance code choose more wisely when relocating Currently, we can panic the box if the first block group we go to move is of a type where there is no space left to move those extents. For example, if we fill the disk up with data, and then we try to balance and we have no room to move the data nor room to allocate new chunks, we will panic. Change this by checking to see if we have room to move this chunk around, and if not, return -ENOSPC and move on to the next chunk. This will make sure we remove block groups that are moveable, like if we have alot of empty metadata block groups, and then that way we make room to be able to balance our data chunks as well. Tested this with an fs that would panic on btrfs-vol -b normally, but no longer panics with this patch. V1->V2: -actually search for a free extent on the device to make sure we can allocate a chunk if need be. -fix btrfs_shrink_device to make sure we actually try to relocate all the chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r we don't remove the device with data still on it. -check to make sure the block group we are going to relocate isn't the last one in that particular space -fix a bug in btrfs_shrink_device where we would change the device's size and not fix it if we fail to do our relocate Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
76dda93c |
|
21-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: add snapshot/subvolume destroy ioctl This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4df27c4d |
|
21-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: change how subvolumes are organized btrfs allows subvolumes and snapshots anywhere in the directory tree. If we snapshot a subvolume that contains a link to other subvolume called subvolA, subvolA can be accessed through both the original subvolume and the snapshot. This is similar to creating hard link to directory, and has the very similar problems. The aim of this patch is enforcing there is only one access point to each subvolume. Only the first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point. The first directory entry is distinguished by checking root forward reference. If the corresponding root forward reference is missing, we know the entry is not the first one. This patch also adds snapshot/subvolume rename support, the code allows rename subvolume link across subvolumes. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
13a8a7c8 |
|
21-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: do not reuse objectid of deleted snapshot/subvol The new back reference format does not allow reusing objectid of deleted snapshot/subvol. So we use ++highest_objectid to allocate objectid for new snapshot/subvol. Now we use ++highest_objectid to allocate objectid for both new inode and new snapshot/subvolume, so this patch removes 'find hole' code in btrfs_find_free_objectid. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
11833d66 |
|
11-Sep-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: improve async block group caching This patch gets rid of two limitations of async block group caching. The old code delays handling pinned extents when block group is in caching. To allocate logged file extents, the old code need wait until block group is fully cached. To get rid of the limitations, This patch introduces a data structure to track the progress of caching. Base on the caching progress, we know which extents should be added to the free space cache when handling the pinned extents. The logged file extents are also handled in a similar way. This patch also changes how pinned extents are tracked. The old code uses one tree to track pinned extents, and copy the pinned extents tree at transaction commit time. This patch makes it use two trees to track pinned extents. One tree for extents that are pinned in the running transaction, one tree for extents that can be unpinned. At transaction commit time, we swap the two trees. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a1ed835e |
|
10-Sep-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix extent replacment race Data COW means that whenever we write to a file, we replace any old extent pointers with new ones. There was a window where a readpage might find the old extent pointers on disk and cache them in the extent_map tree in ram in the middle of a given write replacing them. Even though both the readpage and the write had their respective bytes in the file locked, the extent readpage inserts may cover more bytes than it had locked down. This commit closes the race by keeping the new extent pinned in the extent map tree until after the on-disk btree is properly setup with the new extent pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
276e680d |
|
30-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: preserve commit_root for async caching The async block group caching code uses the commit_root pointer to get a stable version of the extent allocation tree for scanning. This copy of the tree root isn't going to change and it significantly reduces the complexity of the scanning code. During a commit, we have a loop where we update the extent allocation tree root. We need to loop because updating the root pointer in the tree of tree roots may allocate blocks which may change the extent allocation tree. Right now the commit_root pointer is changed inside this loop. It is more correct to change the commit_root pointer only after all the looping is done. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
68b38550 |
|
27-Jul-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: change how we unpin extents We are racy with async block caching and unpinning extents. This patch makes things much less complicated by only unpinning the extent if the block group is cached. We check the block_group->cached var under the block_group->lock spin lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters, and unpin the extent and add the free space back. If it is not set to this, we start the caching of the block group so the next time we unpin extents we can unpin the extent. This keeps us from racing with the async caching threads, lets us kill the fs wide async thread counter, and keeps us from having to set DELALLOC bits for every extent we hit if there are caching kthreads going. One thing that needed to be changed was btrfs_free_super_mirror_extents. Now instead of just looking for LOCKED extents, we also look for DIRTY extents, since we could have left some extents pinned in the previous transaction that will never get freed now that we are unmounting, which would cause us to leak memory. So btrfs_free_super_mirror_extents has been changed to btrfs_free_pinned_extents, and it will clear the extents locked for the super mirror, and any remaining pinned extents that may be present. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
817d52f8 |
|
13-Jul-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: async block group caching This patch moves the caching of the block group off to a kthread in order to allow people to allocate sooner. Instead of blocking up behind the caching mutex, we instead kick of the caching kthread, and then attempt to make an allocation. If we cannot, we wait on the block groups caching waitqueue, which the caching kthread will wake the waiting threads up everytime it finds 2 meg worth of space, and then again when its finished caching. This is how I tested the speedup from this mkfs the disk mount the disk fill the disk up with fs_mark unmount the disk mount the disk time touch /mnt/foo Without my changes this took 11 seconds on my box, with these changes it now takes 1 second. Another change thats been put in place is we lock the super mirror's in the pinned extent map in order to keep us from adding that stuff as free space when caching the block group. This doesn't really change anything else as far as the pinned extent map is concerned, since for actual pinned extents we use EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock those extents to keep from leaking memory. I've also added a check where when we are reading block groups from disk, if the amount of space used == the size of the block group, we go ahead and mark the block group as cached. This drastically reduces the amount of time it takes to cache the block groups. Using the same test as above, except doing a dd to a file and then unmounting, it used to take 33 seconds to umount, now it takes 3 seconds. This version uses the commit_root in the caching kthread, and then keeps track of how many async caching threads are running at any given time so if one of the async threads is still running as we cross transactions we can wait until its finished before handling the pinned extents. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
96303081 |
|
13-Jul-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: use hybrid extents+bitmap rb tree for free space Currently btrfs has a problem where it can use a ridiculous amount of RAM simply tracking free space. As free space gets fragmented, we end up with thousands of entries on an rb-tree per block group, which usually spans 1 gig of area. Since we currently don't ever flush free space cache back to disk this gets to be a bit unweildly on large fs's with lots of fragmentation. This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free space cache. Initially we calculate a threshold of extent entries we can handle, which is however many extent entries we can cram into 16k of ram. The maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace will be 32k of RAM, which scales much better than we did before. Once we pass the extent threshold, we start adding bitmaps and using those instead for tracking the free space. This patch also makes it so that any free space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is nice since we try and allocate out of the front of a block group, so if the front of a block group is heavily fragmented and then has a huge chunk of free space at the end, we go ahead and add the fragmented areas to bitmaps and use a normal extent entry to track the big chunk at the back of the block group. I've also taken the opportunity to revamp how we search for free space. Previously we indexed free space via an offset indexed rb tree and a bytes indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset indexed rb tree. This cuts the number of tree operations we were doing previously down by half, and gives us a little bit of a better allocation pattern since we will always start from a specific offset and search forward from there, instead of searching for the size we need and try and get it as close as possible to the offset we want. I've given this a healthy amount of testing pre-new format stuff, as well as post-new format stuff. I've booted up my fedora box which is installed on btrfs with this patch and ran with it for a few days without issues. I've not seen any performance regressions in any of my tests. Since the last patch Yan Zheng fixed a problem where we could have overlapping entries, so updating their offset inline would cause problems. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1bec1aed |
|
22-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix definition of struct btrfs_extent_inline_ref use __le64 instead of u64 in on-disk structure definition. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2c47e605 |
|
27-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: update backrefs while dropping snapshot The new backref format has restriction on type of backref item. If a tree block isn't referenced by its owner tree, full backrefs must be used for the pointers in it. When a tree block loses its owner tree's reference, backrefs for the pointers in it should be updated to full backrefs. Current btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for general use. This patch adds backrefs update code to btrfs_drop_snapshot. It isn't a problem in the restricted form btrfs_drop_snapshot is used today, but for general snapshot deletion this update is required. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5affd88a |
|
08-Jun-2009 |
Al Viro <viro@zeniv.linux.org.uk> |
switch btrfs to inode->i_acl Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7df336ec |
|
10-Jun-2009 |
Al Viro <viro@ZenIV.linux.org.uk> |
Fix btrfs when ACLs are configured out ... otherwise generic_permission() will allow *anything* for all files you don't own and that have some group permissions. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6cbff00f |
|
17-Apr-2009 |
Christoph Hellwig <hch@lst.de> |
Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION Add support for the standard attributes set via chattr and read via lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros as they were confusing the existing code and got in the way of the new additions. Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's trivial. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c289811c |
|
10-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: autodetect SSD devices During mount, btrfs will check the queue nonrot flag for all the devices found in the FS. If they are all non-rotating, SSD mode is enabled by default. If the FS was mounted with -o nossd, the non-rotating flag is ignored. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
451d7585 |
|
09-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add mount -o ssd_spread to spread allocations out Some SSDs perform best when reusing block numbers often, while others perform much better when clustering strictly allocates big chunks of unused space. The default mount -o ssd will find rough groupings of blocks where there are a bunch of free blocks that might have some allocated blocks mixed in. mount -o ssd_spread will make sure there are no allocated blocks mixed in. It should perform better on lower end SSDs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5d4f98a2 |
|
10-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e980b50c |
|
24-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix fallocate deadlock on inode extent lock The btrfs fallocate call takes an extent lock on the entire range being fallocated, and then runs through insert_reserved_extent on each extent as they are allocated. The problem with this is that btrfs_drop_extents may decide to try and take the same extent lock fallocate was already holding. The solution used here is to push down knowledge of the range that is already locked going into btrfs_drop_extents. It turns out that at least one other caller had the same bug. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
97e728d4 |
|
21-Apr-2009 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: try to keep a healthy ratio of metadata vs data block groups This patch makes the chunk allocator keep a good ratio of metadata vs data block groups. By default for every 8 data block groups, we'll allocate 1 metadata chunk, or about 12% of the disk will be allocated for metadata. This can be changed by specifying the metadata_ratio mount option. This is simply the number of data block groups that have to be allocated to force a metadata chunk allocation. By making sure we allocate metadata chunks more often, we are less likely to get into situations where the whole disk has been allocated as data block groups. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d4a78947 |
|
02-Apr-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
Btrfs: fix typos in comments Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dccae999 |
|
02-Apr-2009 |
Sage Weil <sage@newdream.net> |
Btrfs: add flushoncommit mount option The 'flushoncommit' mount option forces any data dirtied by a write in a prior transaction to commit as part of the current commit. This makes the committed state a fully consistent view of the file system from the application's perspective (i.e., it includes all completed file system operations). This was previously the behavior only when a snapshot is created. This is used by Ceph to ensure that completed writes make it to the platter along with the metadata operations they are bound to (by BTRFS_IOC_TRANS_{START,END}). Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3a5e1404 |
|
02-Apr-2009 |
Sage Weil <sage@newdream.net> |
Btrfs: notreelog mount option Add a 'notreelog' mount option to disable the tree log (used by fsync, O_SYNC writes). This is much slower, but the tree logging produces inconsistent views into the FS for ceph. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fa9c0d79 |
|
03-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: rework allocation clustering Because btrfs is copy-on-write, we end up picking new locations for blocks very often. This makes it fairly difficult to maintain perfect read patterns over time, but we can at least do some optimizations for writes. This is done today by remembering the last place we allocated and trying to find a free space hole big enough to hold more than just one allocation. The end result is that we tend to write sequentially to the drive. This happens all the time for metadata and it happens for data when mounted -o ssd. But, the way we record it is fairly racey and it tends to fragment the free space over time because we are trying to allocate fairly large areas at once. This commit gets rid of the races by adding a free space cluster object with dedicated locking to make sure that only one process at a time is out replacing the cluster. The free space fragmentation is somewhat solved by allowing a cluster to be comprised of smaller free space extents. This part definitely adds some CPU time to the cluster allocations, but it allows the allocator to consume the small holes left behind by cow. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
04018de5 |
|
03-Apr-2009 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: kill the pinned_mutex This patch removes the pinned_mutex. The extent io map has an internal tree lock that protects the tree itself, and since we only copy the extent io map when we are committing the transaction we don't need it there. We also don't need it when caching the block group since searching through the tree is also protected by the internal map spin lock. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
6226cb0a |
|
03-Apr-2009 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: kill the block group alloc mutex This patch removes the block group alloc mutex used to protect the free space tree for allocations and replaces it with a spin lock which is used only to protect the free space rb tree. This means we only take the lock when we are directly manipulating the tree, which makes us a touch faster with multi-threaded workloads. This patch also gets rid of btrfs_find_free_space and replaces it with btrfs_find_space_for_alloc, which takes the number of bytes you want to allocate, and empty_size, which is used to indicate how much free space should be at the end of the allocation. It will return an offset for the allocator to use. If we don't end up using it we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to kill the alloc_mutex, since we need to make sure nobody else comes along and takes our space. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
c2ec175c |
|
31-Mar-2009 |
Nick Piggin <npiggin@suse.de> |
mm: page_mkwrite change prototype to match fault Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5a3f23d5 |
|
31-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add extra flushing for renames and truncates Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
12fcfd22 |
|
24-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: tree logging unlink/rename fixes The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b9473439 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: leave btree locks spinning more often btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c3e69d58 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: process the delayed reference queue in clusters The delayed reference queue maintains pending operations that need to be done to the extent allocation tree. These are processed by finding records in the tree that are not currently being processed one at a time. This is slow because it uses lots of time searching through the rbtree and because it creates lock contention on the extent allocation tree when lots of different procs are running delayed refs at the same time. This commit changes things to grab a cluster of refs for processing, using a cursor into the rbtree as the starting point of the next search. This way we walk smoothly through the rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
56bec294 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9fa8cfe7 |
|
13-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't preallocate metadata blocks during btrfs_search_slot In order to avoid doing expensive extent management with tree locks held, btrfs_search_slot will preallocate tree blocks for use by COW without any tree locks held. A later commit moves all of the extent allocation work for COW into a delayed update mechanism, and this preallocation will no longer be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4184ea7f |
|
09-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix locking around adding new space_info Storage allocated to different raid levels in btrfs is tracked by a btrfs_space_info structure, and all of the current space_infos are collected into a list_head. Most filesystems have 3 or 4 of these structs total, and the list is only changed when new raid levels are added or at unmount time. This commit adds rcu locking on the list head, and properly frees things at unmount time. It also clears the space_info->full flag whenever new space is added to the FS. The locking for the space info list goes like this: reads: protected by rcu_read_lock() writes: protected by the chunk_mutex At unmount time we don't need special locking because all the readers are gone. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6a63209f |
|
20-Feb-2009 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: add better -ENOSPC handling This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc account for extents we've actually setup for delalloc and will be allocated at some point down the line. bytes_may_use is to keep track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from not actually being able to allocate space for any delalloc bytes. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
4008c04a |
|
12-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make a lockdep class for the extent buffer locks Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock orderin. The lockdep magic cam from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e00f7308 |
|
12-Feb-2009 |
Jeff Mahoney <jeffm@suse.com> |
Btrfs: remove btrfs_init_path btrfs_init_path was initially used when the path objects were on the stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path isn't required. This patch removes it, and just uses kmem_cache_zalloc to zero out the object. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b4ce94de |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change btree locking to use explicit blocking points Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c487685d |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: hash_lock is no longer needed Before metadata is written to disk, it is updated to reflect that writeout has begun. Once this update is done, the block must be cow'd before it can be modified again. This update was originally synchronized by using a per-fs spinlock. Today the buffers for the metadata blocks are locked before writeout begins, and everyone that tests the flag has the buffer locked as well. So, the per-fs spinlock (called hash_lock for no good reason) is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7237f183 |
|
20-Jan-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix tree logs parallel sync To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch has following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert root item into the log root tree immediately after log tree is allocated. Root item for log tree is inserted when log tree get synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first sync the log tree; then updates corresponding root item in the log root tree; sync the log root tree; then update the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
95029d7d |
|
21-Jan-2009 |
Jan Engelhardt <jengelh@medozas.de> |
Btrfs: change/remove typedef Change one typedef to a regular enum, and remove an unused one. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d397712b |
|
05-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cad321ad |
|
17-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: shift all end_io work to thread pools bio_end_io for reads without checksumming on and btree writes were happening without using async thread pools. This means the extent_io.c code had to use spin_lock_irq and friends on the rb tree locks for extent state. There were some irq safe vs unsafe lock inversions between the delallock lock and the extent state locks. This patch gets rid of them by moving all end_io code into the thread pools. To avoid contention and deadlocks between the data end_io processing and the metadata end_io processing yet another thread pool is added to finish off metadata writes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
17d217fe |
|
12-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix nodatasum handling in balancing code Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains following things to make data relocation work. 1) make nodatasum/nodatacow mount option only affects new files. Checksums and COW on data are only controlled by the inode flags. 2) check the existence of checksum in the nodatacow checker. If checksums exist, force COW the data extent. This ensure that checksum for a given block is either valid or does not exist. 3) update data relocation code to properly handle the case of checksum missing. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
d2fb3437 |
|
11-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix leaking block group on balance The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
459931ec |
|
10-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Delete csum items when freeing extents This finishes off the new checksumming code by removing csum items for extents that are no longer in use. The trick is doing it without racing because a single csum item may hold csums for more than one extent. Extra checks are added to btrfs_csum_file_blocks to make sure that we are using the correct csum item after dropping locks. A new btrfs_split_item is added to split a single csum item so it can be split without dropping the leaf lock. This is used to remove csum bytes from the middle of an item. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c3027eb5 |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add inode sequence number for NFS and reserved space in a few structs This adds a sequence number to the btrfs inode that is increased on every update. NFS will be able to use that to detect when an inode has changed, without relying on inaccurate time fields. While we're here, this also: Puts reserved space into the super block and inode Adds a log root transid to the super so we can pick the newest super based on the fsync log as well as the main transaction ID. For now the log root transid is always zero, but that'll get fixed. Adds a starting offset to the dev_item. This will let us do better alignment calculations if we know the start of a partition on the disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d20f7043 |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: move data checksumming into a dedicated tree Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2a7108ad |
|
02-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: rev the disk format for the inode compat and csum selection changes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
607d432d |
|
02-Dec-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: add support for multiple csum algorithms This patch gives us the space we will need in order to have different csum algorithims at some point in the future. We save the csum algorithim type in the superblock, and use those instead of define's. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
f2b636e8 |
|
02-Dec-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: add support for compat flags to btrfs This adds the necessary disk format for handling compatibility flags in the future to handle disk format changes. We have a compat_flags, compat_ro_flags and incompat_flags set for the super block. Compat flags will be to hold the features that are compatible with older versions of btrfs, compat_ro flags have features that are compatible with older versions of btrfs if the fs is mounted read only, and incompat_flags has features that are incompatible with older versions of btrfs. This also axes the compat_flags field for the inode and just makes the flags field a 64bit field, and changes the root item flags field to 64bit. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
ea6a478e |
|
19-Nov-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Fix for lockdep warnings with alloc_mutex and pinned_mutex This the lockdep complaint by having a different mutex to gaurd caching the block group, so you don't end up with this backwards dependancy. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
73e9f5be |
|
18-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Update the disk format for the seed device and new root code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ea9e8b11 |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: prevent loops in the directory tree when creating snapshots For a directory tree: /mnt/subvolA/subvolB btrfsctl -s /mnt/subvolA/subvolB /mnt Will create a directory loop with subvolA under subvolB. This commit uses the forward refs for each subvol and snapshot to error out before creating the loop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0660b5af |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add backrefs and forward refs for subvols and snapshots Subvols and snapshots can now be referenced from any point in the directory tree. We need to maintain back refs for them so we can find lost subvols. Forward refs are added so that we know all of the subvols and snapshots referenced anywhere in the directory tree of a single subvol. This can be used to do recursive snapshotting (but they aren't yet) and it is also used to detect and prevent directory loops when creating new snapshots. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3394e160 |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Give each subvol and snapshot their own anonymous devid Each subvolume has its own private inode number space, and so we need to fill in different device numbers for each subvolume to avoid confusing applications. This commit puts a struct super_block into struct btrfs_root so it can call set_anon_super() and get a different device number generated for each root. btrfs_rename is changed to prevent renames across subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3de4586c |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow subvolumes and snapshots anywhere in the directory tree Before, all snapshots and subvolumes lived in a single flat directory. This was awkward and confusing because the single flat directory was only writable with the ioctls. This commit changes the ioctls to create subvols and snapshots at any point in the directory tree. This requires making separate ioctls for snapshot and subvol creation instead of a combining them into one. The subvol ioctl does: btrfsctl -S subvol_name parent_dir After the ioctl is done subvol_name lives inside parent_dir. The snapshot ioctl does: btrfsctl -s path_for_snapshot root_to_snapshot path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up into directory and basename components. root_to_snapshot can be any file or directory in the FS. The snapshot is taken of the entire root where that file lives. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b82032c |
|
17-Nov-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Seed device support Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
c146afad |
|
12-Nov-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: mount ro and remount support This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
f3465ca4 |
|
12-Nov-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: batch extent inserts/updates/deletions on the extent root While profiling the allocator I noticed a good amount of time was being spent in finish_current_insert and del_pending_extents, and as the filesystem filled up more and more time was being spent in those functions. This patch aims to try and reduce that problem. This happens two ways 1) track if we tried to delete an extent that we are going to update or insert. Once we get into finish_current_insert we discard any of the extents that were marked for deletion. This saves us from doing unnecessary work almost every time finish_current_insert runs. 2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for each individual extent and doing the needed operation, we instead keep the leaf around and see if there is anything else we can do on that leaf. On the insert case I introduced a btrfs_insert_some_items, which will take an array of keys with an array of data_sizes and try and squeeze in as many of those keys as possible, and then return how many keys it was able to insert. In the update case we search for an extent ref, update the ref and then loop through the leaf to see if any of the other refs we are looking to update are on that leaf, and then once we are done we release the path and search for the next ref we need to update. And finally for the deletion we try and delete the extent+ref in pairs, so we will try to find extent+ref pairs next to the extent we are trying to free and free them in bulk if possible. This along with the other cluster fix that Chris pushed out a bit ago helps make the allocator preform more uniformly as it fills up the disk. There is still a slight drop as we fill up the disk since we start having to stick new blocks in odd places which results in more COW's than on a empty fs, but the drop is not nearly as severe as it was before. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
771ed689 |
|
06-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Optimize compressed writeback and reads When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
537fb067 |
|
30-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: rev the disk format for fallocate Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d899e052 |
|
30-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add fallocate support v2 This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
80ff3856 |
|
30-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: update nodatacow code v2 This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
9036c102 |
|
30-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: update hole handling v2 This patch splits the hole insertion code out of btrfs_setattr into btrfs_cont_expand and updates btrfs_get_extent to properly handle the case that file extent items are not continuous. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
09fde3c9 |
|
29-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Rev the disk format for compression and root pointer generation fields
|
#
84234f3a |
|
29-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add root tree pointer transaction ids This patch adds transaction IDs to root tree pointers. Transaction IDs in tree pointers are compared with the generation numbers in block headers when reading root blocks of trees. This can detect some types of IO errors. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
25179201 |
|
29-Oct-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: nuke fs wide allocation mutex V2 This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
80eb234a |
|
29-Oct-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: fix enospc when there is plenty of space So there is an odd case where we can possibly return -ENOSPC when there is in fact space to be had. It only happens with Metadata writes, and happens _very_ infrequently. What has to happen is we have to allocate have allocated out of the first logical byte on the disk, which would set last_alloc to first_logical_byte(root, 0), so search_start == orig_search_start. We then need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start, block_group_bits() won't match and we'll go to choose another block group. However because search_start matches orig_search_start we go to see if we can allocate a chunk. If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC. This is kind of a big flaw of the way find_free_extent works, as it along with find_free_space loop through _all_ of the block groups, not just the ones that we want to allocate out of. This patch completely kills find_free_space and rolls it into find_free_extent. I've introduced a sort of state machine into this, which will make it easier to get cache miss information out of the allocator, and will work well with my locking changes. The basic flow is this: We have the variable loop which is 0, meaning we are in the hint phase. We lookup the block group for the hint, and lookup the space_info for what we want to allocate out of. If the block group we were pointed at by the hint either isn't of the correct type, or just doesn't have the space we need, we set head to space_info->block_groups, so we start at the beginning of the block groups for this particular space info, and loop through. This is also where we add the empty_cluster to total_needed. At this point loop is set to 1 and we just loop through all of the block groups for this particular space_info looking for the space we need, just as find_free_space would have done, except we only hit the block groups we want and not _all_ of the block groups. If we come full circle we see if we can allocate a chunk. If we cannot of course we exit with -ENOSPC and we are good. If not we start over at space_info->block_groups and loop through again, with loop == 2. If we come full circle and haven't found what we need then we exit with -ENOSPC. I've been running this for a couple of days now and it seems stable, and I haven't yet hit a -ENOSPC when there was plenty of space left. Also I've made a groups_sem to handle the group list for the space_info. This is part of my locking changes, but is relatively safe and seems better than holding the space_info spinlock over that entire search time. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
f82d02d9 |
|
29-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Improve space balancing code This patch improves the space balancing code to keep more sharing of tree blocks. The only case that breaks sharing of tree blocks is data extents get fragmented during balancing. The main changes in this patch are: Add a 'drop sub-tree' function. This solves the problem in old code that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block. Remove relocation mapping tree. Relocation mappings are stored in struct btrfs_ref_path and updated dynamically during walking up/down the reference path. This reduces CPU usage and simplifies code. This patch also fixes a bug. Root items for reloc trees should be updated in btrfs_free_reloc_root. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
c8b97818 |
|
29-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add zlib compression support This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cb8e7090 |
|
09-Oct-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: Fix subvolume creation locking rules Creating a subvolume is in many ways like a normal VFS ->mkdir, and we really need to play with the VFS topology locking rules. So instead of just creating the snapshot on disk and then later getting rid of confliting aliases do it correctly from the start. This will become especially important once we allow for subvolumes anywhere in the tree, and not just below a hidden root. Note that snapshots will need the same treatment, but do to the delay in creating them we can't do it currently. Chris promised to fix that issue, so I'll wait on that. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
833023e4 |
|
09-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Rev the disk format for the new back reference format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3bb1a1bc |
|
09-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Remove offset field from struct btrfs_extent_ref The offset field in struct btrfs_extent_ref records the position inside file that file extent is referenced by. In the new back reference system, tree leaves holding references to file extent are recorded explicitly. We can scan these tree leaves very quickly, so the offset field is not required. This patch also makes the back reference system check the objectid when extents are in deleting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
a76a3cd4 |
|
09-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Count space allocated to file in bytes This patch makes btrfs count space allocated to file in bytes instead of 512 byte sectors. Everything else in btrfs uses a byte count instead of sector sizes or blocks sizes, so this fits better. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
30c43e24 |
|
02-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: remove last_log_alloc allocator optimization The tree logging code was trying to separate tree log allocations from normal metadata allocations to improve writeback patterns during an fsync. But, the code was not effective and ended up just mixing tree log blocks with regular metadata. That seems to be working fairly well, so the last_log_alloc code can be removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
323ac95b |
|
01-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't read leaf blocks containing only checksums during truncate Checksum items take up a significant portion of the metadata for large files. It is possible to avoid reading them during truncates by checking the keys in the higher level nodes. If a given leaf is followed by another leaf where the lowest key is a checksum item from the same file, we know we can safely delete the leaf without reading it. For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete the file with a cold cache. It is read bound during the run. With this change, Btrfs is able to delete the file in 0.5s Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d352ac68 |
|
29-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add and improve comments This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8c8bee1d |
|
29-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Wait for IO on the block device inodes of newly added devices btrfs-vol -a /dev/xxx will zero the first and last two MB of the device. The kernel code needs to wait for this IO to finish before it adds the device. btrfs metadata IO does not happen through the block device inode. A separate address space is used, allowing the zero filled buffer heads in the block device inode to be written to disk after FS metadata starts going down to the disk via the btrfs metadata inode. The end result is zero filled metadata blocks after adding new devices into the filesystem. The fix is a simple filemap_write_and_wait on the block device inode before actually inserting it into the pool of available devices. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1a40e23b |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: update space balancing code This patch updates the space balancing code to utilize the new backref format. Before, btrfs-vol -b would break any COW links on data blocks or metadata. This was slow and caused the amount of space used to explode if a large number of snapshots were present. The new code can keeps the sharing of all data extents and most of the tree blocks. To maintain the sharing of data extents, the space balance code uses a seperate inode hold data extent pointers, then updates the references to point to the new location. To maintain the sharing of tree blocks, the space balance code uses reloc trees to relocate tree blocks in reference counted roots. There is one reloc tree for each subvol, and all reloc trees share same root key objectid. Reloc trees are snapshots of the latest committed roots of subvols (root->commit_root). To relocate a tree block referenced by a subvol, there are two steps. COW the block through subvol's reloc tree, then update block pointer in the subvol to point to the new block. Since all reloc trees share same root key objectid, doing special handing for tree blocks owned by them is easy. Once a tree block has been COWed in one reloc tree, we can use the resulting new block directly when the same block is required to COW again through other reloc trees. In this way, relocated tree blocks are shared between reloc trees, so they are also shared between subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5b21f2ed |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: extent_map and data=ordered fixes for space balancing * Add an EXTENT_BOUNDARY state bit to keep the writepage code from merging data extents that are in the process of being relocated. This allows us to do accounting for them properly. * The balancing code relocates data extents indepdent of the underlying inode. The extent_map code was modified to properly account for things moving around (invalidating extent_map caches in the inode). * Don't take the drop_mutex in the create_subvol ioctl. It isn't required. * Fix walking of the ordered extent list to avoid races with sys_unlink * Change the lock ordering rules. Transaction start goes outside the drop_mutex. This allows btrfs_commit_transaction to directly drop the relocation trees. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e4657689 |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: Add shared reference cache Btrfs has a cache of reference counts in leaves, allowing it to avoid reading tree leaves while deleting snapshots. To reduce contention with multiple subvolumes, this cache is private to each subvolume. This patch adds shared reference cache support. The new space balancing code plays with multiple subvols at the same time, So the old per-subvol reference cache is not well suited. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e8569813 |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: allocator fixes for space balancing update * Reserved extent accounting: reserved extents have been allocated in the rbtrees that track free space but have not been allocated on disk. They were never properly accounted for in the past, making it hard to know how much space was really free. * btrfs_find_block_group used to return NULL for block groups that had been removed by the space balancing code. This made it hard to account for space during the final stages of a balance run. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b1f55b0 |
|
24-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Remove Btrfs compat code for older kernels Btrfs had compatibility code for kernels back to 2.6.18. These have been removed, and will be maintained in a separate backport git tree from now on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31840ae1 |
|
23-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: Full back reference support This patch makes the back reference system to explicit record the location of parent node for all types of extents. The location of parent node is placed into the offset field of backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f9dd46c |
|
23-Sep-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: free space accounting redo 1) replace the per fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas via offset and size. The reason to do this is because most allocations come with a hint byte where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset we fall back on looking for a chunk of the given size as close to that given offset as possible. When we fall back on the size search we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas if possible. 2) remove the extent_io_tree that tracked the block group cache from fs_info and replaced it with an rb-tree thats tracks block group cache via offset. also added a per space_info list that tracks the block group cache for the particular space so we can lookup related block groups easily. 3) cleaned up the allocation code to make it a little easier to read and a little less complicated. Basically there are 3 steps, first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails try to allocate space if we can and start looking again. If not we're screwed and need to start over again. 4) small fixes. there were some issues in volumes.c where we wouldn't allocate the rest of the disk. fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so everytime we did a data allocation we'd search through every block group that we have looking for free space. Now searching a block group with no free space isn't terribly time consuming, it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degredation and made things semi-normal. There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and make a significant performance gain. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d0c803c4 |
|
11-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Record dirty pages tree-log pages in an extent_io tree This is the same way the transaction code makes sure that all the other tree blocks are safely on disk. There's an extent_io tree for each root, and any blocks allocated to the tree logs are recorded in that tree. At tree-log sync, the extent_io tree is walked to flush down the dirty pages and wait for them. The main benefit is less time spent walking the tree log and skipping clean pages, and getting sequential IO down to the drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6527cdbe |
|
05-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: Update find free objectid function for orphan cleanup code Orphan items use BTRFS_ORPHAN_OBJECTID (-5UUL) as key objectid. This affects the find free objectid functions, inode objectid can easily overflow after orphan file cleanup. --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a237d2a2 |
|
05-Sep-2008 |
Christoph Hellwig <hch@lst.de> |
remove unused function btrfs_ilookup btrfs_ilookup is unused, which is good because a normal filesystem should never have to use ilookup anyway. Remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
91c0827d |
|
05-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Rev the disk format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e02119d5 |
|
05-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a write ahead tree log to optimize synchronous operations File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f3f9931e |
|
21-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Rev the disk format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1a54ef8c |
|
20-Jul-2008 |
Balaji Rao <balajirrao@gmail.com> |
Introduce btrfs_iget helper Date: Mon, 21 Jul 2008 02:01:04 +0530 This patch introduces a btrfs_iget helper to be used in NFS support. Signed-off-by: Balaji Rao <balajirrao@gmail.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4854ddd0 |
|
15-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Wait for kernel threads to make progress during async submission Before this change, btrfs would use a bdi congestion function to make sure there weren't too many pending async checksum work items. This change makes the process creating async work items wait instead, leading to fewer congestion returns from the bdi. This improves pdflush background_writeout scanning. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0986fe9e |
|
15-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Count async bios separately from async checksum work items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5036f538 |
|
07-Aug-2008 |
Eric Sandeen <sandeen@redhat.com> |
Btrfs: fix RHEL test for ClearPageFsMisc Newer RHEL5 kernels define both ClearPageFSMisc and ClearPageChecked, so test for both before redefining. Signed-off-by: Eric Sandeen <sandeen@redhat.com> --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7ea394f1 |
|
05-Aug-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Fix nodatacow for the new data=ordered mode Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ea8c2819 |
|
04-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9ca9ee09 |
|
04-Aug-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: fix ioctl-initiated transactions vs wait_current_trans() Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in progress) breaks the transaction start/stop ioctls by making btrfs_start_transaction conditionally wait for the next transaction to start. If an application artificially is holding a transaction open, things deadlock. This workaround maintains a count of open ioctl-initiated transactions in fs_info, and avoids wait_current_trans() if any are currently open (in start_transaction() and btrfs_throttle()). The start transaction ioctl uses a new btrfs_start_ioctl_transaction() that _does_ call wait_current_trans(), effectively pushing the join/wait decision to the outer ioctl-initiated transaction. This more or less neuters btrfs_throttle() when ioctl-initiated transactions are in use, but that seems like a pretty fundamental consequence of wrapping lots of write()'s in a transaction. Btrfs has no way to tell if the application considers a given operation as part of it's transaction. Obviously, if the transaction start/stop ioctls aren't being used, there is no effect on current behavior. Signed-off-by: Sage Weil <sage@newdream.net> --- ctree.h | 1 + ioctl.c | 12 +++++++++++- transaction.c | 18 +++++++++++++----- transaction.h | 2 ++ 4 files changed, 27 insertions(+), 6 deletions(-) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
65b51a00 |
|
01-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
btrfs_search_slot: reduce lock contention by cowing in two stages A btree block cow has two parts, the first is to allocate a destination block and the second is to copy the old bock over. The first part needs locks in the extent allocation tree, and may need to do IO. This changeset splits that into a separate function that can be called without any tree locks held. btrfs_search_slot is changed to drop its path and start over if it has to COW a contended block. This often means that many writers will pre-alloc a new destination for a the same contended block, but they cache their prealloc for later use on lower levels in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
61b49440 |
|
31-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix streaming read performance with checksumming on Large streaming reads make for large bios, which means each entry on the list async work queues represents a large amount of data. IO congestion throttling on the device was kicking in before the async worker threads decided a single thread was busy and needed some help. The end result was that a streaming read would result in a single CPU running at 100% instead of balancing the work off to other CPUs. This patch also changes the pre-IO checksum lookup done by reads to work on a per-bio basis instead of a per-page. This results in many extra btree lookups on large streaming reads. Doing the checksum lookup right before bio submit allows us to reuse searches while processing adjacent offsets. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bcc63abb |
|
30-Jul-2008 |
Yan <zheng.yan@oracle.com> |
Btrfs: implement memory reclaim for leaf reference cache The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f321e491 |
|
30-Jul-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Update and fix mount -o nodatacow To check whether a given file extent is referenced by multiple snapshots, the checker walks down the fs tree through dead root and checks all tree blocks in the path. We can easily detect whether a given tree block is directly referenced by other snapshot. We can also detect any indirect reference from other snapshot by checking reference's generation. The checker can always detect multiple references, but can't reliably detect cases of single reference. So btrfs may do file data cow even there is only one reference. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ab78c84d |
|
29-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Throttle operations if the reference cache gets too large A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
017e5369 |
|
28-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Leaf reference cache update This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31153d81 |
|
28-Jul-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add a leaf reference cache Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3a115f52 |
|
23-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Rev the disk format magic Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7b128766 |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Create orphan inode records to prevent lost files after a crash Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
33268eaf |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Add ACL support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6099afe8 |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Remove unused xattr code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
aec7477b |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Implement new dir index format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3eaa2885 |
|
24-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix the defragmention code and the block relocation code for data=ordered Before setting an extent to delalloc, the code needs to wait for pending ordered extents. Also, the relocation code needs to wait for ordered IO before scanning the block group again. This is because the extents are not removed until the IO for the new extents is finished Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4881ee5a |
|
24-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix some build problems on 2.6.18 based enterprise kernels Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c286ac48 |
|
22-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: alloc_mutex latency reduction This releases the alloc_mutex in a few places that hold it for over long operations. btrfs_lookup_block_group is changed so that it doesn't need the mutex at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6dddcbeb |
|
22-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use mutex_lock_nested for tree locking Lockdep has the notion of locking subclasses so that you can identify locks you expect to be taken after other locks of the same class. This changes the per-extent buffer btree locking routines to use a subclass based on the level in the tree. Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs max level is also 8. So when lockdep is on, use a lower max level. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f421950f |
|
22-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix some data=ordered related data corruptions Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3edf7d33 |
|
18-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Handle data checksumming on bios that span multiple ordered extents Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f9295749 |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
btrfs_start_transaction: wait for commits in progress to finish btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
247e743c |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use async helpers to deal with pages that have been improperly dirtied Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e6dcd2dc |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: New data=ordered implementation The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7d9eb12c |
|
08-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add locking around volume management (device add/remove/balance) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3f157a2f |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Online btree defragmentation fixes The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e7a84565 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add btree locking to the tree defragmentation code The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a74a4b97 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the transaction work queue with kthreads This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5cd57b2c |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
051e1b9f |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Drop locks in btrfs_search_slot when reading a tree block. One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a2135011 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the big fs_mutex with a collection of other locks Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
925baedd |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Start btree concurrency work. The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1cc127b5 |
|
12-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a thread pool just for submit_bio If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f46b5a66 |
|
11-Jun-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: split out ioctl.c Split the ioctl handling out of inode.c into a file of it's own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4543df7e |
|
11-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a mount option to control worker thread pool size mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8b712842 |
|
11-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add async worker threads for pre and post IO checksumming Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
edf24abe |
|
10-Jun-2008 |
Christoph Hellwig <hch@lst.de> |
btrfs: sanity mount option parsing and early mount code Also adds lots of comments to describe what's going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6bf13c0c |
|
10-Jun-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: transaction ioctls These ioctls let a user application hold a transaction open while it performs a series of operations. A final ioctl does a sync on the fs (closing the current transaction). This is the main requirement for Ceph's OSD to be able to keep the data it's storing in a btrfs volume consistent, and AFAICS it works just fine. The application would do something like fd = ::open("some/file", O_RDONLY); ::ioctl(fd, BTRFS_IOC_TRANS_START); /* do a bunch of stuff */ ::ioctl(fd, BTRFS_IOC_TRANS_END); or just ::close(fd); And to ensure it commits to disk, ::ioctl(fd, BTRFS_IOC_SYNC); When a transaction is held open, the trans_handle is attached to the struct file (via private_data) so that it will get cleaned up if the process dies unexpectedly. A held transaction is also ended on fsync() to avoid a deadlock. A misbehaving application could also deliberately hold a transaction open, effectively locking up the FS, so it may make sense to restrict something like this to root or something. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3b96362c |
|
09-Jun-2008 |
Sven Wegener <sven.wegener@stealer.net> |
Btrfs: Invalidate dcache entry after creating snapshot and We need to invalidate an existing dcache entry after creating a new snapshot or subvolume, because a negative dache entry will stop us from accessing the new snapshot or subvolume. --- ctree.h | 23 +++++++++++++++++++++++ inode.c | 4 ++++ transaction.c | 4 ++++ 3 files changed, 31 insertions(+) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ef3e66b |
|
24-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allocator fix variety pack * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cb03c743 |
|
15-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change the congestion functions to meter the number of async submits as well The async submit workqueue was absorbing too many requests, leading to long stalls where the async submitters were stalling. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dfe25020 |
|
13-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount -o degraded to allow mounts to continue with missing devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a68d5933 |
|
08-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Update nodatacow mode to support cloned single files and resizing Before, nodatacow only checked to make sure multiple roots didn't have references on a single extent. This check makes sure that multiple inodes don't have references. nodatacow needed an extra check to see if the block group was currently readonly. This way cows forced by the chunk relocation code are honored. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bf4ef679 |
|
08-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Properly find the root for snapshotted blocks during chunk relocation Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a061fc8d |
|
07-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for online device removal This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f2eb0a24 |
|
02-May-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: Clone file data ioctl Add a new ioctl to clone file data Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ec44a35c |
|
28-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add balance ioctl to restripe the chunks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
788f20eb |
|
28-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add new ioctl to add devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8f18cf13 |
|
25-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Make the resizer work based on shrinking and growing devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7ae9c09d |
|
18-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for labels in the super block Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a443755f |
|
18-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Check device uuids along with devids Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e015640f |
|
16-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Write bio checksumming outside the FS mutex This significantly improves streaming write performance by allowing concurrency in the data checksumming. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
44b8bd7e |
|
16-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Create a work queue for bio writes This allows checksumming to happen in parallel among many cpus, and keeps us from bogging down pdflush with the checksumming code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
321aecc6 |
|
16-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add RAID10 support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e17cade2 |
|
15-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add chunk uuids and update multi-device back references Block headers now store the chunk tree uuid Chunk items records the device uuid for each stripes Device extent items record better back refs to the chunk tree Block groups record better back refs to the chunk tree The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
98d20f67 |
|
14-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Add a min size parameter to btrfs_alloc_extent On huge machines, delayed allocation may try to allocate massive extents. This change allows btrfs_alloc_extent to return something smaller than the caller asked for, and the data allocation routines will loop over the allocations until it fills the whole delayed alloc. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ce9adaa5 |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Do metadata checksums for reads via a workqueue Before, metadata checksumming was done by the callers of read_tree_block, which would set EXTENT_CSUM bits in the extent tree to show that a given range of pages was already checksummed and didn't need to be verified again. But, those bits could go away via try_to_releasepage, and the end result was bogus checksum failures on pages that never left the cache. The new code validates checksums when the page is read. It is a little tricky because metadata blocks can span pages and a single read may end up going via multiple bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d18a2c44 |
|
04-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix allocation profile init Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
611f0e00 |
|
03-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for duplicate blocks on a single spindle Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8790d502 |
|
03-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for mirroring across drives Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
63b10fc4 |
|
01-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Reorder the flags field in struct btrfs_header and record a flag on writeout This allows detection of blocks that have already been written in the running transaction so they can be recowed instead of modified again. It is step one in trusting the transid field of the block pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
04160088 |
|
26-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Create a btrfs backing dev info This allows intelligent versions of unplug and congestion functions Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
593060d7 |
|
25-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Implement raid0 when multiple devices are present Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8a4b83cc |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for device scanning and detection ioctls Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
239b14b3 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Bring back mount -o ssd optimizations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0d81ba5d |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Move device information into the super block so it can be scanned Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e085def2 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Make the FS tree the last objectid in the tree of tree roots Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6324fbf3 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Dynamic chunk and block group allocation Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0b86a832 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for multiple devices per filesystem Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
065631f6 |
|
19-Feb-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: checksum file data at bio submission time instead of during writepage When we checkum file data during writepage, the checksumming is done one page at a time, making it difficult to do bulk metadata modifications to insert checksums for large ranges of the file at once. This patch changes btrfs to checksum on a per-bio basis instead. The bios are checksummed before they are handed off to the block layer, so each bio is contiguous and only has pages from the same inode. Checksumming on a bio basis allows us to insert and modify the file checksum items in large groups. It also allows the checksumming to be done more easily by async worker threads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
df68b8a7 |
|
15-Feb-2008 |
David Miller <davem@davemloft.net> |
Btrfs: unaligned access fixes Btrfs set/get macros lose type information needed to avoid unaligned accesses on sparc64. ere is a patch for the kernel bits which fixes most of the unaligned accesses on sparc64. btrfs_name_hash is modified to return the hash value instead of getting a return location via a (potentially unaligned) pointer. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9069218d |
|
08-Feb-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix i_blocks accounting Now that delayed allocation accounting works, i_blocks accounting is changed to only modify i_blocks when extents inserted or removed. The fillattr call is changed to include the delayed allocation byte count in the i_blocks result. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
47b0c4f8 |
|
04-Feb-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Update magic Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4529ba49 |
|
31-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add data block hints to SSD mode too Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6f568d35 |
|
29-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: mount -o max_inline=size to control the maximum inline extent size Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9c58309d |
|
29-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add inode item and backref in one insert, reducing cpu usage Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
85e21bac |
|
29-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: During deletes and truncate, remove many items at once from the tree Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d1310b2e |
|
24-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Split the extent_map code into two parts There is now extent_map for mapping offsets in the file to disk and extent_io for state tracking, IO submission and extent_bufers. The new extent_map code shifts from [start,end] pairs to [start,len], and pushes the locking out into the caller. This allows a few performance optimizations and is easier to use. A number of extent_map usage bugs were fixed, mostly with failing to remove extent_map entries when changing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5f56406a |
|
22-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix hole insertion corner cases There were a few places that could cause duplicate extent insertion, this adjusts the code that creates holes to avoid it. lookup_extent_map is changed to correctly return all of the extents in a range, even when there are none matching at the start of the range. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e18e4809 |
|
18-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount -o ssd, which includes optimizations for seek free storage Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2da98f00 |
|
16-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Run igrab on data=ordered inodes to prevent deadlocks during writeout Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cee36a03 |
|
15-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Rework btrfs_drop_inode to avoid scheduling Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
61295eb8 |
|
14-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add drop inode func to avoid data=ordered deadlock Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
69a32ac5 |
|
14-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change magic string to reflect new format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fdebe2bd |
|
14-Jan-2008 |
Yan <yanzheng@21cn.com> |
Btrfs: Add readonly inode flag This patch adds readonly inode flag support. A file with this flag can't be modified, but can be deleted. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
21ad10cf |
|
09-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add flush barriers on commit Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b98b6767 |
|
08-Jan-2008 |
Yan <yanzheng@21cn.com> |
Btrfs: Add inode flags support This patch adds NODATASUM & NODATACOW inode flags support. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e2008b61 |
|
08-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add some simple throttling to wait for data=ordered and snapshot deletion Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dc17ff8f |
|
08-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add data=ordered support This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4313b399 |
|
03-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Reduce stack usage in the resizer, fix 32 bit compiles Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8f662a76 |
|
02-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for testing Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
edbd8d4e |
|
21-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Support for online FS resize (grow and shrink) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1832a6d5 |
|
21-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Implement basic support for -ENOSPC This is intended to prevent accidentally filling the drive. A determined user can still make things oops. It includes some accounting of the current bytes under delayed allocation, but this will change as things get optimized Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6da6abae |
|
18-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Back port to 2.6.18-el kernels Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c59f8951 |
|
17-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount option to enforce a max extent size Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
be20aa9d |
|
17-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount option to turn off data cow A number of workloads do not require copy on write data or checksumming. mount -o nodatasum to disable checksums and -o nodatacow to disable both copy on write and checksumming. In nodatacow mode, copy on write is still performed when a given extent is under snapshot. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b6cda9bc |
|
14-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount -o nodatasum to turn of file data checksumming Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f6dbff55 |
|
13-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Reorder extent back refs to differentiate btree blocks from file data Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3954401f |
|
12-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add back pointers from the inode to the directory that references it Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7bb86316 |
|
11-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add back pointers from extents to the btree or file referencing them Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
74493f7a |
|
11-Dec-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Implement generation numbers in block pointers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
87ee04eb |
|
30-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add simple stripe size parameter Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
00f5c795 |
|
30-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
btrfs_drop_extents: make sure the item is getting smaller before truncate Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
324ae4df |
|
16-Nov-2007 |
Yan <yanzheng@21cn.com> |
Btrfs: Add block group pinned accounting back This patch adds a helper function 'update_pinned_extents' to extent-tree.c. The usage of the helper function is similar to 'update_block_group', the last parameter of the function indicates pin vs unpin. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5103e947 |
|
16-Nov-2007 |
Josef Bacik <jbacik@redhat.com> |
xattr support for btrfs Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e644d021 |
|
06-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Fix recursive KM_USER1 usage in btrfs_realloc_node Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f84a8b36 |
|
06-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Optimize allocations as we need to mix data and metadata into one group Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
179e29e4 |
|
01-Nov-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix a number of inline extent problems that Yan Zheng reported. The fixes do a number of things: 1) Most btrfs_drop_extent callers will try to leave the inline extents in place. It can truncate bytes off the beginning of the inline extent if required. 2) writepage can now update the inline extent, allowing mmap writes to go directly into the inline extent. 3) btrfs_truncate_in_transaction truncates inline extents 4) extent_map.c fixed to not merge inline extent mappings and hole mappings together Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f578d4bd |
|
25-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Optimize csum insertion to create larger items when possible This reduces the number of calls to btrfs_extend_item and greatly lowers the cpu usage while writing large files. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6d7231f7 |
|
19-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix typo: owner is a 64 bit field Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a6b6e75e |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Defrag only leaves, and only when the parent node has a single objectid This allows us to defrag huge directories, but skip the expensive defrag case in more common usage, where it does not help as much. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
19c00ddc |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add back metadata checksumming Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f82731f |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
810191ff |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: extent_map optimizations to cut down on CPU usage Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3326d1b0 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow tails larger than one page Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
14048ed0 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Cache extent buffer mappings Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
db94535d |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow tree blocks larger than the page size Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1a5bc167 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
96b5179d |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Stop using radix trees for the block group cache Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f510cfec |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix extent_buffer and extent_state leaks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6d36dcd4 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Avoid memcpy where possible in extent_buffers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
479965d6 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Optimizations for the extent_buffer code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5f39d397 |
|
15-Oct-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Create extent_buffer interface for large blocksizes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
34287aa3 |
|
14-Sep-2007 |
Christoph Hellwig <hch@lst.de> |
Btrfs: use unlocked_ioctl No reason to grab the BKL before calling into the btrfs ioctl code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5ce14bbc |
|
11-Sep-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Find and remove dead roots the first time a root is loaded. Dead roots are trees left over after a crash, and they were either in the process of being removed or were waiting to be removed when the box crashed. Before, a search of the entire tree of root pointers was done on mount looking for dead roots. Now, the search is done the first time we load a root. This makes mount faster when there are a large number of snapshots, and it enables the block accounting code to properly update the block counts on the latest root as old versions of the root are reaped after a crash. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
011410bd |
|
10-Sep-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add more synchronization before creating a snapshot File data checksums are only done during writepage, so we have to make sure all pages are written when the snapshot is taken. This also adds some locking so that new writes don't race in and add new dirty pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
95e05289 |
|
29-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use mount -o subvol to select the subvol directory instead of dev: Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
58176a96 |
|
29-Aug-2007 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Add per-root block accounting and sysfs entries Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a52d9a80 |
|
27-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Extent based page cache code. This uses an rbtree of extents and tests instead of buffer heads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
15ee9bc7 |
|
10-Aug-2007 |
Josef Bacik <jwhiter@redhat.com> |
Btrfs: delay commits during fsync to allow more writers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e9d0b13b |
|
10-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Btree defrag on the extent-mapping tree as well Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
26b8003f |
|
08-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace extent tree preallocation code with some bit radix magic. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f4468e94 |
|
08-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Let some locks go during defrag and snapshot dropping Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6702ed49 |
|
07-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add run time btree defrag, and an ioctl to force btree defrag This adds two types of btree defrag, a run time form that tries to defrag recently allocated blocks in the btree when they are still in ram, and an ioctl that forces defrag of all btree blocks. File data blocks are not defragged yet, but this can make a huge difference in sequential btree reads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3c69faec |
|
07-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fold some btree readahead routines into something more generic. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9f3a7427 |
|
07-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Do snapshot deletion in smaller chunks. Before, snapshot deletion was a single atomic unit. This caused considerable lock contention and required an unbounded amount of space. Now, the drop_progress field in the root item is used to indicate how far along snapshot deletion is, and to resume where it left off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ccd467d6 |
|
28-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: crash recovery fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4b52dff6 |
|
26-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix super block updates during transaction commit The super block written during commit was not consistent with the state of the trees. This change adds an in-memory copy of the super so that we can make sure to write out consistent data during a commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5eda7b5e |
|
22-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add the ability to find and remove dead roots after a crash. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
54aa1f4d |
|
22-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
11bd143f |
|
22-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Switch to libcrc32c to avoid problems with cryptomgr on highmem machines Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9ebefb18 |
|
15-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: patch queue: page_mkwrite Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6cbd5570 |
|
12-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add GPLv2 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
39279cc3 |
|
12-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: split up super.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5276aeda |
|
11-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix oops after block group lookup Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0cf6c620 |
|
09-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: remove device tree Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
facda1e7 |
|
08-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: get forced transaction commits via workqueue Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
08607c1b |
|
08-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add compat ioctl Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fabb5681 |
|
07-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: d_type optimization Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1e2677e0 |
|
29-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: block group switching Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1de037a4 |
|
29-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fixup various fsx failures Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3a686375 |
|
24-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: sparse files! Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e06afa83 |
|
23-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: rename Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f9f3c6b6 |
|
21-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: 2.6.21-git fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
de428b63 |
|
18-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allocator optimizations, truncate readahead Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
509659cd |
|
09-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: switch to crc32c instead of sha256 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e37c9e69 |
|
09-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: many allocator fixes, pretty solid Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3e1ad54f |
|
07-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allocator and tuning Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
be744175 |
|
06-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: more allocator enhancements Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
be08c1b9 |
|
03-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early metadata/data split Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
35b7e476 |
|
02-May-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix page cache memory leak Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31f3c99b |
|
30-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allocator improvements, inode block groups Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cd1bc465 |
|
27-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: more block allocator work Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9078a3e1 |
|
26-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: start of block group code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f2458e1d |
|
25-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: change around extent-tree prealloc Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c62a1920 |
|
23-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: get rid of the extent_item type field Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4d775673 |
|
20-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add owner and type fields to the extents aand block headers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e66f709b |
|
20-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: write barriers on commit, balance level before split Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8fd17795 |
|
19-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early fsync support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7e38180e |
|
19-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: directory inode index is back Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
236454df |
|
19-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: many file_write fixes, inline data Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a429e513 |
|
18-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: working file_write, reorganized key flags Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
70b2befd |
|
17-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: rework csums and extent item ordering Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b18c6685 |
|
17-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: progress on file_write Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6567e837 |
|
16-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early work to file_write in big extents Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b4100d64 |
|
11-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a device id to device items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7eccb903 |
|
11-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: create a logical->phsyical block number mapping scheme Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0bd93ba0 |
|
11-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early support for multiple devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cac87faa |
|
11-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use a dedicated inode num for root root dir Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d0dbc624 |
|
09-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: drop owner and parentid Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1b05da2e |
|
09-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: drop the inode map tree Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c5739bba |
|
10-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: snapshot progress Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f7d52f4 |
|
09-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: groundwork for subvolume and snapshot roots Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d6e4a428 |
|
06-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: start of support for many FS volumes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5be6f7f1 |
|
05-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: dirindex optimizations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7fcde0e3 |
|
04-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: finish off inode indexing in dirs, add overflows Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5f26f772 |
|
05-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: more inode indexed directory work Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bae45de0 |
|
04-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add dir inode index Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b1a4d965 |
|
04-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: tweak the inode-map and free extent search starts on cold mount Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2c90e5d6 |
|
02-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: still corruption hunting Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d6025579 |
|
30-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: corruption hunt continues Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f254e52c |
|
29-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: verify csums on read Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
87cbda5c |
|
28-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: sha256 csums on metadata Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d98237b3 |
|
28-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use a btree inode instead of sb_getblk Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9773a788 |
|
27-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: byte offsets for file keys Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
71951f35 |
|
27-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add generation field to file extent Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9a6f11ed |
|
27-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: split out level field in struct header Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6407bf6d |
|
27-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: reference counts on data extents Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dee26a9f |
|
26-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
btrfs_get_block, file read/write Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8ef97622 |
|
26-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a radix back bit tree Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d561c025 |
|
23-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: very minimal locking Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7f5c1516 |
|
23-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Add generation number to btrfs_header, readdir fixes, hash collision fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d5719762 |
|
23-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
btrfs_create, btrfs_write_super, btrfs_sync_fs Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
79154b1b |
|
22-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: transaction rework Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e20d96d6 |
|
21-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Mountable btrfs, with readdir Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2e635a27 |
|
21-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: initial move to kernel module land Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1261ec42 |
|
20-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Better block record keeping, real mkfs Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
293ffd5f |
|
20-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: change dir-test to insert inode_items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9f5fae2f |
|
20-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add inode map, and the start of file extent items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e089f05c |
|
16-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: transaction handles everywhere Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
88fd146c |
|
16-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: pin freed blocks from the FS tree too Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a8a2ee0c |
|
16-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add a name_len to dir items, reorder key Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1e1d2701 |
|
15-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add inode item Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1d4f6404 |
|
15-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: directory testing code and dir item fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
62e2749e |
|
14-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use a chunk of the key flags to record the item type. Add (untested and simple) directory item code Fix comp_keys to use the new key ordering Add btrfs_insert_empty_item Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a1516c89 |
|
14-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: reorder key offset and flags Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
123abc88 |
|
14-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: variable block size support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4beb1b8b |
|
14-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add leaf data casting helper Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3768f368 |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Change the super to point to a tree of trees to enable persistent snapshots Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
234b63a0 |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
rename funcs and structs to btrfs Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cf27e1ee |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: struct extent_item endian Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1d4f8a0c |
|
13-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: node->blockptrs endian fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0783fcfc |
|
12-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: struct item endian fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e2fa7227 |
|
12-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: struct key endian fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bb492bb0 |
|
11-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add sparse endian annotations to struct header rename struct header to btrfs_header Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7518a238 |
|
11-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: get/set for struct header fields Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0579da42 |
|
07-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fixup last found extent caching Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a28ec197 |
|
06-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fixup reference counting on cows Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
02217ed2 |
|
02-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early reference counting Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ed2ff2cb |
|
01-Mar-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: pretend page cache & commit code Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fec577fb |
|
26-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add fsx-style randomized tree tester Add debug-tree command to print the tree Add extent-tree.c to the repo Comment ctree.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5de08d7d |
|
24-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Break up ctree.c a little Extent fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9a8dd150 |
|
23-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Block sized tree extents and extent deletion Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cfaa7295 |
|
21-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: extent fixes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d97e63b6 |
|
20-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: early extent mapping support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
eb60ceac |
|
02-Feb-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add backing store, memory management Signed-off-by: Chris Mason <chris.mason@oracle.com>
|