History log of /linux-master/fs/btrfs/extent-io-tree.c
Revision Date Author Comments
# ef5a05c5 24-Feb-2024 Chengming Zhou <zhouchengming@bytedance.com>

btrfs: remove SLAB_MEM_SPREAD flag use

The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b712e3b 25-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: remove unused included headers

With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8fd2b12e 02-Jan-2024 Josef Bacik <josef@toxicpanda.com>

btrfs: WARN_ON_ONCE() in our leak detection code

fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak
detection code (enabled only in debug builds) so that fstests will fail
if these things trip at all. This will allow us to easily catch
problems with our reference counting that may otherwise go unnoticed.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 637e6e0f 30-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES

The file_extent_tree was added in 41a2ee75aab0 ("btrfs: introduce
per-inode file extent tree") so we have an explicit mapping of the file
extents to know where it is safe to update i_size. When the feature
NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree
is not necessary.

To save some space in the inode, allocate the tree only when necessary.
This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release
config.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 738290c0 21-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: always set extent_io_tree::inode and drop fs_info

The extent_io_tree is embedded in several structures, notably in struct
btrfs_inode. The fs_info is only used for reporting errors and for
reference in trace points. We can get to the pointer through the inode,
but not all io trees set it. However, we always know the owner and
can recognize if inode is valid. For access helpers are provided, const
variant for the trace points.

This reduces size of extent_io_tree by 8 bytes and following structures
in turn:

- btrfs_inode 1104 -> 1088
- btrfs_device 520 -> 512
- btrfs_root 1360 -> 1344
- btrfs_transaction 456 -> 440
- btrfs_fs_info 3600 -> 3592
- reloc_control 1520 -> 1512

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 70146f2b 21-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: enhance extent_io_tree error reports

Pass the type of the extent io tree operation which failed in the report
helper. The message wording and contents is updated, though locking
might be the cause of the error it's probably not the only one and we're
interested in the state.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab76c43e 21-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: drop error message in extent_io_tree insert_state()

The helper insert_state errors are handled in all callers and reported
by extent_io_tree_panic so we don't need to do it twice.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 516095cd 21-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: move lockdep class setting out of extent_io_tree_init

The per-inode file extent tree was added in 41a2ee75aab0 ("btrfs:
introduce per-inode file extent tree"), it's the only tree type
that requires the lockdep class. Move it to the file where it is
actually used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 49542050 20-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: remove unused definition of tree_entry in extent-io-tree.c

The declaration was temporarily moved in a4055213bf69 ("btrfs: unexport
all the temporary exports for extent-io-tree.c") and then should have
been removed in 6.0 in 071d19f5130f ("btrfs: remove struct tree_entry in
extent-io-tree.c") but was not. This was found by tool
https://github.com/jirislaby/clang-struct .

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# efba1454 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make sure we cache next state in find_first_extent_bit()

Currently, at find_first_extent_bit(), when we are given a cached extent
state that happens to have its end offset match the desired range start,
we find the next extent state using that cached state, with next_state()
calls, and then return it.

We then try to cache that next state by calling cache_state_if_flags(),
but that will not cache the state because we haven't reset *cached_state
to NULL, so we end up with the cached_state unchanged, and if the caller
is iterating over extent states in the io tree, its next call to
find_first_extent_bit() will not use the current cached state as its end
offset does not match the minimum start range offset, therefore the cached
state is reset and we have to search the rbtree to find the next suitable
extent state record.

So fix this by resetting the cached state to NULL (and dropping our ref
on it) when we have a suitable cached state and we found a next state by
using next_state() starting from the cached state. This makes use cases
of calling find_first_extent_bit() to go over all ranges in the io tree
to do a single rbtree full search, only on the first call, and the next
calls will just do next_state() (rb_next() wrapper) calls, which is more
efficient.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 63ffc1f7 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make tree iteration in extent_io_tree_release() more efficient

Currently extent_io_tree_release() is a loop that keeps getting the first
node in the io tree, using rb_first() which is a loop that gets to the
leftmost node of the rbtree, and then for each node it calls rb_erase(),
which often requires rebalancing the rbtree.

We can make this more efficient by using
rbtree_postorder_for_each_entry_safe() to free each node without having
to delete it from the rbtree and without looping to get the first node.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# df2a8e70 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: collapse wait_on_state() to its caller wait_extent_bit()

The wait_on_state() function is very short and has a single caller, which
is wait_extent_bit(), so remove the function and put its code into the
caller.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 28967c76 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: remove redundant memory barrier from extent_io_tree_release()

The memory barrier at extent_io_tree_release() is redundant. Holding
spin_lock here is not enough to drop the barrier completely. We only
change the waitqueue of an extent state record while holding the tree
lock - see wait_on_state().

The update to waitqueue state will not become stale because there will
be an spin_unlock/spin_lock sequence between the change and waiting,
this implies a full memory barrier.

So remove the explicit smp_mb() barrier.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reword reasoning ]
Signed-off-by: David Sterba <dsterba@suse.com>


# a1c20d15 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make wait_extent_bit() static

The function wait_extent_bit() is not used outside extent-io-tree.c so
make it static. Furthermore the function doesn't have the 'btrfs_' prefix.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bea22a58 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: update stale comment at extent_io_tree_release()

There's this comment at extent_io_tree_release() that mentions io btrees,
but this function is no longer used only for io btrees. Originally it was
added as a static function named clear_btree_io_tree() at transaction.c,
in commit 663dfbb07774 ("Btrfs: deal with convert_extent_bit errors to
avoid fs corruption"), as it was used only for cleaning one of the io
trees that track dirty extent buffers, the dirty_log_pages io tree of a
a root and the dirty_pages io tree of a transaction. Later it was renamed
and exported and now it's used to cleanup other io trees such as the
allocation state io tree of a device or the csums range io tree of a log
root.

So remove that comment and replace it with one at the top of the function
that is more complete, mentioning what the function does and that it's
expected to be called only when a task is sure no one else will need to
use the tree anymore, as well as there should be no locked ranges in the
tree and therefore no waiters on its extent state records. Also add an
assertion to check that there are no locked extent state records in the
tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c91ea4bf 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make extent state merges more efficient during insertions

When inserting a new extent state record into an io tree that happens to
be mergeable, we currently do the following:

1) Insert the extent state record in the io tree's rbtree. This requires
going down the tree to find where to insert it, and during the
insertion we often need to balance the rbtree;

2) We then check if the previous node is mergeable, so we call rb_prev()
to find it, which requires some looping to find the previous node;

3) If the previous node is mergeable, we adjust our node to include the
range of the previous node and then delete the previous node from the
rbtree, which again may need to balance the rbtree;

4) Then we check if the next node is mergeable with the node we inserted,
so we call rb_next(), which requires some looping too. If the next node
is indeed mergeable, we expand the range of our node to include the
next node's range and then delete the next node from the rbtree, which
again may need to balance the tree.

So these are quite of lot of iterations and looping over the rbtree, and
some of the operations may need to rebalance the rb tree. This can be made
a bit more efficient by:

1) When iterating the rbtree, once we find a node that is mergeable with
the node we want to insert, we can just adjust that node's range with
the range of the node to insert - this avoids continuing iterating
over the tree and deleting a node from the rbtree;

2) If we expand the range of a mergeable node, then we find the next or
the previous node, depending on other we merged a range to the right or
to the left of the node we are currently at during the iteration. This
merging is as before, we find the next or previous node with rb_next()
or rb_prev() and if that other node is mergeable with the current one,
we adjust the range of the current node and remove the other node from
the rbtree;

3) Whenever we need to insert the new extent state record it's because
we don't have any extent state record in the rbtree which can be
merged, so we can remove the call to merge_state() after the insertion,
saving rb_next() and rb_prev() calls, which require some looping.

So update the insertion function insert_state() to have this behaviour.

Running dbench for 120 seconds and capturing the execution times of
set_extent_bit() at pin_down_extent(), resulted in the following data
(time values are in nanoseconds):

Before this change:

Count: 2278299
Range: 0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952
Percentiles: 90th: 1187.000; 95th: 1350.000; 99th: 1724.000
0.000 - 7.534: 5 |
7.534 - 35.418: 36 |
35.418 - 154.403: 273 |
154.403 - 662.138: 1244016 #####################################################
662.138 - 2828.745: 1031335 ############################################
2828.745 - 12074.102: 1395 |
12074.102 - 51525.930: 806 |
51525.930 - 219874.955: 162 |
219874.955 - 938254.688: 22 |
938254.688 - 4003728.000: 3 |

After this change:

Count: 2275862
Range: 0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785
Percentiles: 90th: 1105.000; 95th: 1245.000; 99th: 1590.000
0.000 - 10.219: 10 |
10.219 - 40.957: 36 |
40.957 - 155.907: 262 |
155.907 - 585.789: 1127214 ####################################################
585.789 - 2193.431: 1145134 #####################################################
2193.431 - 8205.578: 1648 |
8205.578 - 30689.378: 1039 |
30689.378 - 114772.699: 362 |
114772.699 - 429221.537: 52 |
429221.537 - 1605175.000: 10 |

Maximum duration (range), average duration, percentiles and standard
deviation are all better.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 893fe243 14-Aug-2020 David Sterba <dsterba@suse.com>

btrfs: change test_range_bit to scan the whole range

The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba <dsterba@suse.com>


# 99be1a66 11-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add specific helper for range bit test exists

The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence. By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.

There's no caller that uses the cached state pointer so this reduces the
argument count further.

Signed-off-by: David Sterba <dsterba@suse.com>


# e5860f82 30-Jun-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make find_first_extent_bit() return a boolean

Currently find_first_extent_bit() returns a 0 if it found a range in the
given io tree and 1 if it didn't find any. There's no need to return any
errors, so make the return value a boolean and invert the logic to make
more sense: return true if it found a range and false if it didn't find
any range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1d126800 24-May-2023 David Sterba <dsterba@suse.com>

btrfs: drop gfp from parameter extent state helpers

Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.

Net effect on module on a release build:

text data bss dec hex filename
1250432 20985 16088 1287505 13a551 pre/btrfs.ko
1247074 20985 16088 1284147 139833 post/btrfs.ko

DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>


# 62bc6047 24-May-2023 David Sterba <dsterba@suse.com>

btrfs: pass NOWAIT for set/clear extent bits as another bit

The only flags we now pass to set_extent_bit/__clear_extent_bit are
GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This
requires an extra parameter to be passed everywhere but is almost always
the same.

Encode the GFP_NOWAIT as an artificial extent bit and extract the
real bits and gfp mask in the lowest level helpers. Now the passed
gfp mask is not actually used and can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>


# 67da05b3 17-Jan-2023 Colin Ian King <colin.i.king@gmail.com>

btrfs: fix spelling mistakes found using codespell

There quite a few spelling mistakes as found using codespell. Fix them.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 59864325 16-Dec-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: fix uninitialized variable warnings in __set_extent_bit and convert_extent_bit

We will pass in the parent and p pointer into our tree_search function
to avoid doing a second search when inserting a new extent state into
the tree. However because this is conditional upon passing in these
pointers the compiler seems to think these values can be uninitialized
if we're using -Wmaybe-uninitialized. Fix this by initializing these
values.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2f2e84ca 23-Dec-2022 Filipe Manana <fdmanana@suse.com>

btrfs: fix off-by-one in delalloc search during lseek

During lseek, when searching for delalloc in a range that represents a
hole and that range has a length of 1 byte, we end up not doing the actual
delalloc search in the inode's io tree, resulting in not correctly
reporting the offset with data or a hole. This actually only happens when
the start offset is 0 because with any other start offset we round it down
by sector size.

Reproducer:

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc

$ xfs_io -f -c "pwrite -q 0 1" /mnt/sdc/foo

$ xfs_io -c "seek -d 0" /mnt/sdc/foo
Whence Result
DATA EOF

It should have reported an offset of 0 instead of EOF.

Fix this by updating btrfs_find_delalloc_in_range() and count_range_bits()
to deal with inclusive ranges properly. These functions are already
supposed to work with inclusive end offsets, they just got it wrong in a
couple places due to off-by-one mistakes.

A test case for fstests will be added later.

Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20221223020509.457113-1-joanbrugueram@gmail.com/
Fixes: b6e833567ea1 ("btrfs: make hole and data seeking a lot more efficient")
CC: stable@vger.kernel.org # 6.1
Tested-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 26df39a9 18-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: fix uninitialized variable in find_first_clear_extent_bit

This was caught when syncing extent-io-tree.c into btrfs-progs. This
however isn't really a problem, the only way next would be uninitialized
is if we found the range we were looking for, and in this case we don't
care about next. However it's a compile error, so fix it up.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d7c9e1be 18-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: fix uninitialized parent in insert_state

I don't know how this isn't caught when we build this in the kernel, but
while syncing extent-io-tree.c into btrfs-progs I got an error because
parent could potentially be uninitialized when we link in a new node,
specifically when the extent_io_tree is empty. This means we could have
garbage in the parent color. I don't know what the ramifications are of
that, but it's probably not great, so fix this by initializing parent to
NULL. I spot checked all of our other usages in btrfs and we appear to
be doing the correct thing everywhere else.

Fixes: c7e118cf98c7 ("btrfs: open code rbtree search in insert_state")
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1ee51a06 11-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: update stale comment for count_range_bits()

The comment for count_range_bits() mentions that the search is fast if we
are asking for a range with the EXTENT_DIRTY bit set. However that is no
longer true since we don't use that bit and the optimization for that was
removed in:

commit 71528e9e16c7 ("btrfs: get rid of extent_io_tree::dirty_bytes")

So remove that part of the comment mentioning the no longer existing
optimized case, and, while at it, add proper documentation describing the
purpose, arguments and return value of the function.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8c6e53a7 11-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: allow passing a cached state record to count_range_bits()

An inode's io_tree can be quite large and there are cases where due to
delalloc it can have thousands of extent state records, which makes the
red black tree have a depth of 10 or more, making the operation of
count_range_bits() slow if we repeatedly call it for a range that starts
where, or after, the previous one we called it for. Such use cases are
when searching for delalloc in a file range that corresponds to a hole or
a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap.

So introduce a cached state parameter to count_range_bits() which we use
to store the last extent state record we visited, and then allow the
caller to pass it again on its next call to count_range_bits(). The next
patches in the series will make fiemap and lseek use the new parameter.

This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:

1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd54766e 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_clear_delalloc_extent

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 62798a49 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_split_delalloc_extent

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c5d166f 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_set_delalloc_extent

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2454151c 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_merge_delalloc_extent

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0988fc7b 27-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: switch extent_io_tree::private_data to btrfs_inode and rename

The extent_io_tree::private_data was meant to be a preparatory work for
the metadata inode rework but that never materialized. Now it's used
only for an inode so it's better to change the appropriate type and
rename it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 35da5a7e 27-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: drop private_data parameter from extent_io_tree_init

All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9b569ea0 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the printk helpers out of ctree.h

We have a bunch of printk helpers that are in ctree.h. These have
nothing to do with ctree.c, so move them into their own header.
Subsequent patches will cleanup the printk helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a75034e 14-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not panic if we can't allocate a prealloc extent state

We sometimes have to allocate new extent states when clearing or setting
new bits in an extent io tree. Generally we preallocate this before
taking the tree spin lock, but we can use this preallocated extent state
sometimes and then need to try to do a GFP_ATOMIC allocation under the
lock.

Unfortunately sometimes this fails, and then we hit the BUG_ON() and
bring the box down. This happens roughly 20 times a week in our fleet.

However the vast majority of callers use GFP_NOFS, which means that if
this GFP_ATOMIC allocation fails, we could simply drop the spin lock, go
back and allocate a new extent state with our given gfp mask, and begin
again from where we left off.

For the remaining callers that do not use GFP_NOFS, they are generally
using GFP_NOWAIT, which still allows for some reclaim. So allow these
allocations to attempt to happen outside of the spin lock so we don't
need to rely on GFP_ATOMIC allocations.

This in essence creates an infinite loop for anything that isn't
GFP_NOFS. To address this we may want to migrate to using mempools for
extent states so that we will always have emergency reserves in order to
make our allocations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 123a7f00 30-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: cache the failed state when locking extents

Currently if we fail to lock a range we'll return the start of the range
that we failed to lock. We'll then search down to this range and wait
on any extent states in this range.

However we can avoid this search altogether if we simply cache the
extent_state that had the contention. We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search. In the most optimistic case we can avoid all searches, more
likely we'll avoid the initial search and have to perform the search
after we wait on the failed state, or worst case we must search both
times which is what currently happens.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 83ae4133 30-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: add a cached_state to try_lock_extent

With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9e769bd7 30-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: unlock locked extent area if we have contention

In production we hit the following deadlock

task 1 task 2 task 3
------ ------ ------
fiemap(file) falloc(file) fsync(file)
write(0, 1MiB)
btrfs_commit_transaction()
wait_on(!pending_ordered)
lock(512MiB, 1GiB)
start_transaction
wait_on_transaction

lock(0, 1GiB)
wait_extent_bit(512MiB)

task 4
------
finish_ordered_extent(0, 1MiB)
lock(0, 1MiB)
**DEADLOCK**

This occurs because when task 1 does it's lock, it locks everything from
0-512MiB, and then waits for the 512MiB chunk to unlock. task 2 will
never unlock because it's waiting on the transaction commit to happen,
the transaction commit is waiting for the outstanding ordered extents,
and then the ordered extent thread is blocked waiting on the 0-1MiB
range to unlock.

To fix this we have to clear anything we've locked so far, wait for the
extent_state that we contended on, and then try to re-lock the entire
range again.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 23408d81 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove is_data_inode() checks in extent-io-tree.c

We're only initializing extent_io_tree's with a private data if we're a
normal inode, so we don't need this extra check.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd015294 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS

Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 71528e9e 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: get rid of extent_io_tree::dirty_bytes

This was used as an optimization for count_range_bits(EXTENT_DIRTY),
which was used by the failed record code. However this was removed in
this series by patch "btrfs: convert the io_failure_tree to a plain
rb_tree" which was the last user of this optimization. Remove the
->dirty_bytes as nobody cares anymore.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 570eb97b 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: unify the lock/unlock extent variants

We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not. This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 291bbb1e 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: drop extent_changeset from set_extent_bit

The only places that set extent_changeset is set_record_extent_bits,
everywhere else sets it to NULL. Drop this argument from
set_extent_bit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 994bcd1e 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove failed_start argument from set_extent_bit

This is only used for internal locking related helpers, everybody else
just passes in NULL. I've changed set_extent_bit to __set_extent_bit
and made it static, removed failed_start from set_extent_bit and have it
call __set_extent_bit with a NULL failed_start, and I've moved some code
down below the now static __set_extent_bit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dbbf4992 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove the wake argument from clear_extent_bits

This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c07d1004 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: drop exclusive_bits from set_extent_bit

This is only ever set if we have EXTENT_LOCKED set, so simply push this
into the function itself and remove the function argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e63b81ae 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: use next_state/prev_state in merge_state

We use rb_next/rb_prev and then get the entry for the adjacent items in
an extent io tree. We have helpers for this, so convert merge_state to
use next_state/prev_state and simplify the code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 43b068ca 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: make tree_search_prev_next return extent_state's

Instead of doing the rb_entry again once we return from this function,
simply return the actual states themselves, and then clean up the only
user of this helper to handle states instead of nodes.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e349fd3b 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: make tree_search_for_insert return extent_state

We use this to search for an extent state, or return the nodes we need
to insert a new extent state. This means we have the following pattern

node = tree_search_for_insert();
if (!node) {
/* alloc and insert. */
goto again;
}
state = rb_entry(node, struct extent_state, rb_node);

we don't use the node for anything else. Making
tree_search_for_insert() return the extent_state means we can drop the
rb_node and clean this up by eliminating the rb_entry.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# aa852dab 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: make tree_search return struct extent_state

We have a consistent pattern of

n = tree_search();
if (!n)
goto out;
state = rb_entry(n, struct extent_state, rb_node);
while (state) {
/* do something. */
}

which is a bit redundant. If we make tree_search return the state we
can simply have

state = tree_search();
while (state) {
/* do something. */
}

which cleans up the code quite a bit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ccaeff92 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: use next_state instead of rb_next where we can

We can simplify a lot of these functions where we have to cycle through
extent_state's by simply using next_state() instead of rb_next(). In
many spots this allows us to do things like

while (state) {
/* whatever */
state = next_state(state);
}

instead of

while (1) {
state = rb_entry(n, struct extent_state, rb_node);
n = rb_next(n);
if (!n)
break;
}

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 071d19f5 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove struct tree_entry in extent-io-tree.c

This existed when we overloaded the tree manipulation functions for both
the extent_io_tree and the extent buffer tree. However the extent
buffers are now stored in a radix tree, so we no longer need this
abstraction. Remove struct tree_entry and use extent_state directly
instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a4055213 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: unexport all the temporary exports for extent-io-tree.c

Now that we've moved everything we can unexport all the temporary
exports, move the random helpers, and mark everything as static again.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d8038a1f 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: unexport btrfs_debug_check_extent_io_range

We no longer need to export this as all users are in extent-io-tree.c,
remove it from the header and put it into extent-io-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e3974c66 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move core extent_io_tree functions to extent-io-tree.c

This is still huge, but unfortunately I cannot make it smaller without
renaming tree_search() and changing all the callers to use the new name,
then moving those chunks and then changing the name back. This feels
like too much churn for code movement, so I've limited this to only
things that called tree_search(). With this patch all of the
extent_io_tree code is now in extent-io-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38830018 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move a few exported extent_io_tree helpers to extent-io-tree.c

These are the last few helpers that do not rely on tree_search() and
who's other helpers are exported and in extent-io-tree.c already. Move
these across now in order to make the core move smaller.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 04eba893 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: temporarily export and then move extent state helpers

In order to avoid moving all of the related code at once temporarily
export all of the extent state related helpers. Then move these helpers
into extent-io-tree.c. We will clean up the exports and make them
static in followup patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 91af24e4 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: temporarily export and move core extent_io_tree tree functions

A lot of the various internals of extent_io_tree call these two
functions for insert or searching the rb tree for entries, so
temporarily export them and then move them to extent-io-tree.c. We
can't move tree_search() without renaming it, and I don't want to
introduce a bunch of churn just to do that, so move these functions
first and then we can move a few big functions and then the remaining
users of tree_search().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6962541e 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_debug_check_extent_io_range into extent-io-tree.c

This helper is used by a lot of the core extent_io_tree helpers, so
temporarily export it and move it into extent-io-tree.c in order to make
it straightforward to migrate the helpers in batches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a6631887 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move simple extent bit helpers out of extent_io.c

These are just variants and wrappers around the actual work horses of
the extent state. Extract these out of extent_io.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 83cf709a 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move extent state init and alloc functions to their own file

Start cleaning up extent_io.c by moving the extent state code out of it.
This patch starts with the extent state allocation code and the
extent_io_tree init code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>