History log of /linux-master/fs/btrfs/extent_io.c
Revision Date Author Comments
# 1db7959a 25-Mar-2024 Qu Wenruo <wqu@suse.com>

btrfs: do not wait for short bulk allocation

[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.

[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed (fell back to single page
allocation) we still retry but with extra memalloc_retry_wait().

If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.

The behavior was introduced in commit 395cb57e8560 ("btrfs: wait between
incomplete batch memory allocations").

[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case at least nowadays.

All the mainlined filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.

For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.

So I don't believe we should call memalloc_retry_wait() unconditionally
for short allocation.

Call memalloc_retry_wait() if it fails to allocate any page for tree
block allocation (which goes with __GFP_NOFAIL and may not need the
special handling anyway), and reduce the latency for
btrfs_alloc_page_array().

Reported-by: Julian Taylor <julian.taylor@1und1.de>
Tested-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Fixes: 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 073bda7a 25-Mar-2024 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling

Add an ASSERT to catch a faulty delayed reference item resulting from
prematurely cleared extent buffer.

Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which
is suspicious as its update will be lost.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 68879386 25-Mar-2024 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer

Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.

Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:

1) it can lead to adding faulty delayed reference item which leads to a
FS corrupted (EUCLEAN) error, and

2) it writes out cleared tree node on disk

The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.

[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/

The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.

Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.

Fixes: aa6313e6ff2b ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ef1e6823 15-Mar-2024 Tavian Barnes <tavianator@tavianator.com>

btrfs: fix race in read_extent_buffer_pages()

There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.

To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:

/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;

/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;

At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does

/* (3) */
set_extent_buffer_uptodate(eb);

/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:

Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)

When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:

BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]

Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.

Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cab1375 28-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations

During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating an extent buffer
to use as a clone of each visited leaf. Optimize this by reusing cloned
extent buffers, to avoid the freeing and re-allocation both of the extent
buffer structure itself and more importantly of the pages attached to the
extent buffer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 978b63f7 28-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix race when detecting delalloc ranges during fiemap

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the
inode's subvolume tree searching for file extent items, without having
the whole fiemap target range locked in the inode's io tree - the
change introduced recently by commit b0ad381fa769 ("btrfs: fix
deadlock with fiemap and extent locking"). It only locks ranges in
the io tree when it finds a hole or prealloc extent since that
commit;

3) Note that fiemap clones each leaf before using it, and this is to
avoid deadlocks when locking a file range in the inode's io tree and
the fiemap buffer is memory mapped to some file, because writing
to the page with btrfs_page_mkwrite() will wait on any ordered extent
for the page's range and the ordered extent needs to lock the range
and may need to modify the same leaf, therefore leading to a deadlock
on the leaf;

4) While iterating the file extent items in the cloned leaf before
finding the hole in the range [64M, 65M[, the delalloc in that range
is flushed and its ordered extent completes - meaning the corresponding
file extent item is in the inode's subvolume tree, but not present in
the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
the cloned leaf (or a file extent item with disk_bytenr == 0 in case
the NO_HOLES feature is not enabled), it will lock that file range in
the inode's io tree and then search for delalloc by checking for the
EXTENT_DELALLOC bit in the io tree for that range and ordered extents
(with btrfs_find_delalloc_in_range()). But it finds nothing since the
delalloc in that range was already flushed and the ordered extent
completed and is gone - as a result fiemap will not report that there's
delalloc or an extent for the range [64M, 65M[, so user space will be
mislead into thinking that there's a hole in that range.

This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:

generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
--- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
@@ -1,3 +1,9 @@
QA output created by 094
fiemap run with sync
fiemap run without sync
+ERROR: couldn't find extent at 7
+map is 'HHDDHPPDPHPH'
+logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
+logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:

1) Always lock the whole range in the inode's io tree before starting to
iterate the inode's subvolume tree searching for file extent items,
just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with
fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent
to report, write instead to a temporary buffer (1 page), and when that
buffer becomes full, stop iterating the file extent items, unlock the
range in the io tree, release the search path, submit all the entries
kept in that buffer to the fiemap buffer, and then resume the search
for file extent items after locking again the remainder of the range in
the io tree.

The buffer having a size of a page, allows for 146 entries in a system
with 4K pages. This is a large enough value to have a good performance
by avoiding too many restarts of the search for file extent items.
In other words this preserves the huge performance gains made in the
last two years to fiemap, while avoiding the deadlocks in case the
fiemap buffer is memory mapped to the same file (useless in practice,
but possible and exercised by fuzz testing and syzbot).

Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ef5a05c5 24-Feb-2024 Chengming Zhou <zhouchengming@bytedance.com>

btrfs: remove SLAB_MEM_SPREAD flag use

The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 970ea374 06-Feb-2024 David Sterba <dsterba@suse.com>

btrfs: pass a valid extent map cache pointer to __get_extent_map()

We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call stacks
are simple:

btrfs_read_folio
btrfs_do_readpage
__get_extent_map

extent_readahead
contiguous_readpages
btrfs_do_readpage
__get_extent_map

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e84bfffc 24-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read

The fs_info and sectorsize remain the same during the loops, no need to
set them on each iteration.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 41044b41 14-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add helper to get fs_info from struct inode pointer

Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b33d2e53 14-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add helpers to get fs_info from page/folio pointers

Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct page, folio,
btrfs_root and btrfs_fs_info. The latter can't be static inlines as this
would create loop between ctree.h <-> fs.h, or the headers would have to
be restructured.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c8293894 13-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add helpers to get inode from page/folio pointers

Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. This
is implemented as a macro (still with type checking) so we don't need
full definitions of struct page or address_space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b712e3b 25-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: remove unused included headers

With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4e00422e 16-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: replace sb::s_blocksize by fs_info::sectorsize

The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.

Replace all uses, in most cases it's fewer pointer indirections.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dfba9f47 14-Dec-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: add set_folio_extent_mapped() helper

Turn set_page_extent_mapped() into a wrapper around this version.
Saves a call to compound_head() for callers who already have a folio
and removes a couple of users of page->mapping.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8fd2b12e 02-Jan-2024 Josef Bacik <josef@toxicpanda.com>

btrfs: WARN_ON_ONCE() in our leak detection code

fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak
detection code (enabled only in debug builds) so that fstests will fail
if these things trip at all. This will allow us to easily catch
problems with our reference counting that may otherwise go unnoticed.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 84cda1a6 04-Jan-2024 Qu Wenruo <wqu@suse.com>

btrfs: cache folio size and shift in extent_buffer

After the conversion to folio interfaces (but without the patch to
enable larger folio allocation), there is an LTP report about observable
performance drop on metadata heavy operations.

https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/

This drop is caused by the extra code of calculating the
folio_size()/folio_shift(), instead of the old hard coded
PAGE_SIZE/PAGE_SHIFT.

To slightly reduce the overhead, just cache both folio_size and
folio_shift in extent_buffer.

The two new members (u32 folio_size and u8 folio_shift) are stored
inside the holes of extent_buffer. folio_size is shared with len, which
is reduced to u32. The size of eb does not change.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4d02b543 07-Jan-2024 Qu Wenruo <wqu@suse.com>

btrfs: remove unused variable bio_offset from end_bbio_data_read()

The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs:
pass bio_offset to check_data_csum() directly"), when we are still using
the same endio function for both data and metadata.

Later we had several changes to data and metadata endio functions:

- Data verification is handled by btrfs bio layer

- Split data and metadata endio paths

Now for data path we no longer do any verification in
end_bbio_data_read(), as the verification is handled by btrfs bio layer
already.

Thus there is no need for such bio_offset variable.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8bab0a30 07-Jan-2024 Qu Wenruo <wqu@suse.com>

btrfs: remove the pg_offset parameter from btrfs_get_extent()

The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.

Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.

Thus the @pg_offset parameter is not really necessary, just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 418b0902 21-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a1a4a9ca 21-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between ordered extent completion and fiemap

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:

1) The fiemap task finishes processing a file extent item that covers
the file range [512K, 1M[, and that file extent item is the last item
in the leaf currently being processed;

2) And ordered extent for the file range [768K, 2M[, in COW mode,
completes (btrfs_finish_one_ordered()) and the file extent item
covering the range [512K, 1M[ is trimmed to cover the range
[512K, 768K[ and then a new file extent item for the range [768K, 2M[
is inserted in the inode's subvolume tree;

3) The fiemap task calls fiemap_next_leaf_item(), which then calls
btrfs_next_leaf() to find the next leaf / item. This finds that the
the next key following the one we previously processed (its type is
BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
to the new file extent item inserted by the ordered extent, which has
a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;

4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
the warning:

if (cache->offset + cache->len > offset) {
WARN_ON(1);
return -EINVAL;
}

Since we get 1M > 768K, because the previously emitted entry for the
old extent covering the file range [512K, 1M[ ends at an offset that
is greater than the new extent's start offset (768K). This makes fiemap
fail with -EINVAL besides triggering the warning that produces a stack
trace like the following:

[1621.677651] ------------[ cut here ]------------
[1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.677899] Modules linked in: btrfs blake2b_generic (...)
[1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
[1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678033] Code: 2b 4c 89 63 (...)
[1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
[1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
[1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
[1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
[1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
[1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
[1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
[1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
[1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1621.678056] Call Trace:
[1621.678074] <TASK>
[1621.678076] ? __warn+0x80/0x130
[1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678159] ? report_bug+0x1f4/0x200
[1621.678164] ? handle_bug+0x42/0x70
[1621.678167] ? exc_invalid_op+0x14/0x70
[1621.678170] ? asm_exc_invalid_op+0x16/0x20
[1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678253] extent_fiemap+0x766/0xa30 [btrfs]
[1621.678339] btrfs_fiemap+0x45/0x80 [btrfs]
[1621.678420] do_vfs_ioctl+0x1e4/0x870
[1621.678431] __x64_sys_ioctl+0x6a/0xc0
[1621.678434] do_syscall_64+0x52/0x120
[1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76

There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.

So fix this by changing emit_fiemap_extent() to address these three
scenarios like this:

1) For the first case, steps listed above, adjust the length of the
previously cached extent so that it does not overlap with the current
extent, emit the previous one and cache the current file extent item;

2) For the second case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range,
and the current file extent item has an offset that matches the offset
in the fiemap cache, just discard what we have in the fiemap cache and
assign the current file extent item to the cache, since it's more up
to date;

3) For the third case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range
and the offset of the file extent item we just found is smaller than
what we have in the cache, just skip the current file extent item
if its range end at or behind the cached extent's end, because we may
have emitted (to the fiemap user space buffer) delalloc ranges that
overlap with the current file extent item's range. If the file extent
item's range goes beyond the end offset of the cached extent, just
emit the cached extent and cache a subrange of the file extent item,
that goes from the end offset of the cached extent to the end offset
of the file extent item.

Dealing with those cases in those ways makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges (which would be
confusing and violating expectations).

This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b0ad381f 12-Feb-2024 Josef Bacik <josef@toxicpanda.com>

btrfs: fix deadlock with fiemap and extent locking

While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.

This deadlock exists with our normal code, we just don't have lockdep
annotations with the extent locking so we've never noticed it.

Since we're copying the fiemap extent to user space on every iteration
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace

[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.

To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.

The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.

With this patch applied we no longer deadlock with my testcase.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f4521b01 11-Dec-2023 Qu Wenruo <wqu@suse.com>

btrfs: migrate eb_bitmap_offset() to folio interfaces

[BUG]
Test case btrfs/002 would fail if larger folios are enabled for
metadata:

assertion failed: folio, in fs/btrfs/extent_io.c:4358
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:4358!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 30916 Comm: fsstress Tainted: G OE 6.7.0-rc3-custom+ #128
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:assert_eb_folio_uptodate+0x98/0xe0 [btrfs]
Call Trace:
<TASK>
extent_buffer_test_bit+0x3c/0x70 [btrfs]
free_space_test_bit+0xcd/0x140 [btrfs]
modify_free_space_bitmap+0x27a/0x430 [btrfs]
add_to_free_space_tree+0x8d/0x160 [btrfs]
__btrfs_free_extent.isra.0+0xef1/0x13c0 [btrfs]
__btrfs_run_delayed_refs+0x786/0x13c0 [btrfs]
btrfs_run_delayed_refs+0x33/0x120 [btrfs]
btrfs_commit_transaction+0xa2/0x1350 [btrfs]
iterate_supers+0x77/0xe0
ksys_sync+0x60/0xa0
__do_sys_sync+0xa/0x20
do_syscall_64+0x3f/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
</TASK>

[CAUSE]
The function extent_buffer_test_bit() is not folio compatible.

It still assumes the old fixed page size, when an extent buffer with
large folio passed in, only eb->folios[0] is populated.

Then if the target bit range falls in the 2nd page of the folio, then we
would check eb->folios[1], and trigger the ASSERT().

[FIX]
Just migrate eb_bitmap_offset() to folio interfaces, using the
folio_size() to replace PAGE_SIZE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a700ca5e 11-Dec-2023 Qu Wenruo <wqu@suse.com>

btrfs: migrate various end io functions to folios

If we still go the old page based iterator functions, like
bio_for_each_segment_all(), we can hit middle pages of a folio (compound
page).

In that case if we set any page flag on those middle pages, we can
easily trigger VM_BUG_ON(), as for compound page flags, they should
follow their flag policies (normally only set on leading or tail pages).

To avoid such problem in the future full folio migration, here we do:

- Change from bio_for_each_segment_all() to bio_for_each_folio_all()
This completely removes the ability to access the middle page.

- Add extra ASSERT()s for data read/write paths
To ensure we only get single paged folio for data now.

- Rename those end io functions to follow a certain schema
* end_bbio_compressed_read()
* end_bbio_compressed_write()

These two endio functions don't set any page flags, as they use pages
not mapped to any address space.
They can be very good candidates for higher order folio testing.

And they are shared between compression and encoded IO.

* end_bbio_data_read()
* end_bbio_data_write()
* end_bbio_meta_read()
* end_bbio_meta_write()

The old function names are not unified:
- end_bio_extent_writepage()
- end_bio_extent_readpage()
- extent_buffer_write_end_io()
- extent_buffer_read_end_io()

They share no schema on where the "end_*io" string should be, nor can
be confusing just using "extent_buffer" and "extent" to distinguish
data and metadata paths.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55151ea9 11-Dec-2023 Qu Wenruo <wqu@suse.com>

btrfs: migrate subpage code to folio interfaces

Although subpage itself is conflicting with higher folio, since subpage
(sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
need higher order folio, there is a hidden pitfall:

- btrfs_page_*() helpers

Those helpers are an abstraction to handle both subpage and non-subpage
cases, which means we're going to pass pages pointers to those helpers.

And since those helpers are shared between data and metadata paths, it's
unavoidable to let them to handle folios, including higher order
folios).

Meanwhile for true subpage case, we should only have a single page
backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert()
to ensure that.

Also since those helpers are shared between both data and metadata, add
some extra ASSERT()s for data path to make sure we only get single page
backed folio for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8d993618 11-Dec-2023 Qu Wenruo <wqu@suse.com>

btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios

These two functions are still using the old page based code, which is
not going to handle larger folios at all.

The migration itself is going to involve the following changes:

- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()

And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:

- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.

- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.

- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4a565c80 14-Dec-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: don't double put our subpage reference in alloc_extent_buffer

This fixes as case in "btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method".

We have been seeing panics in the CI for the subpage stuff recently, it
happens on btrfs/187 but could potentially happen anywhere.

In the subpage case, if we race with somebody else inserting the same
extent buffer, the error case will end up calling
detach_extent_buffer_page() on the page twice.

This is done first in the bit

for (int i = 0; i < attached; i++)
detach_extent_buffer_page(eb, eb->pages[i];

and then again in btrfs_release_extent_buffer().

This works fine for !subpage because we're the only person who ever has
ourselves on the private, and so when we do the initial
detach_extent_buffer_page() we know we've completely removed it.

However for subpage we could be using this page private elsewhere, so
this results in a double put on the subpage, which can result in an
early freeing.

The fix here is to clear eb->pages[i] for everything we detach. Then
anything still attached to the eb is freed in
btrfs_release_extent_buffer().

Because of this change we must update
btrfs_release_extent_buffer_pages() to not use num_extent_folios,
because it assumes eb->folio[0] is set properly. Since this is only
interested in freeing any pages we have on the extent buffer we can
simply use INLINE_EXTENT_BUFFER_PAGES.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 13df3775 06-Dec-2023 Qu Wenruo <wqu@suse.com>

btrfs: cleanup metadata page pointer usage

Although we have migrated extent_buffer::pages[] to folios[], we're
still mostly using the folio_page() help to grab the page.

This patch would do the following cleanups for metadata:

- Introduce num_extent_folios() helper
This is to replace most num_extent_pages() callers.

- Use num_extent_folios() to iterate future large folios
This allows us to use things like
bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
for the folio (aka the leading/tailing page), which reduces the loop
iteration to 1 for large folios.

- Change metadata related functions to use folio pointers
Including their function name, involving:
* attach_extent_buffer_page()
* detach_extent_buffer_page()
* page_range_has_eb()
* btrfs_release_extent_buffer_pages()
* btree_clear_page_dirty()
* btrfs_page_inc_eb_refs()
* btrfs_page_dec_eb_refs()

- Change btrfs_is_subpage() to accept an address_space pointer
This is to allow both page->mapping and folio->mapping to be utilized.
As data is still using the old per-page code, and may keep so for a
while.

- Special corner case place holder for future order mismatches between
extent buffer and inode filemap
For now it's just a block of comments and a dead ASSERT(), no real
handling yet.

The subpage code would still go page, just because subpage and large
folio are conflicting conditions, thus we don't need to bother subpage
with higher order folios at all. Just folio_page(folio, 0) would be
enough.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor styling tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 082d5bb9 06-Dec-2023 Qu Wenruo <wqu@suse.com>

btrfs: migrate extent_buffer::pages[] to folio

For now extent_buffer::pages[] are still only accepting single page
pointer, thus we can migrate to folios pretty easily.

As for single page, page and folio are 1:1 mapped, including their page
flags.

This patch would just do the conversion from struct page to struct
folio, providing the first step to higher order folio in the future.

This conversion is pretty simple:

- extent_buffer::pages[] -> extent_buffer::folios[]

- page_address(eb->pages[i]) -> folio_address(eb->pages[i])

- eb->pages[i] -> folio_page(eb->folios[i], 0)

There would be more specific cleanups preparing for the incoming higher
order folio support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 09e6cef1 29-Nov-2023 Qu Wenruo <wqu@suse.com>

btrfs: refactor alloc_extent_buffer() to allocate-then-attach method

Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.

This method has the following disadvantages:

- find_or_create_page() is the legacy way of allocating new pages
With the new folio infrastructure, find_or_create_page() is just
redirected to filemap_get_folio().

- Lacks the way to support higher order (order >= 1) folios
As we can not yet let filemap give us a higher order folio.

This patch would change the workflow by the following way:

Old | new
-----------------------------------+-------------------------------------
| ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) { | for (i = 0; i < num_pages; i++) {
p = find_or_create_page(); | ret = filemap_add_folio();
/* Attach page private */ | /* Reuse page cache if needed */
/* Reused eb if needed */ |
| /* Attach page private and
| reuse eb if needed */
| }

By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eefaf0a1 05-Dec-2023 David Sterba <dsterba@suse.com>

btrfs: fix typos found by codespell

Signed-off-by: David Sterba <dsterba@suse.com>


# f86f7a75 04-Dec-2023 Filipe Manana <fdmanana@suse.com>

btrfs: use the flags of an extent map to identify the compression type

Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.

We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.

So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 397239ed 15-Nov-2023 Qu Wenruo <wqu@suse.com>

btrfs: allow extent buffer helpers to skip cross-page handling

Currently btrfs extent buffer helpers are doing all the cross-page
handling, as there is no guarantee that all those eb pages are
contiguous.

However on systems with enough memory, there is a very high chance the
page cache for btree_inode are allocated with physically contiguous
pages.

In that case, we can skip all the complex cross-page handling, thus
speeding up the code.

This patch adds a new member, extent_buffer::addr, which is only set to
non-NULL if all the extent buffer pages are physically contiguous.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b0d82384 23-Nov-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: use memset_page instead of opencoding it

Use memset_page() in memset_extent_buffer() instead of opencoding it.

This does not not change any functionality.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# aa6313e6 23-Nov-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: don't clear dirty flag of extent buffer

One a zoned filesystem, never clear the dirty flag of an extent buffer,
but instead mark it as zeroout.

On writeout, when encountering a marked extent_buffer, zero it out.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cbf44cd9 23-Nov-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: rename EXTENT_BUFFER_NO_CHECK to EXTENT_BUFFER_ZONED_ZEROOUT

EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer,
namely it is written as all zeros. This is needed in zoned mode, to
preserve I/O ordering.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cfbf07e2 16-Nov-2023 Qu Wenruo <wqu@suse.com>

btrfs: migrate to use folio private instead of page private

As a cleanup and preparation for future folio migration, this patch
would replace all page->private to folio version. This includes:

- PagePrivate()
-> folio_test_private()

- page->private
-> folio_get_private()

- attach_page_private()
-> folio_attach_private()

- detach_page_private()
-> folio_detach_private()

Since we're here, also remove the forced cast on page->private, since
it's (void *) already, we don't really need to do the cast.

For now even if we missed some call sites, it won't cause any problem
yet, as we're only using order 0 folio (single page), thus all those
folio/page flags should be synced.

But for the future conversion to utilize higher order folio, the page
<-> folio flag sync is no longer guaranteed, thus we have to migrate to
utilize folio flags.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 600f111e 17-Nov-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Rename mapping private members

It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel. To fit
with other members of struct address_space, rename them all to have an
i_ prefix. Tested with an allmodconfig build.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# a8680550 01-Dec-2023 Boris Burkov <boris@bur.io>

btrfs: don't clear qgroup reserved bit in release_folio

The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
duplicate reservations. That is two writes to that range in one
transaction shouldn't create two reservations, as the reservation will
only be freed once when the write finally goes down. Therefore, it is
never OK to clear that bit without freeing the associated qgroup
reserve. At this point, we don't want to be freeing the reserve, so mask
off the bit.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 94dbf7c0 23-Nov-2023 Qu Wenruo <wqu@suse.com>

btrfs: free the allocated memory if btrfs_alloc_page_array() fails

[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().

[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.

For the existing call sites, most of them are fine:

- btrfs_raid_bio::stripe_pages
Handled by free_raid_bio().

- extent_buffer::pages[]
Handled btrfs_release_extent_buffer_pages().

- scrub_stripe::pages[]
Handled by release_scrub_stripe().

But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.

Initially there is still the error handling in commit dd137dd1f2d7
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.

[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().

Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 893fe243 14-Aug-2020 David Sterba <dsterba@suse.com>

btrfs: change test_range_bit to scan the whole range

The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba <dsterba@suse.com>


# 99be1a66 11-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add specific helper for range bit test exists

The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence. By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.

There's no caller that uses the cached state pointer so this reduces the
argument count further.

Signed-off-by: David Sterba <dsterba@suse.com>


# 6d3a6194 24-Aug-2023 Qu Wenruo <wqu@suse.com>

btrfs: warn on tree blocks which are not nodesize aligned

A long time ago, we had some metadata chunks which started at sector
boundary but not aligned to nodesize boundary.

This led to some older filesystems which can have tree blocks only
aligned to sectorsize, but not nodesize.

Later 'btrfs check' gained the ability to detect and warn about such tree
blocks, and kernel fixed the chunk allocation behavior, nowadays those
tree blocks should be pretty rare.

But in the future, if we want to migrate metadata to folio, we cannot
have such tree blocks, as filemap_add_folio() requires the page index to
be aligned with the folio number of pages. Such unaligned tree blocks
can lead to VM_BUG_ON().

So this patch adds extra warning for those unaligned tree blocks, as a
preparation for the future folio migration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb2a836d 08-Sep-2023 Qu Wenruo <wqu@suse.com>

btrfs: check-integrity: remove btrfsic_unmount() function

The function btrfsic_mount() is part of the deprecated check-integrity
functionality.

Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9580503b 07-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: reformat remaining kdoc style comments

Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 74ee7914 18-Sep-2023 Qu Wenruo <wqu@suse.com>

btrfs: reset destination buffer when read_extent_buffer() gets invalid range

Commit f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer
read write functions") changed how we handle invalid extent buffer range
for read_extent_buffer().

Previously if the range is invalid we just set the destination to zero,
but after the patch we do nothing and error out.

This can lead to smatch static checker errors like:

fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'.
fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'.
fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'.
fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'.
fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'.

Fix those warnings by reverting back to the old memset() behavior.
By this we keep the static checker happy and would still make a lot of
noise when such invalid ranges are passed in.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b595d259 08-Sep-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: don't clear uptodate on write errors

We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup. This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.

The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap. We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.

The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.

We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect. This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.

Fix this by not clearing uptodate when we have a write error. The only
thing we should be doing in this case is setting the mapping error and
carrying on. This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.

With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.

Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0356ad41 07-Aug-2023 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: defer advancing meta write pointer

We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the buffer failed. Instead, we can advance it just before
sending the buffer.

Also, this is necessary for the following commit. In the commit, it needs
to release the zoned_meta_io_lock to allow IOs to come in and wait for them
to fill the currently active block group. If we advance the
meta_write_pointer before locking the extent buffer, the following extent
buffer can pass the meta_write_pointer check, resulting in an unaligned
write failure.

Advancing the pointer is still thread-safe as the extent buffer is locked.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ad8c051 07-Aug-2023 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: return int from btrfs_check_meta_write_pointer

Now that we have writeback_control passed to
btrfs_check_meta_write_pointer(), we can move the wbc condition in
submit_eb_page() to btrfs_check_meta_write_pointer() and return int.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7db94301 07-Aug-2023 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: introduce block group context to btrfs_eb_write_context

For metadata write out on the zoned mode, we call
btrfs_check_meta_write_pointer() to check if an extent buffer to be written
is aligned to the write pointer.

We look up a block group containing the extent buffer for every extent
buffer, which takes unnecessary effort as the writing extent buffers are
mostly contiguous.

Introduce "zoned_bg" to cache the block group working on. Also, while
at it, rename "cache" to "block_group".

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 861093ef 07-Aug-2023 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: introduce struct to consolidate extent buffer write context

Introduce btrfs_eb_write_context to consolidate writeback_control and the
exntent buffer context. This will help adding a block group context as
well.

While at it, move the eb context setting before
btrfs_check_meta_write_pointer(). We can set it here because we anyway need
to skip pages in the same eb if that eb is rejected by
btrfs_check_meta_write_pointer().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 096d2301 15-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: refactor main loop in memmove_extent_buffer()

[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).

This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).

The current code would be a burden for future changes.

[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.

Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.

The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)

Src Page boundary
|///////|
|///|////|
Dst

In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.

But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.

So we have to follow the old behavior of handling both page boundaries.

And since we're the last caller of copy_pages(), we can remove it
completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 13840f3f 15-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: refactor main loop in memcpy_extent_buffer()

[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).

This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).

The current code would be a burden for future changes.

[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.

So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().

Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.

This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 682a0bc5 15-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer()

btrfs_clone_extent_buffer() calls copy_page() at each iteration but we
can copy all pages at the end in one go if there were no errors.
This would make later conversion to folios easier.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 54948681 15-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: refactor main loop in copy_extent_buffer_full()

[BACKGROUND]
copy_extent_buffer_full() currently does different handling for regular
and subpage cases, for regular cases it does a page by page copying.
For subpage cases, it just copies the content.

This is fine for the page based extent buffer code, but for the incoming
folio conversion, it can be a burden to add a new branch just to handle
all the different combinations (subpage vs regular, one single folio vs
multi pages).

[ENHANCE]
Instead of handling the different combinations, just go one single
handling for all cases, utilizing write_extent_buffer() to do the
copying.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 730c374e 15-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: use write_extent_buffer() to implement write_extent_buffer_*id()

Helpers write_extent_buffer_chunk_tree_uuid() and
write_extent_buffer_fsid(), they can be implemented by
write_extent_buffer().

These two helpers are not that frequently used, they only get called
during initialization of a new tree block. There is not much need for
those slightly optimized versions. And since they can be easily
converted to one write_extent_buffer() call, define them as inline
helpers.

This would make later page/folio switch much easier, as all change only
need to happen in write_extent_buffer().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb22964f 15-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: refactor extent buffer bitmaps operations

[BACKGROUND]
Currently we handle extent bitmaps manually in
extent_buffer_bitmap_set() and extent_buffer_bitmap_clear().

Although with various helpers like eb_bitmap_offset() it's still a little
messy to read. The code seems to be a copy of bitmap_set(), but with
all the cross-page handling embedded into the code.

[ENHANCEMENT]
This patch would enhance the readability by introducing two helpers:

- memset_extent_buffer()
To handle the byte aligned range, thus all the cross-page handling is
done there.

- extent_buffer_get_byte()
This for the first and the last byte operations, which only need to
grab one byte, thus no need for any cross-page handling.

So we can split both extent_buffer_bitmap_set() and
extent_buffer_bitmap_clear() into 3 parts:

- Handle the first byte
If the range fits inside the first byte, we can exit early.

- Handle the byte aligned part
This is the part which can have cross-page operations, and it would
be handled by memset_extent_buffer().

- Handle the last byte

This refactoring does not only make the code a little easier to read,
but also makes later folio/page switch much easier, as the switch only
needs to be done inside memset_extent_buffer() and extent_buffer_get_byte().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 778b8785 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't redirty locked_page in run_delalloc_zoned

extent_write_locked_range currently expects that either all or no
pages are dirty when it is called. Bur run_delalloc_zoned is called
directly in the writepages path, and has the dirty bit cleared only
for locked_page and which the extent_write_cache_pages currently
operates. It currently works around this by redirtying locked_page,
but that is a bit inefficient and cumbersome. Pass a locked_page
argument to run_delalloc_zoned so that clearing the dirty bit can
be skipped on just that page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 44962ca3 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't redirty pages in compress_file_range

compress_file_range needs to clear the dirty bit before handing off work
to the compression worker threads to prevent processes coming in through
mmap and changing the file contents while the compression is accessing
the data (See commit 4adaa611020f ("Btrfs: fix race between mmap writes
and compression").

But when compress_file_range decides to not compress the data, it falls
back to submit_uncompressed_range which uses extent_write_locked_range
to write the uncompressed data. extent_write_locked_range currently
expects all pages to be marked dirty so that it can clear the dirty
bit itself, and thus compress_file_range has to redirty the page range.

Redirtying the page range is rather inefficient and also pointless,
so instead pass a pages_dirty parameter to extent_write_locked_range
and skip the redirty game entirely.

Note that compress_file_range was even redirtying the locked_page twice
given that extent_range_clear_dirty_for_io already redirties all pages
in the range, which must include locked_page if there is one.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c56cbe90 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: reduce the number of arguments to btrfs_run_delalloc_range

Instead of a separate page_started argument that tells the callers that
btrfs_run_delalloc_range already started writeback by itself, overload
the return value with a positive 1 in additio to 0 and a negative error
code to indicate that is has already started writeback, and remove the
nr_written argument as that caller can calculate it directly based on
the range, and in fact already does so for the case where writeback
wasn't started yet.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2c73162d 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: improve the delalloc_to_write calculation in writepage_delalloc

Currently writepage_delalloc adds to delalloc_to_write in every loop
operation. That is not only more work than doing it once after the
loop, but can also over-increment the counter due to rounding errors
when a new loop iteration starts with an offset into a page.

Add a new page_start variable instead of recaculation that value over
and over, move the delalloc_to_write calculation out of the loop, use
the DIV_ROUND_UP helper instead of open coding it and remove the pointless
found local variable.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0835d1e6 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the return value from extent_write_locked_range

The return value from extent_write_locked_range is ignored, and that's
fine because the error reporting happens through the mapping and
ordered_extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9783e4de 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove end_extent_writepage

end_extent_writepage is a small helper that combines a call to
btrfs_mark_ordered_io_finished with conditional error-only calls to
btrfs_page_clear_uptodate and mapping_set_error with a somewhat
unfortunate calling convention that passes and inclusive end instead
of the len expected by the underlying functions.

Remove end_extent_writepage and open code it in the 4 callers. Out
of those two already are error-only and thus don't need the extra
conditional, and one already has the mapping_set_error, so a duplicate
call can be avoided.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6648cedd 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove btrfs_writepage_endio_finish_ordered

btrfs_writepage_endio_finish_ordered is a small wrapper around
btrfs_mark_ordered_io_finished that just changs the argument passing
slightly, and adds a tracepoint.

Move the tracpoint to btrfs_mark_ordered_io_finished, which means
it now also covers the error handling in btrfs_cleanup_ordered_extent
and switch all callers to just call btrfs_mark_ordered_io_finished
directly.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ef4e88e6 28-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: split page locking out of __process_pages_contig

There is a lot of complexity in __process_pages_contig to deal with the
PAGE_LOCK case that can return an error unlike all the other actions.

Open code the page iteration for page locking in lock_delalloc_pages and
remove all the now unused code from __process_pages_contig.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 52ea5bfb 09-Jul-2023 Qu Wenruo <wqu@suse.com>

btrfs: move eb subpage preallocation out of the loop

Initially we preallocate btrfs_subpage structure in the main loop of
alloc_extent_buffer().

But later commit fbca46eb46ec ("btrfs: make nodesize >= PAGE_SIZE case
to reuse the non-subpage routine") has made sure we only go subpage
routine if our nodesize is smaller than PAGE_SIZE.

This means for that case, we only need to allocate the subpage structure
once anyway.

So this patch would make the preallocation out of the main loop. This
would slightly reduce the workload when we hold the page lock, and make
code a little easier to read.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7b365a2a 17-Jul-2023 Minjie Du <duminjie@vivo.com>

btrfs: use folio_next_index() helper in extent_write_cache_pages

Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 09c3717c 01-Aug-2023 Chris Mason <clm@fb.com>

btrfs: only subtract from len_to_oe_boundary when it is tracking an extent

bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.

Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:

- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?

The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.

This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.

The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.

The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.

Sample reproducer, note you'll need to change the path to the bdi and
device:

SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192

mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs

btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync

echo 4 > /proc/sys/vm/drop_caches

echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb

while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done

Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c256998 24-Jul-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't wait for writeback on clean pages in extent_write_cache_pages

__extent_writepage could have started on more pages than the one it was
called for. This happens regularly for zoned file systems, and in theory
could happen for compressed I/O if the worker thread was executed very
quickly. For such pages extent_write_cache_pages waits for writeback
to complete before moving on to the next page, which is highly inefficient
as it blocks the flusher thread.

Port over the PageDirty check that was added to write_cache_pages in
commit 515f4a037fb ("mm: write_cache_pages optimise page cleaning") to
fix this.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# effa24f6 24-Jul-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't stop integrity writeback too early

extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.

This is a port of write_cache_pages changes in commit 05fe478dd04e
("mm: write_cache_pages integrity fix").

Note that I've only trigger the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comment ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d394cca 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: use btrfs_finish_ordered_extent to complete buffered writes

Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ba8223d 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: open code end_extent_writepage in end_bio_extent_writepage

This prepares for switching to more efficient ordered_extent processing
and already removes the forth and back conversion from len to end back to
len.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ec63b84d 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: add an ordered_extent pointer to struct btrfs_bio

Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointer dereference to retrieve the ordered_extent instead of doing
multiple rbtree lookups per I/O.

The reference to this ordered_extent is dropped at end I/O time,
which implies that an extra one must be acquired when the bio is split.
This also requires moving the btrfs_extract_ordered_extent call into
btrfs_split_bio so that the invariant of always having a valid
ordered_extent reference for the btrfs_bio is kept.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a39da514 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: limit write bios to a single ordered extent

Currently buffered writeback bios are allowed to span multiple
ordered_extents, although that basically never actually happens since
commit 4a445b7b6178 ("btrfs: don't merge pages into bio if their page
offset is not contiguous").

Supporting bios than span ordered_extents complicates the file
checksumming code, and prevents us from adding an ordered_extent pointer
to the btrfs_bio structure. Use the existing code to limit a bio to
single ordered_extent for zoned device writes for all writes.

This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the
handling of multiple ordered_extents in btrfs_csum_one_bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7027f871 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't treat zoned writeback as being from an async helper thread

When extent_write_locked_range was originally added, it was only used
writing back compressed pages from an async helper thread. But it is
now also used for writing back pages on zoned devices, where it is
called directly from the ->writepage context. In this case we want to
be able to pass on the writeback_control instead of creating a new one,
and more importantly want to use all the normal cgroup interaction
instead of potentially deferring writeback to another helper.

Fixes: 898793d992c2 ("btrfs: zoned: write out partially allocated region")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb34dcea 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: only call __extent_writepage_io from extent_write_locked_range

__extent_writepage does a lot of things that make no sense for
extent_write_locked_range, given that extent_write_locked_range itself is
called from __extent_writepage either directly or through a workqueue,
and all this work has already been done in the first invocation and the
pages haven't been unlocked since. Call __extent_writepage_io directly
instead and open code the logic tracked in
btrfs_bio_ctrl::extent_locked.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9ecdbee8 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: move writeback_control::nr_to_write update to __extent_writepage

Move the nr_to_write accounting from __extent_writepage_io to
__extent_writepage_io as we'll grow another __extent_writepage_io that
doesn't want this accounting soon. Also drop the obsolete comment -
decrementing a counter in the on-stack writeback_control data structure
doesn't need the page lock.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f22b5dcb 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove non-standard extent handling in __extent_writepage_io

__extent_writepage_io is never called for compressed or inline extents,
or holes. Remove the not quite working code for them and replace it with
asserts that these cases don't happen.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a994310a 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove PAGE_SET_ERROR

Now that the btrfs writeback code has stopped using PageError, using
PAGE_SET_ERROR to just set the per-address_space error flag is confusing.
Open code the mapping_set_error calls in the callers and remove
the PAGE_SET_ERROR flag.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b2553f1 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: stop setting PageError in the data I/O path

PageError is not used by the VFS/MM and deprecated because it uses up a
page bit and has no coherent rules. Instead read errors are usually
propagated by not setting or clearing the uptodate bit, and write errors
are propagated through the address_space. Btrfs now only sets the flag
and never clears it for data pages, so just remove all places setting it,
and the subpage error bit.

Note that the error propagation for superblock writes that work on the
block device mapping still uses PageError for now, but that will be
addressed in a separate series.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3e92499e 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't check PageError in __extent_writepage

__extent_writepage currenly sets PageError whenever any error happens,
and the also checks for PageError to decide if to call error handling.
This leads to very unclear responsibility for cleaning up on errors.
In the VM and generic writeback helpers the basic idea is that once
I/O is fired off all error handling responsibility is delegated to the
end I/O handler. But if that end I/O handler sets the PageError bit,
and the submitter checks it, the bit could in some cases leak into the
submission context for fast enough I/O.

Fix this by simply not checking PageError and just using the local
ret variable to check for submission errors. This also fundamentally
solves the long problem documented in a comment in __extent_writepage
by never leaking the error bit into the submission context.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 57201ddd 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't check PageError in btrfs_verify_page

btrfs_verify_page is called from the readpage completion handler, which
is only used to read pages, or parts of pages that aren't uptodate yet.
The only case where PageError could be set on a page in btrfs is if we
had a previous writeback error, but in that case we won't called readpage
on it, as it has previously been marked uptodate.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2c14f0ff 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: fix fsverify read error handling in end_page_read

Also clear the uptodate bit to make sure the page isn't seen as uptodate
in the page cache if fsverity verification fails.

Fixes: 146054090b08 ("btrfs: initial fsverity support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ed9ee98e 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: factor out a btrfs_verify_page helper

Split all the conditionals for the fsverity calls in end_page_read into
a btrfs_verify_page helper to keep the code readable and make additional
refactoring easier.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 36614a3b 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: fix range_end calculation in extent_write_locked_range

The range_end field in struct writeback_control is inclusive, just like
the end parameter passed to extent_write_locked_range. Not doing this
could cause extra writeout, which is harmless but suboptimal.

Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 31dd8c81 29-May-2023 Qu Wenruo <wqu@suse.com>

btrfs: use the same uptodate variable for end_bio_extent_readpage()

In function end_bio_extent_readpage() we call
endio_readpage_release_extent() to unlock the extent io tree.

However we pass PageUptodate(page) as @uptodate parameter for it, while
for previous end_page_read() call, we use a dedicated @uptodate local
variable.

This is not a big deal, as even for subpage cases, either the bio only
covers part of the page, then the @uptodate is always false, and the
subpage ranges can still be merged.

But for the sake of consistency, always use @uptodate variable when
possible.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a963419 29-May-2023 Qu Wenruo <wqu@suse.com>

btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently

Currently alloc_extent_buffer() would make the extent buffer uptodate if
the corresponding pages are also uptodate.

But this check is only checking PageUptodate, which is fine for regular
cases, but not for subpage cases, as we can have multiple extent buffers
in the same page.

So here we go btrfs_page_test_uptodate() instead.

The old code doesn't cause any problem, but is not efficient, as it
would cause extra metadata read even if the range is already uptodate.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 75258f20 26-May-2023 Qu Wenruo <wqu@suse.com>

btrfs: subpage: dump extra subpage bitmaps for debug

There is a bug report that assert_eb_page_uptodate() gets triggered for
free space tree metadata.

Without proper dump for the subpage bitmaps it's much harder to debug.

Thus this patch would dump all the subpage bitmaps (split them into
their own bitmaps) for a easier debugging.

The output would look like this:
(Dumped after a tree block got read from disk)

page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9
memcg:ffff0000d7d62000
aops:btree_aops [btrfs] ino:1
flags: 0x8000000000002002(referenced|private|zone=2)
page_type: 0xffffffff()
raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0
raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000
page dumped because: btrfs subpage dump
BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked=

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1d126800 24-May-2023 David Sterba <dsterba@suse.com>

btrfs: drop gfp from parameter extent state helpers

Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.

Net effect on module on a release build:

text data bss dec hex filename
1250432 20985 16088 1287505 13a551 pre/btrfs.ko
1247074 20985 16088 1284147 139833 post/btrfs.ko

DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>


# 46672a44 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: merge write_one_subpage_eb into write_one_eb

Most of the code in write_one_subpage_eb and write_one_eb is shared,
so merge the two functions into one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d7172f52 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: use per-buffer locking for extent_buffer reading

Instead of locking and unlocking every page or the extent, just add a
new EXTENT_BUFFER_READING bit that mirrors EXTENT_BUFFER_WRITEBACK
for synchronizing threads trying to read an extent_buffer and to wait
for I/O completion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f3d315eb 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't check for uptodate pages in read_extent_buffer_pages

The only place that reads in pages and thus marks them uptodate for
the btree inode is read_extent_buffer_pages. Which means that either
pages are already uptodate from an old buffer when creating a new
one in alloc_extent_buffer, or they will be updated by ca call
to read_extent_buffer_pages. This means the checks for uptodate
pages in read_extent_buffer_pages and read_extent_buffer_subpage are
superfluous and can be removed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 011134f4 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: stop using PageError for extent_buffers

PageError is only used to limit the uptodate check in
assert_eb_page_uptodate. But we have a much more useful flag indicating
the exact condition we are about with the EXTENT_BUFFER_WRITE_ERR flag,
so use that instead and help the kernel toward eventually removing
PageError.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 113fa05c 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the io_pages field in struct extent_buffer

No need to track the number of pages under I/O now that each
extent_buffer is read and written using a single bio. For the
read side we need to grab an extra reference for the duration of
the I/O to prevent eviction, though.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cd88a4fd 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: use a separate end_io handler for extent_buffer writing

Now that we always use a single bio to write an extent_buffer, the buffer
can be passed to the end_io handler as private data. This allows
to simplify the metadata write end I/O handler, and merge the subpage
end_io handler into the main one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b51e6b4b 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't use btrfs_bio_ctrl for extent buffer writing

The btrfs_bio_ctrl machinery is overkill for writing extent_buffers
as we always operate on PAGE_SIZE chunks (or one smaller one for the
subpage case) that are contiguous and are guaranteed to fit into a
single bio. Replace it with open coded btrfs_bio_alloc, __bio_add_page
and btrfs_submit_bio calls.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81a79b6a 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: move page locking from lock_extent_buffer_for_io to write_one_eb

Locking the pages in lock_extent_buffer_for_io only for the non-subpage
case is very confusing. Move it to write_one_eb to mirror the subpage
case and simplify the code. Now lock_extent_buffer_for_io does not leave
all the pages locked and each is individually locked/unlocked in
write_one_eb.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 50b21d7a 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: submit a writeback bio per extent_buffer

Stop trying to cluster writes of multiple extent_buffers into a single
bio. There is no need for that as the blk_plug mechanism used all the
way up in writeback_inodes_wb gives us the same I/O pattern even with
multiple bios. Removing the clustering simplifies
lock_extent_buffer_for_io a lot and will also allow passing the eb
as private data to the end I/O handler.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9fdd1601 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: return bool from lock_extent_buffer_for_io

lock_extent_buffer_for_io never returns a negative error value, so switch
the return value to a simple bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d66b4b2 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: do not try to unlock the extent for non-subpage metadata reads

Only subpage metadata reads lock the extent. Don't try to unlock it and
waste cycles in the extent tree lookup for PAGE_SIZE or larger metadata.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 046b562b 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: use a separate end_io handler for read_extent_buffer

Now that we always use a single bio to read an extent_buffer, the buffer
can be passed to the end_io handler as private data. This allows
implementing a much simplified dedicated end I/O handler for metadata
reads.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# e1949310 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the mirror_num argument to btrfs_submit_compressed_read

Given that read recovery for data I/O is handled in the storage layer,
the mirror_num argument to btrfs_submit_compressed_read is always 0,
so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b78b98e0 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't use btrfs_bio_ctrl for extent buffer reading

The btrfs_bio_ctrl machinery is overkill for reading extent_buffers
as we always operate on PAGE_SIZE chunks (or one smaller one for the
subpage case) that are contiguous and are guaranteed to fit into a
single bio. Replace it with open coded btrfs_bio_alloc, __bio_add_page
and btrfs_submit_bio calls in a helper function shared between
the subpage and node size >= PAGE_SIZE cases.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e9538283 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: always read the entire extent_buffer

Currently read_extent_buffer_pages skips pages that are already uptodate
when reading in an extent_buffer. While this reduces the amount of data
read, it increases the number of I/O operations as we now need to do
multiple I/Os when reading an extent buffer with one or more uptodate
pages in the middle of it. On any modern storage device, be that hard
drives or SSDs this actually decreases I/O performance. Fortunately
this case is pretty rare as the pages are always initially read together
and then aged the same way. Besides simplifying the code a bit as-is
this will allow for major simplifications to the I/O completion handler
later on.

Note that the case where all pages are uptodate is still handled by an
optimized fast path that does not read any data from disk.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 243984b3 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: subpage: fix error handling in end_bio_subpage_eb_writepage

Call btrfs_page_clear_uptodate instead of ClearPageUptodate to properly
manage the uptodate bit for the subpage case.

Reported-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7f26fb1c 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: mark extent_buffer_under_io static

extent_buffer_under_io is only used in extent_io.c, so mark it static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f880fe6e 08-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't hold an extra reference for redirtied buffers

When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit. But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).

Remove the extra reference and the infrastructure used to drop it
again.

History behind redirty logic:

In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>


# f18cc978 08-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: fix dirty_metadata_bytes for redirtied buffers

dirty_metadata_bytes is decremented in both places that clear the dirty
bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which
means that a buffer that is redirtied using btrfs_redirty_list_add won't
be added to dirty_metadata_bytes, but it will be subtracted when written
out, leading an inconsistency in the counter.

Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into
set_extent_buffer_dirty to also account for the redirty case, and remove
the now unused set_extent_buffer_dirty return value.

Fixes: d3575156f662 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4317ff00 23-Mar-2023 Qu Wenruo <wqu@suse.com>

btrfs: introduce btrfs_bio::fs_info member

Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios

However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.

Thus here we do the following changes:

- Introduce btrfs_bio::fs_info
This is for the new scrub specific btrfs_bio, which would not populate
btrfs_bio::inode.
Thus we need such new member to grab a fs_info

This new member will always be populated.

- Replace @inode argument with @fs_info for btrfs_bio_init() and its
caller
Since @inode is no longer a mandatory member, replace it with
@fs_info, and let involved users populate @inode.

- Skip checksum verification and generation if @bbio->inode is NULL

- Add extra ASSERT()s
To make sure:

* bbio->inode is properly set for involved read repair path
* if @file_offset is set, bbio->inode is also populated

- Grab @fs_info from @bbio directly
We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
NULL. This involves:

* btrfs_simple_end_io()
* should_async_write()
* btrfs_wq_submit_bio()
* btrfs_use_zone_append()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3480373e 26-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs, block: move REQ_CGROUP_PUNT to btrfs

REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly. Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0a0596fb 26-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs, mm: remove the punt_to_cgroup field in struct writeback_control

punt_to_cgroup is only used by extent_write_locked_range, but that
function also directly controls the bio flags for the actual submission.
Remove th punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
in extent_write_locked_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# b41bbd29 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: return a btrfs_bio from btrfs_bio_alloc

Return the containing struct btrfs_bio instead of the less type safe
struct bio from btrfs_bio_alloc.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9dfde1b4 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: store a pointer to a btrfs_bio in struct btrfs_bio_ctrl

The bio in struct btrfs_bio_ctrl must be a btrfs_bio, so store a pointer
to the btrfs_bio for better type checking.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# d733ea01 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: simplify finding the inode in submit_one_bio

struct btrfs_bio now has an always valid inode pointer that can be used
to find the inode in submit_one_bio, so use that and initialize all
variables for which it is possible at declaration time.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 690834e4 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: pass a btrfs_bio to btrfs_submit_compressed_read

btrfs_submit_compressed_read expects the bio passed to it to be embedded
into a btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# ae42a154 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: pass a btrfs_bio to btrfs_submit_bio

btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 198bd49e 21-Feb-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: sink calc_bio_boundaries into its only caller

Nowadays calc_bio_boundaries() is a relatively simple function that only
guarantees the one bio equals to one ordered extent rule for uncompressed
Zone Append bios.

Sink it into it's only caller alloc_new_bio().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 24e6c808 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: simplify main loop in submit_extent_page

bio_add_page always adds either the entire range passed to it or nothing.
Based on that btrfs_bio_add_page can only return a length smaller than
the passed in one when hitting the ordered extent limit, which can only
happen for writes. Given that compressed writes never even use this code
path, this means that all the special cases for compressed extent offset
handling are dead code.

Reflow submit_extent_page to take advantage of this by inlining
btrfs_bio_add_page and handling the ordered extent limit by decrementing
it for each added range and thus significantly simplifying the loop.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 78a2ef1b 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: check for contiguity in submit_extent_page

Different loop iterations in btrfs_bio_add_page not only have the same
contiguity parameters, but also any non-initial operation operates on a
fresh bio anyway.

Factor out the contiguity check into a new btrfs_bio_is_contig and only
call it once in submit_extent_page before descending into the
bio_add_page loop.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5380311f 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: simplify the error handling in __extent_writepage_io

Remove the has_error and saved_ret variables, and just jump to a goto
label for error handling from the only place returning an error from the
main loop.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55173337 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the submit_extent_page return value

submit_extent_page always returns 0 since commit d5e4377d5051 ("btrfs:
split zone append bios in btrfs_submit_bio"). Change it to a void return
type and remove all the unreachable error handling code in the callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f8ed4852 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the compress_type argument to submit_extent_page

Update the compress_type in the btrfs_bio_ctrl after forcing out the
previous bio in btrfs_do_readpage, so that alloc_new_bio can just use
the compress_type member in struct btrfs_bio_ctrl instead of passing the
same information redundantly as a function argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a140453b 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: rename the this_bio_flag variable in btrfs_do_readpage

Rename this_bio_flag to compress_type to match the surrounding code
and better document the intent. Also use the proper enum type instead
of unsigned long.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c9bc621f 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: move the compress_type check out of btrfs_bio_add_page

The compress_type can only change on a per-extent basis. So instead of
checking it for every page in btrfs_bio_add_page, do the check once in
btrfs_do_readpage, which is the only caller of btrfs_bio_add_page and
submit_extent_page that deals with compressed extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 72b505dc 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: add a wbc pointer to struct btrfs_bio_ctrl

Instead of passing down the wbc pointer the deep call chain, just
add it to the btrfs_bio_ctrl structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 794c26e2 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the sync_io flag in struct btrfs_bio_ctrl

The sync_io flag is equivalent to wbc->sync_mode == WB_SYNC_ALL, so
just check for that and remove the separate flag.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c000bc04 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: store the bio opf in struct btrfs_bio_ctrl

The bio op and flags never change over the life time of a bio_ctrl,
so move it in there instead of passing it down the deep call chain
all the way down to alloc_new_bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb8d0c6d 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the force_bio_submit to submit_extent_page

If force_bio_submit, submit_extent_page simply calls submit_one_bio as
the first thing. This can just be moved to the only caller that sets
force_bio_submit to true.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 67998cf4 27-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't set force_bio_submit in read_extent_buffer_subpage

When read_extent_buffer_subpage calls submit_extent_page, it does
so on a freshly initialized btrfs_bio_ctrl structure that can't have
a valid bio to submit. Clear the force_bio_submit parameter to false
as there is nothing to submit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 544fe4a9 10-Feb-2023 Christoph Hellwig <hch@lst.de>

btrfs: embed a btrfs_bio into struct compressed_bio

Embed a btrfs_bio into struct compressed_bio. This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f50fd2e 04-Jan-2023 Vishal Moola (Oracle) <vishal.moola@gmail.com>

btrfs: convert extent_write_cache_pages() to use filemap_get_folios_tag()

Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). Now also supports large folios.

Link: https://lkml.kernel.org/r/20230104211448.4804-8-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 51c5cd3b 04-Jan-2023 Vishal Moola (Oracle) <vishal.moola@gmail.com>

btrfs: convert btree_write_cache_pages() to use filemap_get_folio_tag()

Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag().

Link: https://lkml.kernel.org/r/20230104211448.4804-7-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 921603c7 12-Dec-2022 Christoph Hellwig <hch@lst.de>

btrfs: pass a btrfs_bio to btrfs_use_append

struct btrfs_bio has all the information needed for btrfs_use_append, so
pass that instead of a btrfs_inode and file_offset.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d495430 12-Dec-2022 Christoph Hellwig <hch@lst.de>

btrfs: set bbio->file_offset in alloc_new_bio

Instead of digging into the bio_vec in submit_one_bio, set file_offset at
bio allocation time from the provided parameter. This also ensures that
the file_offset is available all the time when building up the bio
payload.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 72fcf1a4 12-Dec-2022 Christoph Hellwig <hch@lst.de>

btrfs: use file_offset to limit bios size in calc_bio_boundaries

btrfs_ordered_extent->disk_bytenr can be rewritten by the zoned I/O
completion handler, and thus in general is not a good idea to limit I/O
size. But the maximum bio size calculation can easily be done using the
file_offset fields in the btrfs_ordered_extent and btrfs_bio structures,
so switch to that instead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 98c8d683 26-Jan-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: combine btrfs_clear_buffer_dirty and clear_extent_buffer_dirty

btrfs_clear_buffer_dirty just does the test_clear_bit() and then calls
clear_extent_buffer_dirty and does the dirty metadata accounting.
Combine this into clear_extent_buffer_dirty and make the result
btrfs_clear_buffer_dirty.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f88fd650 26-Jan-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: do not increment dirty_metadata_bytes in set_btree_ioerr

We only add if we set the extent buffer dirty, and we subtract when we
clear the extent buffer dirty. If we end up in set_btree_ioerr we have
already cleared the buffer dirty, and we aren't resetting dirty on the
extent buffer, so this is simply wrong.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d5e4377d 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: split zone append bios in btrfs_submit_bio

The current btrfs zoned device support is a little cumbersome in the data
I/O path as it requires the callers to not issue I/O larger than the
supported ZONE_APPEND size of the underlying device. This leads to a lot
of extra accounting. Instead change btrfs_submit_bio so that it can take
write bios of arbitrary size and form from the upper layers, and just
split them internally to the ZONE_APPEND queue limits. Then remove all
the upper layer warts catering to limited write sized on zoned devices,
including the extra refcount in the compressed_bio.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 35a8d7da 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove now spurious bio submission helpers

Call btrfs_submit_bio and btrfs_submit_compressed_read directly from
submit_one_bio now that all additional functionality has moved into
btrfs_submit_bio.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2380220e 20-Jan-2023 Qu Wenruo <wqu@suse.com>

btrfs: remove stripe boundary calculation for buffered I/O

Remove btrfs_bio_ctrl::len_to_stripe_boundary, so that buffer
I/O will no longer limit its bio size according to stripe length
now that btrfs_submit_bio can split bios at stripe boundaries.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: simplify calc_bio_boundaries a little more]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 69ccf3f4 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: handle recording of zoned writes in the storage layer

Move the code that splits the ordered extents and records the physical
location for them to the storage layer so that the higher level consumers
don't have to care about physical block numbers at all. This will also
allow to eventually remove accounting for the zone append write sizes in
the upper layer with a little bit more block layer work.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0571b635 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove the io_failure_record infrastructure

struct io_failure_record and the io_failure_tree tree are unused now,
so remove them. This in turn makes struct btrfs_inode smaller by 16
bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7609afac 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: handle checksum validation and repair at the storage layer

Currently btrfs handles checksum validation and repair in the end I/O
handler for the btrfs_bio. This leads to a lot of duplicate code
plus issues with varying semantics or bugs, e.g.

- the until recently broken repair for compressed extents
- the fact that encoded reads validate the checksums but do not kick
of read repair
- the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag

This commit revamps the checksum validation and repair code to instead
work below the btrfs_submit_bio interfaces.

In case of a checksum failure (or a plain old I/O error), the repair
is now kicked off before the upper level ->end_io handler is invoked.

Progress of an in-progress repair is tracked by a small structure
that is allocated using a mempool for each original bio with failed
sectors, which holds a reference to the original bio. This new
structure is allocated using a mempool to guarantee forward progress
even under memory pressure. The mempool will be replenished when
the repair completes, just as the mempools backing the bios.

There is one significant behavior change here: If repair fails or
is impossible to start with, the whole bio will be failed to the
upper layer. This is the behavior that all I/O submitters except
for buffered I/O already emulated in their end_io handler. For
buffered I/O this now means that a large readahead request can
fail due to a single bad sector, but as readahead errors are ignored
the following readpage if the sector is actually accessed will
still be able to read. This also matches the I/O failure handling
in other file systems.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7276aa7d 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: save the bio iter for checksum validation in common code

All callers of btrfs_submit_bio that want to validate checksums
currently have to store a copy of the iter in the btrfs_bio. Move
the assignment into common code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d0e5cb2b 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: add a btrfs_inode pointer to struct btrfs_bio

All btrfs_bio I/Os are associated with an inode. Add a pointer to that
inode, which will allow to simplify a lot of calling conventions, and
which will be needed in the I/O completion path in the future.

This grow the btrfs_bio structure by a pointer, but that grows will
be offset by the removal of the device pointer soon.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 519b7e13 23-Jan-2023 Filipe Manana <fdmanana@suse.com>

btrfs: lock the inode in shared mode before starting fiemap

Currently fiemap does not take the inode's lock (VFS lock), it only locks
a file range in the inode's io tree. This however can lead to a deadlock
if we have a concurrent fsync on the file and fiemap code triggers a fault
when accessing the user space buffer with fiemap_fill_next_extent(). The
deadlock happens on the inode's i_mmap_lock semaphore, which is taken both
by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by
syzbot and triggers a trace like the following:

task:syz-executor361 state:D stack:20264 pid:5668 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
wait_on_state fs/btrfs/extent-io-tree.c:707 [inline]
wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751
lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742
find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488
writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863
__extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174
extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091
extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211
do_writepages+0x3c3/0x680 mm/page-writeback.c:2581
filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388
__filemap_fdatawrite_range mm/filemap.c:421 [inline]
filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439
btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline]
start_ordered_ops fs/btrfs/file.c:1737 [inline]
btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839
generic_write_sync include/linux/fs.h:2885 [inline]
btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684
call_write_iter include/linux/fs.h:2189 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x7dc/0xc50 fs/read_write.c:584
ksys_write+0x177/0x2a0 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9
RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006
RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8
</TASK>
INFO: task syz-executor361:5697 blocked for more than 145 seconds.
Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor361 state:D stack:21216 pid:5697 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095
__down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260
btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526
do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947
wp_page_shared+0x15e/0x380 mm/memory.c:3295
handle_pte_fault mm/memory.c:4949 [inline]
__handle_mm_fault mm/memory.c:5073 [inline]
handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219
do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428
handle_page_fault arch/x86/mm/fault.c:1519 [inline]
exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575
asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233
Code: 74 0a 89 (...)
RSP: 0018:ffffc9000570f330 EFLAGS: 00050202
RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007
RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120
RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83
R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038
R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0
copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline]
raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline]
_copy_to_user+0xe9/0x130 lib/usercopy.c:34
copy_to_user include/linux/uaccess.h:169 [inline]
fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144
emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458
fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716
extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922
btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209
ioctl_fiemap fs/ioctl.c:219 [inline]
do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810
__do_sys_ioctl fs/ioctl.c:868 [inline]
__se_sys_ioctl+0x83/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9
RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005
RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000
R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8
</TASK>

What happens is the following:

1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc
before locking the inode and the i_mmap_lock semaphore, that is, before
calling btrfs_inode_lock();

2) After task A flushes delalloc and before it calls btrfs_inode_lock(),
another task dirties a page;

3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied
at step 2 remains dirty and unflushed. Then when it enters
extent_fiemap() and it locks a file range that includes the range of
the page dirtied in step 2;

4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the
inode's i_mmap_lock semaphore in write mode. Then it tries to flush
delalloc by calling start_ordered_ops(), which will block, at
find_lock_delalloc_range(), when trying to lock the range of the page
dirtied at step 2, since this range was locked by the fiemap task (at
step 3);

5) Task B generates a page fault when accessing the user space fiemap
buffer with a call to fiemap_fill_next_extent().

The fault handler needs to call btrfs_page_mkwrite() for some other
page of our inode, and there we deadlock when trying to lock the
inode's i_mmap_lock semaphore in read mode, since the fsync task locked
it in write mode (step 4) and the fsync task can not progress because
it's waiting to lock a file range that is currently locked by us (the
fiemap task, step 3).

Fix this by taking the inode's lock (VFS lock) in shared mode when
entering fiemap. This effectively serializes fiemap with fsync (except the
most expensive part of fsync, the log sync), preventing this deadlock.

Reported-by: syzbot+cc35f55c41e34c30dcb5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000032dc7305f2a66f46@google.com/
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1d854e4f 28-Dec-2022 Qu Wenruo <wqu@suse.com>

btrfs: fix false alert on bad tree level check

[BUG]
There is a bug report that on a RAID0 NVMe btrfs system, under heavy
write load the filesystem can flip RO randomly.

With extra debugging, it shows some tree blocks failed to pass their
level checks, and if that happens at critical path of a transaction, we
abort the transaction:

BTRFS error (device nvme0n1p3): level verify failed on logical 5446121209856 mirror 1 wanted 0 found 1
BTRFS error (device nvme0n1p3: state A): Transaction aborted (error -5)
BTRFS: error (device nvme0n1p3: state A) in btrfs_finish_ordered_io:3343: errno=-5 IO failure
BTRFS info (device nvme0n1p3: state EA): forced readonly

[CAUSE]
The reporter has already bisected to commit 947a629988f1 ("btrfs: move
tree block parentness check into validate_extent_buffer()").

And with extra debugging, it shows we can have btrfs_tree_parent_check
filled with all zeros in the following call trace:

submit_one_bio+0xd4/0xe0
submit_extent_page+0x142/0x550
read_extent_buffer_pages+0x584/0x9c0
? __pfx_end_bio_extent_readpage+0x10/0x10
? folio_unlock+0x1d/0x50
btrfs_read_extent_buffer+0x98/0x150
read_tree_block+0x43/0xa0
read_block_for_search+0x266/0x370
btrfs_search_slot+0x351/0xd30
? lock_is_held_type+0xe8/0x140
btrfs_lookup_csum+0x63/0x150
btrfs_csum_file_blocks+0x197/0x6c0
? sched_clock_cpu+0x9f/0xc0
? lock_release+0x14b/0x440
? _raw_read_unlock+0x29/0x50
btrfs_finish_ordered_io+0x441/0x860
btrfs_work_helper+0xfe/0x400
? lock_is_held_type+0xe8/0x140
process_one_work+0x294/0x5b0
worker_thread+0x4f/0x3a0
? __pfx_worker_thread+0x10/0x10
kthread+0xf5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50

Currently we only copy the btrfs_tree_parent_check structure into bbio
at read_extent_buffer_pages() after we have assembled the bbio.

But as shown above, submit_extent_page() itself can already submit the
bbio, leaving the bbio->parent_check uninitialized, and cause the false
alert.

[FIX]
Instead of copying @check into bbio after bbio is assembled, we pass
@check in btrfs_bio_ctrl::parent_check, and copy the content of
parent_check in submit_one_bio() for metadata read.

By this we should be able to pass the needed info for metadata endio
verification, and fix the false alert.

Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABXGCsNzVxo4iq-tJSGm_kO1UggHXgq6CdcHDL=z5FL4njYXSQ@mail.gmail.com/
Fixes: 947a629988f1 ("btrfs: move tree block parentness check into validate_extent_buffer()")
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8009adf3 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove BTRFS_LEAF_DATA_OFFSET

This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so
remove this helper and replace it's usage with the above statement.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e23efd8e 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: add eb to btrfs_node_key_ptr_offset

This is a change needed for extent tree v2, as we will be growing the
header size. This exists in btrfs-progs currently, and not having it
makes syncing accessors.[ch] more problematic. So make this change to
set us up for extent tree v2 and match what btrfs-progs does to make
syncing easier.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 42c9419a 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: pass the extent buffer for the btrfs_item_nr helpers

This is actually a change for extent tree v2, but it exists in
btrfs-progs but not in the kernel. This makes it annoying to sync
accessors.h with btrfs-progs, and since this is the way I need it for
extent-tree v2 simply update these helpers to take the extent buffer in
order to make syncing possible now, and make the extent tree v2 stuff
easier moving forward.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3a3178c7 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move leaf_data_end into ctree.c

This is only used in ctree.c, with the exception of zero'ing out extent
buffers we're getting ready to write out. In theory we shouldn't have
an extent buffer with 0 items that we're writing out, however I'd rather
be safe than sorry so open code it in extent_io.c, and then copy the
helper into ctree.c. This will make it easier to sync accessors.[ch]
into btrfs-progs, as this requires a helper that isn't defined in
accessors.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bacf60e5 15-Nov-2022 Christoph Hellwig <hch@lst.de>

btrfs: move repair_io_failure to bio.c

repair_io_failure ties directly into all the glory low-level details of
mapping a bio with a logic address to the actual physical location.
Move it right below btrfs_submit_bio to keep all the related logic
together.

Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that
repair_io_failure is available in a header.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 103c1972 15-Nov-2022 Christoph Hellwig <hch@lst.de>

btrfs: split the bio submission path into a separate file

The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c. Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.

Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb3e217b 12-Nov-2022 Qu Wenruo <wqu@suse.com>

btrfs: use btrfs_dev_name() helper to handle missing devices better

[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:

BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)

Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".

[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).

Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.

[FIX]
Fix the output by:

- Move the btrfs_dev_name() helper to volumes.h

- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
Only zoned code is not touched, as I'm not familiar with degraded
zoned code.

- Constify return value and parameter

Now the output looks pretty sane:

BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b3e744fe 11-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: use cached state when looking for delalloc ranges with fiemap

During fiemap, whenever we find a hole or prealloc extent, we will look
for delalloc in that range, and one of the things we do for that is to
find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using
calls to count_range_bits().

Since we process file extents from left to right, if we have a file with
several holes or prealloc extents, we benefit from keeping a cached extent
state record for calls to count_range_bits(). Most of the time the last
extent state record we visited in one call to count_range_bits() matches
the first extent state record we will use in the next call to
count_range_bits(), so there's a benefit here. So use an extent state
record to cache results from count_range_bits() calls during fiemap.

This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:

1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2c8f5e8c 11-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree

We don't need to set the EXTENT_UPDATE bit in an inode's io_tree to mark a
range as uptodate, we rely on the pages themselves being uptodate - page
reading is not triggered for already uptodate pages. Recently we removed
most use of the EXTENT_UPTODATE for buffered IO with commit 52b029f42751
("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"),
but there were a few leftovers, namely when reading from holes and
successfully finishing read repair.

These leftovers are unnecessarily making an inode's tree larger and deeper,
slowing down searches on it. So remove all the leftovers.

This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:

1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 947a6299 13-Sep-2022 Qu Wenruo <wqu@suse.com>

btrfs: move tree block parentness check into validate_extent_buffer()

[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().

On the other hand, all the data verification is done at endio context.

[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.

With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.

And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.

This brings the following benefits:

- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.

- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e55cf7ca 27-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_add_delayed_iput

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d8f9268e 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_repair_one_sector

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c5ca391b 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to submit_one_bio

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d781c1c3 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_submit_dio_repair_bio

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b7620416 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_submit_data_read_bio

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 535a7e5d 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_submit_data_write_bio

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 644094fd 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode to btrfs_submit_metadata_bio

The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7920b773 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: drop parameter compression_type from btrfs_submit_dio_repair_bio

Compression and direct io don't work together so the compression
parameter can be dropped after previous patch that changed the call
to direct.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 19af6a7d 26-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: change how repair action is passed to btrfs_repair_one_sector

There's a function pointer passed to btrfs_repair_one_sector that will
submit the right bio for repair. However there are only two callbacks,
for buffered and for direct IO. This can be simplified to a bool-based
switch and call either function, indirect calls in this case is an
unnecessary abstraction. This allows to remove the submit_bio_hook_t
typedef.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ee5f017d 27-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: merge struct extent_page_data to btrfs_bio_ctrl

The two structures appear on the same call paths, btrfs_bio_ctrl is
embedded in extent_page_data and we pass bio_ctrl to some functions.
After merging there are fewer indirections and we have only one control
structure. The packing remains same.

The btrfs_bio_ctrl was selected as the target structure as the operation
is closer to bio processing.

Structure layout:

struct btrfs_bio_ctrl {
struct bio * bio; /* 0 8 */
int mirror_num; /* 8 4 */
enum btrfs_compression_type compress_type; /* 12 4 */
u32 len_to_stripe_boundary; /* 16 4 */
u32 len_to_oe_boundary; /* 20 4 */
btrfs_bio_end_io_t end_io_func; /* 24 8 */
bool extent_locked; /* 32 1 */
bool sync_io; /* 33 1 */

/* size: 40, cachelines: 1, members: 8 */
/* padding: 6 */
/* last cacheline: 40 bytes */
};

Signed-off-by: David Sterba <dsterba@suse.com>


# 8ec8519b 27-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: switch extent_page_data bit fields to bools

The semantics of the two members is a boolean, so change the type
accordingly. We have space in extent_page_data due to alignment there's
no change in size.

Signed-off-by: David Sterba <dsterba@suse.com>


# 7f0add25 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move super_block specific helpers into super.h

This will make syncing fs.h to user space a little easier if we can pull
the super block specific helpers out of fs.h and put them in super.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 77407dc0 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move dev-replace prototypes into dev-replace.h

We already have a dev-replace.h, simply move these prototypes and
helpers into dev-replace.h where they belong.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# af142b6f 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move file prototypes to file.h

Move these out of ctree.h into file.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7c8ede16 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move file-item prototypes into their own header

Move these prototypes out of ctree.h and into file-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 43dd529a 27-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: update function comments

Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.

Changes made:

- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos

Signed-off-by: David Sterba <dsterba@suse.com>


# 07e81dc9 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move accessor helpers into accessors.h

This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>


# ec8eb376 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h

We're going to use fs.h to hold fs wide related helpers and definitions,
move the FS_STATE enum and related helpers to fs.h, and then update all
files that need these definitions to include fs.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 48acc47d 14-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not use GFP_ATOMIC in the read endio

We have done read endio in an async thread for a very, very long time,
which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in
our read endio path. We've noticed under heavy memory pressure in our
fleet that we can fail these allocations, and then often trip a
BUG_ON(!allocation), which isn't an ideal outcome. Begin to address
this by simply not using GFP_ATOMIC, which will allow us to do things
like actually allocate a extent state when doing
set_extent_bits(UPTODATE) in the endio handler.

End io handlers are not called in atomic context, besides we have been
allocating failrec with GFP_NOFS so we'd notice there's a problem.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 877c1476 11-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: avoid duplicated resolution of indirect backrefs during fiemap

During fiemap, when determining if a data extent is shared or not, if we
don't find the extent is directly shared, then we need to determine if
it's shared through subtrees. For that we need to resolve the indirect
reference we found in order to figure out the path in the inode's fs tree,
which is a path starting at the fs tree's root node and going down to the
leaf that contains the file extent item that points to the data extent.
We then proceed to determine if any extent buffer in that path is shared
with other trees or not.

Currently whenever we find the data extent that a file extent item points
to is not directly shared, we always resolve the path in the fs tree, and
then check if any extent buffer in the path is shared. This is a lot of
work and when we have file extent items that belong to the same leaf, we
have the same path, so we only need to calculate it once.

This change does that, it keeps track of the current and previous leaf,
and when we find that a data extent is not directly shared, we try to
compute the fs tree path only once and then use it for every other file
extent item in the same leaf, using the existing cached path result for
the leaf as long as the cache results are valid.

This saves us from doing expensive b+tree searches in the fs tree of our
target inode, as well as other minor work.

The following test was run on a non-debug kernel (Debian's default kernel
config):

$ cat test-with-snapshots.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3

# Create a snapshot so all the extents become indirectly shared
# through subtrees, with a generation less than or equals to the
# generation used to create the snapshot.
btrfs subvolume snapshot -r $MNT $MNT/snap1

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
echo

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT

Result before applying this patch:

(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1204 milliseconds (metadata not cached)

/mnt/sdi/foobar: 327680 extents found
fiemap took 729 milliseconds (metadata cached)

Result after applying this patch:

(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 732 milliseconds (metadata not cached)

/mnt/sdi/foobar: 327680 extents found
fiemap took 421 milliseconds (metadata cached)

That's a -46.1% total reduction for the metadata not cached case, and
a -42.2% reduction for the cached metadata case.

The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 84a7949d 11-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: move ulists to data extent sharedness check context

When calling btrfs_is_data_extent_shared() we pass two ulists that were
allocated by the caller. This is because the single caller, fiemap, calls
btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
instead of allocating new ones before each call and freeing them after
each call.

Now that we have a context structure/object that we pass to
btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
their allocation and the context's allocation in a helper function, as
well as the freeing of the ulists and the context object. This allows to
reduce the number of parameters passed to btrfs_is_data_extent_shared(),
the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
and having the caller deal with allocating and releasing the ulists.

Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
that's a much better name as it reflects what the list is used for (and
matching the argument name for find_parent_nodes()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 61dbb952 11-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: turn the backref sharedness check cache into a context object

Right now we are using a struct btrfs_backref_shared_cache to pass state
across multiple btrfs_is_data_extent_shared() calls. The structure's name
closely follows its current purpose, which is to cache previous checks
for the sharedness of metadata extents. However we will start using the
structure for more things other than caching sharedness checks, so rename
it to struct btrfs_backref_share_check_ctx.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ceb707da 11-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: directly pass the inode to btrfs_is_data_extent_shared()

Currently we pass a root and an inode number as arguments for
btrfs_is_data_extent_shared() and the inode number is always from an
inode that belongs to that root (it wouldn't make sense otherwise).
In every context that we call btrfs_is_data_extent_shared() (fiemap only),
we have an inode available, so directly pass the inode to the function
instead of a root and inode number. This reduces the number of parameters
and it makes the function's signature conform to most other functions we
have.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 206c1d32 11-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: drop redundant bflags initialization when allocating extent buffer

When allocating an extent buffer, at __alloc_extent_buffer(), there's no
point in explicitly assigning zero to the bflags field of the new extent
buffer because we allocated it with kmem_cache_zalloc().

So just remove the redundant initialization, it saves one mov instruction
in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b98c6cd5 11-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: drop pointless memset when cloning extent buffer

At btrfs_clone_extent_buffer(), before allocating the pages array for the
new extent buffer we are calling memset() to zero out the pages array of
the extent buffer. This is pointless however, because the extent buffer
already has every element in its pages array pointing to NULL, as it was
allocated with kmem_cache_zalloc(). The memset() was introduced with
commit dd137dd1f2d719 ("btrfs: factor out allocating an array of pages"),
but even before that commit we already depended on the pages array being
initialized to NULL for the error paths that need to call
btrfs_release_extent_buffer().

So remove the memset(), it's useless and slightly increases the object
text size.

Before this change:

$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70580 5469 40 76089 12939 fs/btrfs/extent_io.o

After this change:

$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70564 5469 40 76073 12929 fs/btrfs/extent_io.o

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e5e886ba 30-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: add cached_state to read_extent_buffer_subpage

We don't use a cached state here at all, which generally makes sense as
async reads are going to unlock at endio time. However for blocking
reads we will call wait_extent_bit() for our range. Since the
lock_extent() stuff will return the cached_state for the start of the
range this is a helpful optimization to have for this case, we'll have
the exact state we want to wait on. Add a cached state here and simply
throw it away if we're a non-blocking read, otherwise we'll get a small
improvement by eliminating some tree searches.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 123a7f00 30-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: cache the failed state when locking extents

Currently if we fail to lock a range we'll return the start of the range
that we failed to lock. We'll then search down to this range and wait
on any extent states in this range.

However we can avoid this search altogether if we simply cache the
extent_state that had the contention. We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search. In the most optimistic case we can avoid all searches, more
likely we'll avoid the initial search and have to perform the search
after we wait on the failed state, or worst case we must search both
times which is what currently happens.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 83ae4133 30-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: add a cached_state to try_lock_extent

With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 04c6b79a 23-Aug-2022 Vishal Moola (Oracle) <vishal.moola@gmail.com>

btrfs: convert __process_pages_contig() to use filemap_get_folios_contig()

Convert to use folios throughout. This is in preparation for the removal
of find_get_pages_contig(). Now also supports large folios.

Since we may receive more than nr_pages pages, nr_pages may underflow.
Since nr_pages > 0 is equivalent to index <= end_index, we replaced it
with this check instead.

Link: https://lkml.kernel.org/r/20220824004023.77310-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 5467abba 12-Sep-2022 Qu Wenruo <wqu@suse.com>

btrfs: move end_io_func argument to btrfs_bio_ctrl structure

For function submit_extent_page() and alloc_new_bio(), we have an
argument @end_io_func to indicate the end io function.

But that function never change inside any call site of them, thus no
need to pass the pointer around everywhere.

There is a better match for the lifespan of all the call sites, as we
have btrfs_bio_ctrl structure, thus we can put the endio function
pointer there, and grab the pointer every time we allocate a new bio.

Also add extra ASSERT()s to make sure every call site of
submit_extent_page() and alloc_new_bio() has properly set the pointer
inside btrfs_bio_ctrl.

This removes one argument from the already long argument list of
submit_extent_page().

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 209ecde5 12-Sep-2022 Qu Wenruo <wqu@suse.com>

btrfs: switch page and disk_bytenr argument position for submit_extent_page()

Normally we put (page, pg_len, pg_offset) arguments together, just like
what __bio_add_page() does.

But in submit_extent_page(), what we got is, (page, disk_bytenr, pg_len,
pg_offset), which sometimes can be confusing.

Change the order to (disk_bytenr, page, pg_len, pg_offset) to make it
to follow the common schema.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 814b6f91 12-Sep-2022 Qu Wenruo <wqu@suse.com>

btrfs: update the comment for submit_extent_page()

Since commit 390ed29b817e ("btrfs: refactor submit_extent_page() to make
bio and its flag tracing easier"), we are using bio_ctrl structure to
replace some of arguments of submit_extent_page().

But unfortunately that commit didn't update the comment for
submit_extent_page(), thus some arguments are stale like:

- bio_ret
- mirror_num
Those are all contained in bio_ctrl now.

- prev_bio_flags
We no longer use this flag to determine if we can merge bios.

Update the comment for submit_extent_page() to keep it up-to-date.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ee8ba05c 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: open code and remove btrfs_inode_sectorsize helper

This is defined in btrfs_inode.h, and dereferences btrfs_root and
btrfs_fs_info, both of which aren't defined in btrfs_inode.h.
Additionally, in many places we already have root or fs_info, so this
helper often makes the code harder to read. So delete the helper and
simply open code it in the few places that we use it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd86a532 07-Sep-2022 Christoph Hellwig <hch@lst.de>

btrfs: stop tracking failed reads in the I/O tree

There is a separate I/O failure tree to track the fail reads, so remove
the extra EXTENT_DAMAGED bit in the I/O tree as it's set but never used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd015294 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS

Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b71fb16b 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: don't clear CTL bits when trying to release extent state

When trying to release the extent states due to memory pressure we'll
set all the bits except LOCKED, NODATASUM, and DELALLOC_NEW. This
includes some of the CTL bits, which isn't really a problem but isn't
correct either. Exclude the CTL bits from this clearing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 570eb97b 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: unify the lock/unlock extent variants

We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not. This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dbbf4992 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove the wake argument from clear_extent_bits

This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e3974c66 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move core extent_io_tree functions to extent-io-tree.c

This is still huge, but unfortunately I cannot make it smaller without
renaming tree_search() and changing all the callers to use the new name,
then moving those chunks and then changing the name back. This feels
like too much churn for code movement, so I've limited this to only
things that called tree_search(). With this patch all of the
extent_io_tree code is now in extent-io-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38830018 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move a few exported extent_io_tree helpers to extent-io-tree.c

These are the last few helpers that do not rely on tree_search() and
who's other helpers are exported and in extent-io-tree.c already. Move
these across now in order to make the core move smaller.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 04eba893 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: temporarily export and then move extent state helpers

In order to avoid moving all of the related code at once temporarily
export all of the extent state related helpers. Then move these helpers
into extent-io-tree.c. We will clean up the exports and make them
static in followup patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 91af24e4 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: temporarily export and move core extent_io_tree tree functions

A lot of the various internals of extent_io_tree call these two
functions for insert or searching the rb tree for entries, so
temporarily export them and then move them to extent-io-tree.c. We
can't move tree_search() without renaming it, and I don't want to
introduce a bunch of churn just to do that, so move these functions
first and then we can move a few big functions and then the remaining
users of tree_search().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6962541e 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_debug_check_extent_io_range into extent-io-tree.c

This helper is used by a lot of the core extent_io_tree helpers, so
temporarily export it and move it into extent-io-tree.c in order to make
it straightforward to migrate the helpers in batches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ec39e39b 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: export wait_extent_bit

This is used by the subpage code in addition to lock_extent_bits, so
export it so we can move it out of extent_io.c

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a6631887 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move simple extent bit helpers out of extent_io.c

These are just variants and wrappers around the actual work horses of
the extent state. Extract these out of extent_io.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ad795329 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: convert BUG_ON(EXTENT_BIT_LOCKED) checks to ASSERT's

We only call these functions from the qgroup code which doesn't call
with EXTENT_BIT_LOCKED. These are BUG_ON()'s that exist to keep us
developers from using these functions with EXTENT_BIT_LOCKED, so convert
them to ASSERT()'s.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 83cf709a 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move extent state init and alloc functions to their own file

Start cleaning up extent_io.c by moving the extent state code out of it.
This patch starts with the extent state allocation code and the
extent_io_tree init code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c45379a2 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: temporarily export alloc_extent_state helpers

We're going to move this code in stages, but while we're doing that we
need to export these helpers so we can more easily move the code into
the new file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a40246e8 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: separate out the eb and extent state leak helpers

Currently we have the add/del functions generic so that we can use them
for both extent buffers and extent states. We want to separate this
code however, so separate these helpers into per-object helpers in
anticipation of the split.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a62a3bd9 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: separate out the extent state and extent buffer init code

In order to help separate the extent buffer from the extent io tree code
we need to break up the init functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cdca85b0 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: use find_first_extent_bit in btrfs_clean_io_failure

Currently we're using find_first_extent_bit_state to check if our state
contains the given failrec range, however this is more of an internal
extent_io_tree helper, and is technically unsafe to use because we're
accessing the state outside of the extent_io_tree lock.

Instead use the normal helper find_first_extent_bit which returns the
range of the extent state we find in find_first_extent_bit_state and use
that to do our sanity checking.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 87c11705 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: convert the io_failure_tree to a plain rb_tree

We still have this oddity of stashing the io_failure_record in the
extent state for the io_failure_tree, which is leftover from when we
used to stuff private pointers in extent_io_trees.

However this doesn't make a lot of sense for the io failure records, we
can simply use a normal rb_tree for this. This will allow us to further
simplify the extent_io_tree code by removing the io_failure_rec pointer
from the extent state.

Convert the io_failure_tree to an rb tree + spinlock in the inode, and
then use our rb tree simple helpers to insert and find failed records.
This greatly cleans up this code and makes it easier to separate out the
extent_io_tree code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a2061748 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: unexport internal failrec functions

These are internally used functions and are not used outside of
extent_io.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d0a762c 09-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: rename clean_io_failure and remove extraneous args

This is exported, so rename it to btrfs_clean_io_failure. Additionally
we are passing in the io tree's and such from the inode, so instead of
doing all that simply pass in the inode itself and get all the
components we need directly inside of btrfs_clean_io_failure.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ac3c0d36 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: make fiemap more efficient and accurate reporting extent sharedness

The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.

To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:

1) Before entering the main loop, we call get_extent_skip_holes() to get
the first extent map. This leads us to btrfs_get_extent_fiemap(), which
in turn calls btrfs_get_extent(), to find the first extent map that
covers the file range [0, LLONG_MAX).

btrfs_get_extent() will first search the inode's extent map tree, to
see if we have an extent map there that covers the range. If it does
not find one, then it will search the inode's subvolume b+tree for a
fitting file extent item. After finding the file extent item, it will
allocate an extent map, fill it in with information extracted from the
file extent item, and add it to the inode's extent map tree (which
requires a search for insertion in the tree).

2) Then we enter the main loop at extent_fiemap(), emit the details of
the extent, and call again get_extent_skip_holes(), with a start
offset matching the end of the extent map we previously processed.

We end up at btrfs_get_extent() again, will search the extent map tree
and then search the subvolume b+tree for a file extent item if we could
not find an extent map in the extent tree. We allocate an extent map,
fill it in with the details in the file extent item, and then insert
it into the extent map tree (yet another search in this tree).

3) The second step is repeated over and over, until we have processed the
whole file range. Each iteration ends at btrfs_get_extent(), which
does a red black tree search on the extent map tree, then searches the
subvolume b+tree, allocates an extent map and then does another search
in the extent map tree in order to insert the extent map.

In the best scenario we have all the extent maps already in the extent
tree, and so for each extent we do a single search on a red black tree,
so we have a complexity of O(n log n).

In the worst scenario we don't have any extent map already loaded in
the extent map tree, or have very few already there. In this case the
complexity is much higher since we do:

- A red black tree search on the extent map tree, which has O(log n)
complexity, initially very fast since the tree is empty or very
small, but as we end up allocating extent maps and adding them to
the tree when we don't find them there, each subsequent search on
the tree gets slower, since it's getting bigger and bigger after
each iteration.

- A search on the subvolume b+tree, also O(log n) complexity, but it
has items for all inodes in the subvolume, not just items for our
inode. Plus on a filesystem with concurrent operations on other
inodes, we can block doing the search due to lock contention on
b+tree nodes/leaves.

- Allocate an extent map - this can block, and can also fail if we
are under serious memory pressure.

- Do another search on the extent maps red black tree, with the goal
of inserting the extent map we just allocated. Again, after every
iteration this tree is getting bigger by 1 element, so after many
iterations the searches are slower and slower.

- We will not need the allocated extent map anymore, so it's pointless
to add it to the extent map tree. It's just wasting time and memory.

In short we end up searching the extent map tree multiple times, on a
tree that is growing bigger and bigger after each iteration. And
besides that we visit the same leaf of the subvolume b+tree many times,
since a leaf with the default size of 16K can easily have more than 200
file extent items.

This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.

Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.

This reproducer triggers that bug:

$ cat fiemap-bug.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

mkfs.btrfs -f $DEV
mount $DEV $MNT

# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo

# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo

umount $MNT

Running the reproducer:

$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.

This patch is part of a larger patchset that is comprised of the following
patches:

btrfs: allow hole and data seeking to be interruptible
btrfs: make hole and data seeking a lot more efficient
btrfs: remove check for impossible block start for an extent map at fiemap
btrfs: remove zero length check when entering fiemap
btrfs: properly flush delalloc when entering fiemap
btrfs: allow fiemap to be interruptible
btrfs: rename btrfs_check_shared() to a more descriptive name
btrfs: speedup checking for extent sharedness during fiemap
btrfs: skip unnecessary extent buffer sharedness checks during fiemap
btrfs: make fiemap more efficient and accurate reporting extent sharedness

The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.

The following test for a large compressed file without holes:

$ cat fiemap-perf-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT

Before patchset:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)

After patchset:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1214 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 684 milliseconds (metadata cached)

That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).

The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:

$ cat pavels-test.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>

#include <sys/stat.h>
#include <sys/time.h>
#include <sys/ioctl.h>

#include <linux/fs.h>
#include <linux/fiemap.h>

#define FILE_INTERVAL (1<<13) /* 8Kb */

long long interval(struct timeval t1, struct timeval t2)
{
long long val = 0;
val += (t2.tv_usec - t1.tv_usec);
val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
return val;
}

int main(int argc, char **argv)
{
struct fiemap fiemap = {};
struct timeval t1, t2;
char data = 'a';
struct stat st;
int fd, off, file_size = FILE_INTERVAL;

if (argc != 3 && argc != 2) {
printf("usage: %s <path> [size]\n", argv[0]);
return 1;
}

if (argc == 3)
file_size = atoi(argv[2]);
if (file_size < FILE_INTERVAL)
file_size = FILE_INTERVAL;
file_size -= file_size % FILE_INTERVAL;

fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
perror("open");
return 1;
}

for (off = 0; off < file_size; off += FILE_INTERVAL) {
if (pwrite(fd, &data, 1, off) != 1) {
perror("pwrite");
close(fd);
return 1;
}
}

if (ftruncate(fd, file_size)) {
perror("ftruncate");
close(fd);
return 1;
}

if (fstat(fd, &st) < 0) {
perror("fstat");
close(fd);
return 1;
}

printf("size: %ld\n", st.st_size);
printf("actual size: %ld\n", st.st_blocks * 512);

fiemap.fm_length = FIEMAP_MAX_OFFSET;
gettimeofday(&t1, NULL);
if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
perror("fiemap");
close(fd);
return 1;
}
gettimeofday(&t2, NULL);

printf("fiemap: fm_mapped_extents = %d\n",
fiemap.fm_mapped_extents);
printf("time = %lld us\n", interval(t1, t2));

close(fd);
return 0;
}

$ gcc -o pavels_test pavels_test.c

And the wrapper shell script:

$ cat fiemap-pavels-test.sh

#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f -O no-holes $DEV
mount $DEV $MNT

echo
echo "*********** 256M ***********"
echo

./pavels-test $MNT/testfile $((1 << 28))
echo
./pavels-test $MNT/testfile $((1 << 28))

echo
echo "*********** 512M ***********"
echo

./pavels-test $MNT/testfile $((1 << 29))
echo
./pavels-test $MNT/testfile $((1 << 29))

echo
echo "*********** 1G ***********"
echo

./pavels-test $MNT/testfile $((1 << 30))
echo
./pavels-test $MNT/testfile $((1 << 30))

umount $MNT

Running his reproducer before applying the patchset:

*********** 256M ***********

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4003133 us

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4895330 us

*********** 512M ***********

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 30123675 us

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 33450934 us

*********** 1G ***********

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 224924074 us

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 217239242 us

Running it after applying the patchset:

*********** 256M ***********

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29475 us

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29307 us

*********** 512M ***********

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 58996 us

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 59115 us

*********** 1G ***********

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 116251
time = 124141 us

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 119387 us

The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).

For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.

For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.

For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.

Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b8f164e3 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: skip unnecessary extent buffer sharedness checks during fiemap

During fiemap, for each file extent we find, we must check if it's shared
or not. The sharedness check starts by verifying if the extent is directly
shared (its refcount in the extent tree is > 1), and if it is not directly
shared, then we will check if every node in the subvolume b+tree leading
from the root to the leaf that has the file extent item (in reverse order),
is shared (through snapshots).

However this second step is not needed if our extent was created in a
transaction more recent than the last transaction where a snapshot of the
inode's root happened, because it can't be shared indirectly (through
shared subtrees) without a snapshot created in a more recent transaction.

So grab the generation of the extent from the extent map and pass it to
btrfs_is_data_extent_shared(), which will skip this second phase when the
generation is more recent than the root's last snapshot value. Note that
we skip this optimization if the extent map is the result of merging 2
or more extent maps, because in this case its generation is the maximum
of the generations of all merged extent maps.

The fact the we use extent maps and they can be merged despite the
underlying extents being distinct (different file extent items in the
subvolume b+tree and different extent items in the extent b+tree), can
result in some bugs when reporting shared extents. But this is a problem
of the current implementation of fiemap relying on extent maps.
One example where we get incorrect results is:

$ cat fiemap-bug.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

mkfs.btrfs -f $DEV
mount $DEV $MNT

# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo

# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo

umount $MNT

Running the reproducer:

$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.

This is z problem that existed before this change, and remains after this
change, as it can't be easily fixed. The next patch in the series reworks
fiemap to primarily use file extent items instead of extent maps (except
for checking for delalloc ranges), with the goal of improving its
scalability and performance, but it also ends up fixing this particular
bug caused by extent map merging.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 12a824dc 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: speedup checking for extent sharedness during fiemap

One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent
item in the extent tree, checking delayed references, etc. If we
find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a
refcount of 1), then it may be shared if we have snapshots that share
subtrees of the inode's subvolume b+tree. So we check if the leaf
containing the file extent item is shared, then its parent node, then
the parent node of the parent node, etc, until we reach the root node
or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.

For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.

Example performance test:

$ cat fiemap-perf-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT

Before this patch:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)

After this patch:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)

That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.

Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8eedadda 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: rename btrfs_check_shared() to a more descriptive name

The function btrfs_check_shared() is supposed to be used to check if a
data extent is shared, but its name is too generic, may easily cause
confusion in the sense that it may be used for metadata extents.

So rename it to btrfs_is_data_extent_shared(), which will also make it
less confusing after the next change that adds a backref lookup cache for
the b+tree nodes that lead to the leaf that contains the file extent item
that points to the target data extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 09fbc1c8 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: allow fiemap to be interruptible

Doing fiemap on a file with a very large number of extents can take a very
long time, and we have reports of it being too slow (two recent examples
in the Link tags below), so make it interruptible.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9a42bbae 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: remove zero length check when entering fiemap

There's no point to check for a 0 length at extent_fiemap(), as before
calling it, we called fiemap_prep() at btrfs_fiemap(), which already
checks for a zero length and returns the same -EINVAL error. So remove
the pointless check.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f12eec9a 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: remove check for impossible block start for an extent map at fiemap

During fiemap we are testing if an extent map has a block start with a
value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
and never was according to git history. So remove that useless check.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 917f32a2 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: give struct btrfs_bio a real end_io handler

Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io
handler and bi_private pointer of the embedded struct bio are both used
to handle the completion of the high-level btrfs_bio and for the I/O
completion for the low-level device that the embedded bio ends up being
sent to.

To support this bi_end_io and bi_private are saved into the
btrfs_io_context structure and then restored after the bio sent to the
underlying device has completed the actual I/O.

Untangle this by adding an end I/O handler and private data to struct
btrfs_bio for the high-level btrfs_bio based completions, and leave the
actual bio bi_end_io handler and bi_private pointer entirely to the
low-level device I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b42f5e3 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: pass the operation to btrfs_bio_alloc

Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
set does.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d45cfb88 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: move btrfs_bio allocation to volumes.c

volumes.c is the place that implements the storage layer using the
btrfs_bio structure, so move the bio_set and allocation helpers there
as well.

To make up for the new initialization boilerplate, merge the two
init/exit helpers in extent_io.c into a single one.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1e408af3 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: don't create integrity bioset for btrfs_bioset

btrfs never uses bio integrity data itself, so don't allocate
the integrity pools for btrfs_bioset.

This patch is a revert of the commit b208c2f7ceaf ("btrfs: Fix crash due
to not allocating integrity data for a set"). The integrity data pool
is not needed, the bio-integrity code now handles allocating the
integrity payload without that.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 52b029f4 18-Aug-2022 Ethan Lien <ethanlien@synology.com>

btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path

After we copied data to page cache in buffered I/O, we
1. Insert a EXTENT_UPTODATE state into inode's io_tree, by
endio_readpage_release_extent(), set_extent_delalloc() or
set_extent_defrag().
2. Set page uptodate before we unlock the page.

But the only place we check io_tree's EXTENT_UPTODATE state is in
btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.

For example, when performing a buffered random read:

fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
--filesize=32G --size=4G --bs=4k --name=job \
--filename=/mnt/file --name=job

Then check how many extent_state in io_tree:

cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'

w/o this patch, we got 640567 btrfs_extent_state.
w/ this patch, we got 204 btrfs_extent_state.

Maintaining such a big tree brings overhead since every I/O needs to insert
EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
every insert or remove, we need to lock io_tree, do tree search, alloc or
dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
io_tree in a minimal size and reduce overhead when performing buffered I/O.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e5677f05 09-Aug-2022 Uros Bizjak <ubizjak@gmail.com>

btrfs: use atomic_try_cmpxchg in free_extent_buffer

Use `atomic_try_cmpxchg(ptr, &old, new)` instead of
`atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This
has two benefits:

- The x86 cmpxchg instruction returns success in the ZF flag, so this
change saves a compare after cmpxchg, as well as a related move
instruction in the front of cmpxchg.

- atomic_try_cmpxchg implicitly assigns the *ptr value to &old when
cmpxchg fails, enabling further code simplifications.

This patch has no functional change.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4a445b7b 13-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: don't merge pages into bio if their page offset is not contiguous

[BUG]
Zygo reported on latest development branch, he could hit
ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
corrupt one disk, and let btrfs to recover the data during read/scrub).

And The following minimal reproducer can cause extent state leakage at
rmmod time:

mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
mount $dev1 $mnt
fsstress -w -d $mnt -n 25 -s 1660807876
sync
fssum -A -f -w /tmp/fssum.saved $mnt
umount $mnt

# Wipe the dev1 but keeps its super block
xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
mount $dev1 $mnt
fssum -r /tmp/fssum.saved $mnt > /dev/null
umount $mnt
rmmod btrfs

This will lead to the following extent states leakage:

BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1

[CAUSE]
Since commit 7aa51232e204 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.

But that usage assume that, one bio has all its page having a continuous
page offsets.

Unfortunately that's not true, btrfs only requires the logical bytenr
contiguous when assembling its bios.

From above script, we have one bio looks like this:

fssum-27671 submit_one_bio: bio logical=217739264 len=36864
fssum-27671 submit_one_bio: r/i=5/261 page_offset=466944 <<<
fssum-27671 submit_one_bio: r/i=5/261 page_offset=724992 <<<
fssum-27671 submit_one_bio: r/i=5/261 page_offset=729088
fssum-27671 submit_one_bio: r/i=5/261 page_offset=733184
fssum-27671 submit_one_bio: r/i=5/261 page_offset=737280
fssum-27671 submit_one_bio: r/i=5/261 page_offset=741376
fssum-27671 submit_one_bio: r/i=5/261 page_offset=745472
fssum-27671 submit_one_bio: r/i=5/261 page_offset=749568
fssum-27671 submit_one_bio: r/i=5/261 page_offset=753664

Note that the 1st and the 2nd page has non-contiguous page offsets.

This means, at repair time, we will have completely wrong file offset
passed in:

kworker/u32:2-19927 btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192

Since the file offset is incorrect, we latter incorrectly set the extent
states, and no way to really release them.

Thus later it causes the leakage.

In fact, this can be even worse, since the file offset is incorrect, we
can hit cases like the incorrect file offset belongs to a HOLE, and
later cause btrfs_num_copies() to trigger error, finally hit
BUG_ON()/ASSERT() later.

[FIX]
Add an extra condition in btrfs_bio_add_page() for uncompressed IO.

Now we will have more strict requirement for bio pages:

- They should all have the same mapping
(the mapping check is already implied by the call chain)

- Their logical bytenr should be adjacent
This is the same as the old condition.

- Their page_offset() (file offset) should be adjacent
This is the new check.
This would result a slightly increased amount of bios from btrfs
(needs holes and inside the same stripe boundary to trigger).

But this would greatly reduce the confusion, as it's pretty common
to assume a btrfs bio would only contain continuous page cache.

Later we may need extra cleanups, as we no longer needs to handle gaps
between page offsets in endio functions.

Currently this should be the minimal patch to fix commit 7aa51232e204
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232e204 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b40130b2 26-Jul-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: fix lockdep splat with reloc root extent buffers

We have been hitting the following lockdep splat with btrfs/187 recently

WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30

-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

other info that might help us debug this:

Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(btrfs-tree-01/1);
lock(btrfs-tree-01);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02#2);

*** DEADLOCK ***

7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:

dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of

reloc tree -> normal tree

for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.

However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of

normal tree -> reloc tree

which is why lockdep complains.

Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.

Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.

This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of

normal tree -> reloc tree

We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.

With this patch we no longer have the lockdep splat in btrfs/187.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81bd9328 06-Jul-2022 Christoph Hellwig <hch@lst.de>

btrfs: fix repair of compressed extents

Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.

This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case that the I/O submission got
preempted, a general loss of if an error was an I/O error or a checksum
verification failure, but most importantly that this design causes
btrfs_clean_io_failure to eventually write back the uncompressed good
data onto the disk sectors that are supposed to contain compressed data.

Fix this by moving the repair to the lower btrfs_bio. To do so, a fair
amount of code has to be reshuffled:

a) the lower btrfs_bio now needs a valid csum pointer. The easiest way
to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
the btrfs_bio management of csums. For a compressed_bio that is
split into multiple btrfs_bios this means additional memory
allocations, but the code becomes a lot more regular.
b) checksum verification now runs directly on the lower btrfs_bio instead
of the compressed_bio. This actually nicely simplifies the end I/O
processing.
c) btrfs_repair_one_sector can't just look up the logical address for
the file offset any more, as there is no corresponding relative
offsets that apply to the file offset and the logic address for
compressed extents. Instead require that the saved bvec_iter in the
btrfs_bio is filled out for all read bios and use that, which again
removes a fair amount of code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7aa51232 06-Jul-2022 Christoph Hellwig <hch@lst.de>

btrfs: pass a btrfs_bio to btrfs_repair_one_sector

Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
and remove the start and failed_mirror arguments in favor of deriving
them from the btrfs_bio. For this to work ensure that the file_offset
field is also initialized for buffered I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c144c63f 06-Jul-2022 Christoph Hellwig <hch@lst.de>

btrfs: repair all known bad mirrors

When there is more than a single level of redundancy there can also be
multiple bad mirrors, and the current read repair code only repairs the
last bad one.

Restructure btrfs_repair_one_sector so that it records the originally
failed mirror and the number of copies, and then repair all known bad
copies until we reach the originally failed copy in clean_io_failure.
Note that this also means the read repair reads will always start from
the next bad mirror and not mirror 0.

This fixes btrfs/265 in xfstests.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7b12a62 08-Jul-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size

On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.

The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.

The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).

[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---

To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.

Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f3e90c1c 21-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: remove extent writepage address space operation

Same as in commit 21b4ee7029c9 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.

We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9db33891 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: unify tree search helper returning prev and next nodes

Simplify helper to return only next and prev pointers, we don't need all
the node/parent/prev/next pointers of __etree_search as there are now
other specialized helpers. Rename parameters so they follow the naming.

Signed-off-by: David Sterba <dsterba@suse.com>


# ec60c76f 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: make tree search for insert more generic and use it for tree_search

With a slight extension of tree_search_for_insert (fill the return node
and parent return parameters) we can avoid calling __etree_search from
tree_search, that could be removed eventually in followup patches.

Signed-off-by: David Sterba <dsterba@suse.com>


# bebb22c1 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: open code inexact rbtree search in tree_search

The call chain from

tree_search
tree_search_for_insert
__etree_search

can be open coded and allow further simplifications, here we need a tree
search with fallback to the next node in case it's not found. This is
represented as __etree_search parameters next_ret=valid, prev_ret=NULL.

Signed-off-by: David Sterba <dsterba@suse.com>


# c367602a 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: remove node and parent parameters from insert_state

There's no caller left that would pass valid pointers to insert_state so
we can drop them.

Signed-off-by: David Sterba <dsterba@suse.com>


# fb8f07d2 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: add fast path for extent_state insertion

In two cases the exact location where to insert the extent state is
known at the call time so we don't need to pass it to insert_state that
takes the fast path.

Signed-off-by: David Sterba <dsterba@suse.com>


# 6d92b304 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: pass bits by value not by pointer for extent_state helpers

The bits are passed to all extent state helpers for no apparent reason,
the value only read and never updated so remove the indirection and pass
it directly. Also unify the type to u32 where needed.

Signed-off-by: David Sterba <dsterba@suse.com>


# cee51268 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: lift start and end parameters to callers of insert_state

Let callers of insert_state to set up the extent state to allow further
simplifications of the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# c7e118cf 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: open code rbtree search in insert_state

The rbtree search is a known pattern and can be open coded, allowing to
remove the tree_insert and further cleanups.

Signed-off-by: David Sterba <dsterba@suse.com>


# 12c9cdda 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: open code rbtree search in split_state

Preparatory work to remove tree_insert from extent_io.c, the rbtree
search loop is a known and simple so it can be open coded.

Signed-off-by: David Sterba <dsterba@suse.com>


# 722c82ac 03-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: pass the btrfs_bio_ctrl to submit_one_bio

submit_one_bio always works on the bio and compression flags from a
btrfs_bio_ctrl structure. Pass the explicitly and clean up the
calling conventions by handling a NULL bio in submit_one_bio, and
using the btrfs_bio_ctrl to pass the mirror number as well.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9845e5dd 03-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: merge end_write_bio and flush_write_bio

Merge end_write_bio and flush_write_bio into a single submit_write_bio
helper, that either submits the bio or ends it if a negative errno was
passed in. This consolidates a lot of duplicated checks in the callers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2d5ac130 03-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: don't use bio->bi_private to pass the inode to submit_one_bio

submit_one_bio is only used for page cache I/O, so the inode can be
trivially derived from the first page in the bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9ff7ddd3 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: do not allocate a btrfs_bio for low-level bios

The bios submitted from btrfs_map_bio don't really interact with the
rest of btrfs and the only btrfs_bio member actually used in the
low-level bios is the pointer to the btrfs_io_context used for endio
handler.

Use a union in struct btrfs_io_stripe that allows the endio handler to
find the btrfs_io_context and remove the spurious ->device assignment
so that a plain fs_bio_set bio can be used for the low-level bios
allocated inside btrfs_map_bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 08a6f464 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: centralize setting REQ_META

Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
We'll start relying on this flag inside of btrfs in a bit, and this
ensures it is always set correctly.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c93104e7 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: split btrfs_submit_data_bio to read and write parts

Split btrfs_submit_data_bio into one helper for reads and one for writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 21a8935e 01-Jun-2022 David Sterba <dsterba@suse.com>

btrfs: remove redundant calls to flush_dcache_page

Both memzero_page and memcpy_to_page already call flush_dcache_page so
we can remove the calls from btrfs code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 97861cd1 22-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: refactor end_bio_extent_readpage code flow

Untangle the goto and move the code it jumps to so it goes in the order
of the most likely states first.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# a5aa7ab6 22-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: factor out a helper to end a single sector buffer I/O

Add a helper to end I/O on a single sector, which will come in handy
with the new read repair code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fd5a6f63 22-May-2022 Qu Wenruo <wqu@suse.com>

btrfs: remove duplicated parameters from submit_data_read_repair()

The function submit_data_read_repair() is only called for buffered data
read path, thus those members can be calculated using bvec directly:

- start
start = page_offset(bvec->bv_page) + bvec->bv_offset;

- end
end = start + bvec->bv_len - 1;

- page
page = bvec->bv_page;

- pgoff
pgoff = bvec->bv_offset;

Thus we can safely replace those 4 parameters with just one bio_vec.

Also remove the unused return value.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: also remove the return value]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1280d2d1 26-May-2022 Fanjun Kong <bh1scw@gmail.com>

btrfs: use PAGE_ALIGNED instead of IS_ALIGNED

The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
use it instead of IS_ALIGNED and passing PAGE_SIZE directly.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bf9486d6 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

fs/btrfs: Use the enum req_op and blk_opf_t types

Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.

Acked-by: David Sterba <dsterba@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 01cd3909 15-Jul-2022 David Sterba <dsterba@suse.com>

Revert "btrfs: turn fs_info member buffer_radix into XArray"

This reverts commit 8ee922689d67b7cfa6acbe2aa1ee76ac72e6fc8a.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 19ab78ca 07-Jun-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix critical section of relocation inode writeback

We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
write out to the relocation inode. That critical section must include all
the IO submission for the inode. However, flush_write_bio() in
extent_writepages() is out of the critical section, causing an IO
submission outside of the lock. This leads to an out of the order IO
submission and fail the relocation process.

Fix it by extending the critical section.

Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 56fbb0a4 03-May-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: properly finish block group on metadata write

Commit be1a1d7a5d24 ("btrfs: zoned: finish fully written block group")
introduced zone finishing code both for data and metadata end_io path.
However, the metadata side is not working as it should. First, it
compares logical address (eb->start + eb->len) with offset within a
block group (cache->zone_capacity) in submit_eb_page(). That essentially
disabled zone finishing on metadata end_io path.

Furthermore, fixing the issue above revealed we cannot call
btrfs_zone_finish_endio() in end_extent_buffer_writeback(). We cannot
call btrfs_lookup_block_group() which require spin lock inside end_io
context.

Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer
writeback and do the zone finish IO in a workqueue.

Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used.

Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0f07003b 27-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: rename bio_ctrl::bio_flags to compress_type

The bio_ctrl is the last use of bio_flags that has been converted to
compress type everywhere else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb3a12d9 27-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: rename bio_flags in parameters and switch type

Several functions take parameter bio_flags that was simplified to just
compress type, unify it and change the type accordingly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0ff40013 27-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: rename io_failure_record::bio_flags to compress_type

The bio_flags is now used to store unchanged compress type, so unify
that.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7f6ca7f2 27-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: open code extent_set_compress_type helpers

The helpers extent_set_compress_type and extent_compress_type have
become trivial after previous cleanups and can be removed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2a5232a8 27-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: simplify handling of bio_ctrl::bio_flags

The bio_flags are used only to encode the compression and there are no
other EXTENT_BIO_* flags, so the compress type can be stored directly.
The struct member name is left unchanged and will be cleaned in later
patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 572f3dad 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: remove trivial helper update_nr_written

The helper used to do more with the wbc state but now it's just one
subtraction, no need to have a special helper.

It became trivial in a91326679f2a ("Btrfs: make mapping->writeback_index
point to the last written page").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8ee92268 21-Apr-2022 Gabriel Niebler <gniebler@suse.com>

btrfs: turn fs_info member buffer_radix into XArray

… named 'extent_buffers'. Also adjust all usages of this object to use
the XArray API, which greatly simplifies the code as it takes care of
locking and is generally easier to use and understand, providing
notionally simpler array semantics.

Also perform some light refactoring.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# abf48d58 15-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bio

This argument is unused since commit 953651eb308f ("btrfs: factor out
helper adding a page to bio") and commit 1b36294a6cd5 ("btrfs: call
submit_bio_hook directly for metadata pages") reworked the way metadata
bio submission is handled.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7aab8b32 15-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: move btrfs_readpage to extent_io.c

Keep btrfs_readpage next to btrfs_do_readpage and the other address
space operations. This allows to keep submit_one_bio and
struct btrfs_bio_ctrl file local in extent_io.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 44e5801f 12-Apr-2022 Qu Wenruo <wqu@suse.com>

btrfs: return correct error number for __extent_writepage_io()

[BUG]
If we hit an error from submit_extent_page() inside
__extent_writepage_io(), we could still return 0 to the caller, and
even trigger the warning in btrfs_page_assert_not_dirty().

[CAUSE]
In __extent_writepage_io(), if we hit an error from
submit_extent_page(), we will just clean up the range and continue.

This is completely fine for regular PAGE_SIZE == sectorsize, as we can
only hit one sector in one page, thus after the error we're ensured to
exit and @ret will be saved.

But for subpage case, we may have other dirty subpage range in the page,
and in the next loop, we may succeeded submitting the next range.

In that case, @ret will be overwritten, and we return 0 to the caller,
while we have hit some error.

[FIX]
Introduce @has_error and @saved_ret to record the first error we hit, so
we will never forget what error we hit.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 10f7f6f8 12-Apr-2022 Qu Wenruo <wqu@suse.com>

btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage()

[BUG]
Test case generic/475 have a very high chance (almost 100%) to hit a fs
hang, where a data page will never be unlocked and hang all later
operations.

[CAUSE]
In btrfs_do_readpage(), if we hit an error from submit_extent_page() we
will try to do the cleanup for our current io range, and exit.

This works fine for PAGE_SIZE == sectorsize cases, but not for subpage.

For subpage btrfs_do_readpage() will lock the full page first, which can
contain several different sectors and extents:

btrfs_do_readpage()
|- begin_page_read()
| |- btrfs_subpage_start_reader();
| Now the page will have PAGE_SIZE / sectorsize reader pending,
| and the page is locked.
|
|- end_page_read() for different branches
| This function will reduce subpage readers, and when readers
| reach 0, it will unlock the page.

But when submit_extent_page() failed, we only cleanup the current
io range, while the remaining io range will never be cleaned up, and the
page remains locked forever.

[FIX]
Update the error handling of submit_extent_page() to cleanup all the
remaining subpage range before exiting the loop.

Please note that, now submit_extent_page() can only fail due to
sanity check in alloc_new_bio().

Thus regular IO errors are impossible to trigger the error path.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c9583ada 12-Apr-2022 Qu Wenruo <wqu@suse.com>

btrfs: avoid double clean up when submit_one_bio() failed

[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) to hang, with mostly data page locked but
no one is going to unlock it.

[CAUSE]
With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads"), if we failed to lookup checksum due to metadata IO error, we
will return error for btrfs_submit_data_bio().

This will cause the page to be unlocked twice in btrfs_do_readpage():

btrfs_do_readpage()
|- submit_extent_page()
| |- submit_one_bio()
| |- btrfs_submit_data_bio()
| |- if (ret) {
| |- bio->bi_status = ret;
| |- bio_endio(bio); }
| In the endio function, we will call end_page_read()
| and unlock_extent() to cleanup the subpage range.
|
|- if (ret) {
|- unlock_extent(); end_page_read() }
Here we unlock the extent and cleanup the subpage range
again.

For unlock_extent(), it's mostly double unlock safe.

But for end_page_read(), it's not, especially for subpage case,
as for subpage case we will call btrfs_subpage_end_reader() to reduce
the reader number, and use that to number to determine if we need to
unlock the full page.

If double accounted, it can underflow the number and leave the page
locked without anyone to unlock it.

[FIX]
The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine, it's our existing code not properly
handling the error from bio submission hook properly.

This patch will make submit_one_bio() to return void so that the callers
will never be able to do cleanup when bio submission hook fails.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c0111c44 20-Mar-2022 Qu Wenruo <wqu@suse.com>

btrfs: simplify parameters of submit_read_repair() and rename

Cleanup the function submit_read_repair() by:

- Remove the fixed argument submit_bio_hook()
The function is only called on buffered data read path, so the
@submit_bio_hook argument is always btrfs_submit_data_bio().

Since it's fixed, then there is no need to pass that argument at all.

- Rename the function to submit_data_read_repair()
Just to be more explicit on all the 3 things, data, read and repair.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 110ac0e5 03-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: pass a block_device to btrfs_bio_clone

Pass the block_device to bio_alloc_clone instead of setting it later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e9458bfe 03-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: use on-stack bio in repair_io_failure

The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio. Also cleanup the error handling to use goto
labels and not discard the actual return values.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 58ff51f1 03-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: check-integrity: split submit_bio from btrfsic checking

Require a separate call to the integrity checking helpers from the
actual bio submission.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 395cb57e 06-Apr-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: wait between incomplete batch memory allocations

When allocating memory in a loop, each iteration should call
memalloc_retry_wait() in order to prevent starving memory-freeing
processes (and to mark where allocation loops are). Other filesystems do
that as well.

The bulk page allocation is the only place in btrfs with an allocation
retry loop, so add an appropriate call to it.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 91d6ac1d 30-Mar-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: allocate page arrays using bulk page allocator

While calling alloc_page() in a loop is an effective way to populate an
array of pages, the MM subsystem provides a method to allocate pages in
bulk. alloc_pages_bulk_array() populates the NULL slots in a page
array, trying to grab more than one page at a time.

Unfortunately, it doesn't guarantee allocating all slots in the array,
but it's easy to call it in a loop and return an error if no progress
occurs.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dd137dd1 30-Mar-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: factor out allocating an array of pages

Several functions currently populate an array of page pointers one
allocated page at a time. Factor out the common code so as to allow
improvements to all of the sites at once.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fbca46eb 12-Jan-2022 Qu Wenruo <wqu@suse.com>

btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine

The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.

When other page size come, especially like 16K, the limitation is a bit
limiting.

To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.

Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.

Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.

Please note that, this patch will not yet enable other page size support
yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b95b78e6 15-Mar-2022 Qu Wenruo <wqu@suse.com>

btrfs: warn when extent buffer leak test fails

Although we have btrfs_extent_buffer_leak_debug_check() (enabled by
CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have
some extent buffer leakage, it's just pr_err(), not noisy enough for
fstests to cache.

So here we trigger a WARN_ON() if the allocated_ebs list is not empty.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f913cff3 30-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: Convert to release_folio

I've only converted the outer layers of the btrfs release_folio paths
to use folios; the use of folios should be pushed further down into
btrfs from here.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>


# 00d82525 24-Mar-2022 Christoph Hellwig <hch@lst.de>

btrfs: fix direct I/O read repair for split bios

When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio. But this means the start value used
in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add
a file_offset field to struct btrfs_bio to pass along the correct offset.

Given that check_data_csum only uses start of an error message this
means problems with this miscalculation will only show up when I/O fails
or checksums mismatch.

The logic was removed in f4f39fc5dc30 ("btrfs: remove btrfs_bio::logical
member") but we need it due to the bio splitting.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 50f1cff3 24-Mar-2022 Christoph Hellwig <hch@lst.de>

btrfs: fix and document the zoned device choice in alloc_new_bio

Zone Append bios only need a valid block device in struct bio, but
not the device in the btrfs_bio. Use the information from
btrfs_zoned_get_device to set up bi_bdev and fix zoned writes on
multi-device file system with non-homogeneous capabilities and remove
the pointless btrfs_bio.device assignment.

Add big fat comments explaining what is going on here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# c75e707f 04-Mar-2022 Christoph Hellwig <hch@lst.de>

block: remove the per-bio/request write hint

With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ebf55c88 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: Convert extent_range_redirty_for_io() to use folios

This removes a call to __set_page_dirty_nobuffers().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# 895586eb 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: Convert from invalidatepage to invalidate_folio

A lot of the underlying infrastructure in btrfs needs to be switched
over to folios, but this at least documents that invalidatepage can't
be passed a tail page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# 8e1dec8e 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: Use folio_invalidate()

Instead of calling ->invalidatepage directly, use folio_invalidate().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# d3e29967 07-Mar-2022 Nikolay Borisov <nborisov@suse.com>

btrfs: zoned: put block group after final usage

It's counter-intuitive (and wrong) to put the block group _before_ the
final usage in submit_eb_page. Fix it by re-ordering the call to
btrfs_put_block_group after its final reference. Also fix a minor typo
in 'implies'

Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8cbc3001 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not clean up repair bio if submit fails

The submit helper will always run bio_endio() on the bio if it fails to
submit, so cleaning up the bio just leads to a variety of use-after-free
and NULL pointer dereference bugs because we race with the endio
function that is cleaning up the bio. Instead just return BLK_STS_OK as
the repair function has to continue to process the rest of the pages,
and the endio for the repair bio will do the appropriate cleanup for the
page that it was given.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 510671d2 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not try to repair bio that has no mirror set

If we fail to submit a bio for whatever reason, we may not have setup a
mirror_num for that bio. This means we shouldn't try to do the repair
workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in
clean_io_failure. Instead simply skip the repair workflow if we have no
mirror set, and add an assert to btrfs_check_repairable() to make it
easier to catch what is happening in the future.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ad3fc794 03-Feb-2022 Filipe Manana <fdmanana@suse.com>

btrfs: remove no longer used counter when reading data page

After commit 92082d40976ed0 ("btrfs: integrate page status update for
data read path into begin/end_page_read"), the 'nr' counter at
btrfs_do_readpage() is no longer used, we increment it but we never
read from it. So just remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bbf0ea7e 03-Feb-2022 Filipe Manana <fdmanana@suse.com>

btrfs: fix lost error return value when reading a data page

At btrfs_do_readpage(), if we get an error when trying to lookup for an
extent map, we end up marking the page with the error bit, clearing
the uptodate bit on it, and doing everything else that should be done.
However we return success (0) to the caller, when we should return the
error encoded in the extent map pointer. So fix that by returning the
error encoded in the pointer.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c0347550 03-Feb-2022 Filipe Manana <fdmanana@suse.com>

btrfs: stop checking for NULL return from btrfs_get_extent()

At extent_io.c, in the read page and write page code paths, we are testing
if the return value from btrfs_get_extent() can be NULL. However that is
not possible, as btrfs_get_extent() always returns either an error pointer
or a (non-NULL) pointer to an extent map structure.

Everywhere else outside extent_io.c we never check for NULL, we always
treat any returned value as a non-NULL pointer if it does not encode an
error.

So check only for the IS_ERR() case at extent_io.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b5b7a41 04-Feb-2022 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: stop checking for NULL return from btrfs_get_extent_fiemap()

In get_extent_skip_holes() we're checking the return of
btrfs_get_extent_fiemap() for an error pointer or NULL, but
btrfs_get_extent_fiemap() will never return NULL, only error pointers or
a valid extent_map.

The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
correctly only checks for error-pointers.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# abfc426d 02-Feb-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device to bio_clone_fast

Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 609be106 24-Jan-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device and opf to bio_alloc_bioset

Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a50e1fcb 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not WARN_ON() if we have PageError set

Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week

WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:

extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30

This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer
uptodate when we fail to write it"), however all that fix did was keep
us from finding extent buffers after a failed writeout. It didn't keep
us from continuing to use a buffer that we already had found.

In this case we're searching the commit root to cache the block group,
so we can start committing the transaction and switch the commit root
and then start writing. After the switch we can look up an extent
buffer that hasn't been written yet and start processing that block
group. Then we fail to write that block out and clear Uptodate on the
page, and then we start spewing these errors.

Normally we're protected by the tree lock to a certain degree here. If
we read a block we have that block read locked, and we block the writer
from locking the block before we submit it for the write. However this
isn't necessarily fool proof because the read could happen before we do
the submit_bio and after we locked and unlocked the extent buffer.

Also in this particular case we have path->skip_locking set, so that
won't save us here. We'll simply get a block that was valid when we
read it, but became invalid while we were using it.

What we really want is to catch the case where we've "read" a block but
it's not marked Uptodate. On read we ClearPageError(), so if we're
!Uptodate and !Error we know we didn't do the right thing for reading
the page.

Fix this by checking !Uptodate && !Error, this way we will not complain
if our buffer gets invalidated while we're using it, and we'll maintain
the spirit of the check which is to make sure we have a fully in-cache
block while we're messing with it.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0a4ee518 21-Jan-2022 Christoph Hellwig <hch@lst.de>

mm: remove cleancache

Patch series "remove Xen tmem leftovers".

Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap. This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.

This patch (of 13):

The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dcd0 ("xen: remove tmem driver").

[akpm@linux-foundation.org: remove now-unreachable code]

Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be8d1a2a 20-Dec-2021 Yang Li <yang.lee@linux.alibaba.com>

btrfs: fix argument list that the kdoc format and script verified

The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.

fs/btrfs/extent_io.c:3210: warning: Function parameter or member
'bio_ctrl' not described in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
description in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter
'prev_bio_flags' description in 'btrfs_bio_add_page'
fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
description in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1602: warning: Function parameter or member
'fs_info' not described in 'btrfs_reserve_metadata_bytes'

Note: this is fixing only the warnings regarding parameter list, the
first line is not strictly conforming to the kdoc format as the btrfs
codebase does not stick to that and keeps the first line more free form
(because it's only for internal use).

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>


# f26c9238 14-Dec-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove reada infrastructure

Currently there is only one user for btrfs metadata readahead, and
that's scrub.

But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide. (Although it's pretty easy to add such
feature).

Despite this, there are some extra problems related to metadata
readahead:

- Duplicated feature with btrfs_path::reada

- Partly duplicated feature of btrfs_fs_info::buffer_radix
Btrfs already caches its metadata in buffer_radix, while readahead
tries to read the tree block no matter if it's already cached.

- Poor layer separation
Metadata readahead works kinda at device level.
This is definitely not the correct layer it should be, since metadata
is at btrfs logical address space, it should not bother device at all.

This brings extra chance for bugs to sneak in, while brings
unnecessary complexity.

- Dead code
In the very beginning of scrub.c we have #undef DEBUG, rendering all
the debug related code useless and unable to test.

Thus here I purpose to remove the metadata readahead mechanism
completely.

[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.

For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.

The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.

Old New Diff
SSD 455.3 466.332 +2.42%
HDD 103.927 98.012 -5.69%

Comprehensive test methodology is in the cover letter of the patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 73672710 07-Dec-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned

REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to
check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the
bio's bio_op.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 554aed7d 07-Dec-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: sink zone check into btrfs_repair_one_zone

Sink zone check into btrfs_repair_one_zone() so we don't need to do it
in all callers.

Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
a boolean function and return false in case it got called on a non-zoned
filesystem and true on a zoned filesystem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 869f4cdc 07-Dec-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: encapsulate inode locking for zoned relocation

Encapsulate the inode lock needed for serializing the data relocation
writes on a zoned filesystem into a helper.

This streamlines the code reading flow and hides special casing for
zoned filesystems.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 83f1b680 11-Nov-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove unnecessary @nr_written parameters

We use @nr_written to record how many pages have been started by
btrfs_run_delalloc_range().

Currently there are only two cases that would populate @nr_written:

- Inline extent creation
- Compressed write

But both cases will also set @page_started to one.

In fact, in writepage_delalloc() we have the following code, showing
that @nr_written is really only utilized for above two cases:

/* did the fill delalloc function already unlock and start
* the IO?
*/
if (page_started) {
/*
* we've unlocked the page, so we can't update
* the mapping's writeback index, just update
* nr_to_write.
*/
wbc->nr_to_write -= nr_written;
return 1;
}

But for such cases, writepage_delalloc() will return 1, and exit
__extent_writepage() without going through __extent_writepage_io().

Thus this means, inside __extent_writepage_io(), we always get
@nr_written as 0.

So this patch is going to remove the unnecessary parameter from the
following functions:

- writepage_delalloc()

As @nr_written passed in is always the initial value 0.

Although inside that function, we still need a local @nr_written
to update wbc->nr_to_write.

- __extent_writepage_io()

As explained above, @nr_written passed in can only be 0.

This also means we can remove one update_nr_written() call.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 651740a5 13-Dec-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: check WRITE_ERR when trying to read an extent buffer

Filipe reported a hang when we have errors on btrfs. This turned out to
be a side-effect of my fix c2e39305299f01 ("btrfs: clear extent buffer
uptodate when we fail to write it") which made it so we clear
EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out.

Below is a paste of Filipe's analysis he got from using drgn to debug
the hang

"""
btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to
a value while writeback of all pages has not yet completed:
--> writeback for the first 3 pages finishes, we clear
EXTENT_BUFFER_UPTODATE from eb on the first page when we get an
error.
--> at this point eb->io_pages is 1 and we cleared Uptodate bit from the
first 3 pages
--> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so
it continues, it's able to lock the pages since we obviously don't
hold the pages locked during writeback
--> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets
eb->io_pages to 3, since only the first page does not have Uptodate
bit set at this point
--> writeback for the remaining page completes, we ended decrementing
eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore
never calling end_extent_buffer_writeback(), so
EXTENT_BUFFER_WRITEBACK remains in the eb's flags
--> of course, when the read bio completes, it doesn't and shouldn't
call end_extent_buffer_writeback()
--> we should clear EXTENT_BUFFER_UPTODATE only after all pages of
the eb finished writeback? or maybe make the read pages code
wait for writeback of all pages of the eb to complete before
checking which pages need to be read, touch ->io_pages, submit
read bio, etc

writeback bit never cleared means we can hang when aborting a
transaction, at:

btrfs_cleanup_one_transaction()
btrfs_destroy_marked_extents()
wait_on_extent_buffer_writeback()
"""

This is a problem because our writes are not synchronized with reads in
any way. We clear the UPTODATE flag and then we can easily come in and
try to read the EB while we're still waiting on other bio's to
complete.

We have two options here, we could lock all the pages, and then check to
see if eb->io_pages != 0 to know if we've already got an outstanding
write on the eb.

Or we can simply check to see if we have WRITE_ERR set on this extent
buffer. We set this bit _before_ we clear UPTODATE, so if the read gets
triggered because we aren't UPTODATE because of a write error we're
guaranteed to have WRITE_ERR set, and in this case we can simply return
-EIO. This will fix the reported hang.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 68b85589 24-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: call mapping_set_error() on btree inode with a write error

generic/484 fails sometimes with compression on because the write ends
up small enough that it goes into the btree. This means that we never
call mapping_set_error() on the inode itself, because the page gets
marked as fine when we inline it into the metadata. When the metadata
writeback happens we see it and abort the transaction properly and mark
the fs as readonly, however we don't do the mapping_set_error() on
anything. In syncfs() we will simply return 0 if the sb is marked
read-only, so we can't check for this in our syncfs callback. The only
way the error gets returned if we called mapping_set_error() on
something. Fix this by calling mapping_set_error() on the btree inode
mapping. This allows us to properly return an error on syncfs and pass
generic/484 with compression on.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2e39305 24-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: clear extent buffer uptodate when we fail to write it

I got dmesg errors on generic/281 on our overnight fstests. Looking at
the history this happens occasionally, with errors like this

WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50
CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G W 5.16.0-rc2+ #469
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffae598230bc60 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000
RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340
RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000
R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0
R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000
FS: 0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0
Call Trace:

extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1d6/0x430
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x270/0x5a0
worker_thread+0x55/0x3c0
? process_one_work+0x5a0/0x5a0
kthread+0x174/0x1a0
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30

This happens because we're trying to read from a extent buffer page that
is !PageUptodate. This happens because we will clear the page uptodate
when we have an IO error, but we don't clear the extent buffer uptodate.
If we do a read later and find this extent buffer we'll think its valid
and not return an error, and then trip over this warning.

Fix this by also clearing uptodate on the extent buffer when this
happens, so that we get an error when we do a btrfs_search_slot() and
find this block later.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f4f39fc5 08-Oct-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove btrfs_bio::logical member

The member btrfs_bio::logical is only initialized by two call sites:

- btrfs_repair_one_sector()
No corresponding site to utilize it.

- btrfs_submit_direct()
The corresponding site to utilize it is btrfs_check_read_dio_bio().

However for btrfs_check_read_dio_bio(), we can grab the file_offset from
btrfs_dio_private::file_offset directly.

Thus it turns out we don't really need that btrfs_bio::logical member at
all.

For btrfs_bio, the logical bytenr can be fetched from its
bio->bi_iter.bi_sector directly.

So let's just remove the member to save 8 bytes for structure btrfs_bio.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 84961539 05-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add a BTRFS_FS_ERROR helper

We have a few flags that are inconsistently used to describe the fs in
different states of failure. As of 5963ffcaf383 ("btrfs: always abort
the transaction if we abort a trans handle") we will always set
BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
and ERROR to see if things have gone wrong. Add a helper to check
BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
use the helper.

The TRANS_ABORTED bit check was added in af7227338135 ("Btrfs: clean up
resources during umount after trans is aborted") but is not actually
specific.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2749f7ef 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: avoid potential deadlock with compression and delalloc

[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:

mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156

[CAUSE]
If we have a file layout looks like below:

0 32K 64K 96K 128K
|//| |///////////////|
4K

Then we run delalloc range for the inode, it will:

- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).

This range will be COWed.

- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).

This range meets the condition for subpage compression, will go
through async COW path.

And async COW path will return @page_started.

But that @page_started is now for range [64K, 128K), not for range
[0, 64K).

- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.

Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).

This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not

Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.

[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.

So that we will never get incorrect @page_started.

And to prevent such problem from happening again:

- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.

Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().

This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().

- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.

- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.

- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e55a0de1 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: rework page locking in __extent_writepage()

Pages passed to __extent_writepage() are always locked, but they may be
locked by different functions.

There are two types of locked page for __extent_writepage():

- Page locked by plain lock_page()
It should not have any subpage::writers count.
Can be unlocked by unlock_page().
This is the most common locked page for __extent_writepage() called
inside extent_write_cache_pages() or extent_write_full_page().
Rarer cases include the @locked_page from extent_write_locked_range().

- Page locked by lock_delalloc_pages()
There is only one caller, all pages except @locked_page for
extent_write_locked_range().
In this case, we have to call subpage helper to handle the case.

So here we introduce a helper, btrfs_page_unlock_writer(), to allow
__extent_writepage() to unlock different locked pages.

And since for all other callers of __extent_writepage() their pages are
ensured to be locked by lock_page(), also add an extra check for
epd::extent_locked to unlock such pages directly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 66448b9d 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: make extent_write_locked_range() compatible

There are two sites are not subpage compatible yet for
extent_write_locked_range():

- How @nr_pages are calculated
For subpage we can have the following range with 64K page size:

0 32K 64K 96K 128K
| |////|/////| |

In that case, although 96K - 32K == 64K, thus it looks like one page
is enough, but the range spans two pages, not one.

Fix it by doing proper round_up() and round_down() to calculate
@nr_pages.

Also add some extra ASSERT()s to ensure the range passed in is already
aligned.

- How the page end is calculated
Currently we just use cur + PAGE_SIZE - 1 to calculate the page end.

Which can't handle the above range layout, and will trigger ASSERT()
in btrfs_writepage_endio_finish_ordered(), as the range is no longer
covered by the page range.

Fix it by taking page end into consideration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2bd0fc93 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: cleanup for extent_write_locked_range()

There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.

- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.

Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.

- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.

- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.

Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.

- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.

- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6a404910 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: make add_ra_bio_pages() compatible

[BUG]
If we remove the subpage limitation in add_ra_bio_pages(), then read a
compressed extent which has part of its range in next page, like the
following inode layout:

0 32K 64K 96K 128K
|<--------------|-------------->|

Btrfs will trigger ASSERT() in endio function:

assertion failed: atomic_read(&subpage->readers) >= nbits
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
Internal error: Oops - BUG: 0 [#1] SMP
Workqueue: btrfs-endio btrfs_work_helper [btrfs]
Call trace:
assertfail.constprop.0+0x28/0x2c [btrfs]
btrfs_subpage_end_reader+0x148/0x14c [btrfs]
end_page_read+0x8c/0x100 [btrfs]
end_bio_extent_readpage+0x320/0x6b0 [btrfs]
bio_endio+0x15c/0x1dc
end_workqueue_fn+0x44/0x64 [btrfs]
btrfs_work_helper+0x74/0x250 [btrfs]
process_one_work+0x1d4/0x47c
worker_thread+0x180/0x400
kthread+0x11c/0x120
ret_from_fork+0x10/0x30
---[ end trace c8b7b552d3bb408c ]---

[CAUSE]
When we read the page range [0, 64K), we find it's a compressed extent,
and we will try to add extra pages in add_ra_bio_pages() to avoid
reading the same compressed extent.

But when we add such page into the read bio, it doesn't follow the
behavior of btrfs_do_readpage() to properly set subpage::readers.

This means, for page [64K, 128K), its subpage::readers is still 0.

And when endio is executed on both pages, since page [64K, 128K) has 0
subpage::readers, it triggers above ASSERT()

[FIX]
Function add_ra_bio_pages() is far from subpage compatible, it always
assume PAGE_SIZE == sectorsize, thus when it skip to next range it
always just skip PAGE_SIZE.

Make it subpage compatible by:

- Skip to next page properly when needed
If we find there is already a page cache, we need to skip to next page.
For that case, we shouldn't just skip PAGE_SIZE bytes, but use
@pg_index to calculate the next bytenr and continue.

- Only add the page range covered by current extent map
We need to calculate which range is covered by current extent map and
only add that part into the read bio.

- Update subpage::readers before submitting the bio

- Use proper cursor other than confusing @last_offset

- Calculate the missed threshold based on sector size
It's no longer using missed pages, as for 64K page size, we have at
most 3 pages to skip. (If aligned only 2 pages)

- Add ASSERT() to make sure our bytenr is always aligned

- Add comment for the function
Add a special note for subpage case, as the function won't really
work well for subpage cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cf3075fb 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove unnecessary parameter delalloc_start for writepage_delalloc()

In function __extent_writepage() we always pass page start to
@delalloc_start for writepage_delalloc().

Thus we don't really need @delalloc_start parameter as we can extract it
from @page.

Remove @delalloc_start parameter and make __extent_writepage() to
declare @page_start and @page_end as const.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c3a3b19b 15-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename struct btrfs_io_bio to btrfs_bio

Previously we had "struct btrfs_bio", which records IO context for
mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
btrfs specific info for logical bytenr bio.

With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
"btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.

The struct btrfs_bio changes meaning by this commit. There was a
suggested name like btrfs_logical_bio but it's a bit long and we'd
prefer to use a shorter name.

This could be a concern for backports to older kernels where the
different meaning could possibly cause confusion or bugs. Comparing the
new and old structures, there's no overlap among the struct members so a
build would break in case of incorrect backport.

We haven't had many backports to bio code anyway so this is more of a
theoretical cause of bugs and a matter of precaution but we'll need to
keep the semantic change in mind.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cd8e0cca 15-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove btrfs_bio_alloc() helper

The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
bio->bi_iter.bi_sector.

However the naming itself is not using "btrfs_io_bio" to indicate its
parameter is "strcut btrfs_io_bio" and can be easily confused with
"struct btrfs_bio".

Considering assigned bio->bi_iter.bi_sector is such a simple work and
there are already tons of call sites doing that manually, there is no
need to do that in a helper.

Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
function to provide a fail-safe value for its @nr_iovecs.

And then replace all btrfs_bio_alloc() callers with
btrfs_io_bio_alloc().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c664611 15-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename btrfs_bio to btrfs_io_context

The structure btrfs_bio is used by two different sites:

- bio->bi_private for mirror based profiles
For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
how many mirrors are still pending, and save the original endio
function of the bio.

- RAID56 code
In that case, RAID56 only utilize the stripes info, and no long uses
that to trace the pending mirrors.

So btrfs_bio is not always bind to a bio, and contains more info for IO
context, thus renaming it will make the naming less confusing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 35156d85 08-Sep-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: only allow one process to add pages to a relocation inode

Don't allow more than one process to add pages to a relocation inode on
a zoned filesystem, otherwise we cannot guarantee the sequential write
rule once we're filling preallocated extents on a zoned filesystem.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38d5e541 03-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: unexport repair_io_failure()

Function repair_io_failure() is no longer used out of extent_io.c since
commit 8b9b6f255485 ("btrfs: scrub: cleanup the remaining nodatasum
fixup code"), which removes the last external caller.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d24fa5c1 23-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: convert latest_bdev type to btrfs_device and rename

In preparation to fix a bug in btrfs_show_devname().

Convert fs_devices::latest_bdev type from struct block_device to struct
btrfs_device and, rename the member to fs_devices::latest_dev.
So that btrfs_show_devname() can use fs_devices::latest_dev::name.

Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# be1a1d7a 19-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: finish fully written block group

If we have written to the zone capacity, the device automatically
deactivates the zone. Sync up block group side (the active BG list and
zone_is_active flag) with it.

We need to do it both on data BGs and metadata BGs. On data side, we add a
hook to btrfs_finish_ordered_io(). On metadata side, we use
end_extent_buffer_writeback().

To reduce excess lookup of a block group, we mark the last extent buffer in
a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
data (ordered_extent), because the address may change due to
REQ_OP_ZONE_APPEND.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 72a69cd0 17-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: pack all subpage bitmaps into a larger bitmap

Currently we use u16 bitmap to make 4k sectorsize work for 64K page
size.

But this u16 bitmap is not large enough to contain larger page size like
128K, nor is space efficient for 16K page size.

To handle both cases, here we pack all subpage bitmaps into a larger
bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for
subpage usage.

Each sub-bitmap will has its start bit number recorded in
btrfs_subpage_info::*_start, and its bitmap length will be recorded in
btrfs_subpage_info::bitmap_nr_bits.

All subpage bitmap operations will be converted from using direct u16
operations to bitmap operations, with above *_start calculated.

For 64K page size with 4K sectorsize, this should not cause much
difference.

While for 16K page size, we will only need 1 unsigned long (u32) to
store all the bitmaps, which saves quite some space.

Furthermore, this allows us to support larger page size like 128K and
258K.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 651fb419 17-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: make btrfs_alloc_subpage() return btrfs_subpage directly

The existing calling convention of btrfs_alloc_subpage() is pretty
awful. Change it to a more common pattern by returning struct
btrfs_subpage directly and let the caller to determine if the call
succeeded.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fdf250db 17-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: only call btrfs_alloc_subpage() when sectorsize is smaller than PAGE_SIZE

There are two call sites of btrfs_alloc_subpage():

- btrfs_attach_subpage()
We have ensured sectorsize is smaller than PAGE_SIZE

- alloc_extent_buffer()
We call btrfs_alloc_subpage() unconditionally.

The alloc_extent_buffer() forces us to check the sectorsize size against
page size inside btrfs_alloc_subpage().

Since the function name, btrfs_alloc_subpage(), already indicates it
should only get called for subpage cases, do the check in
alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 939c7feb 11-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix ordered extent boundary calculation

btrfs_lookup_ordered_extent() is supposed to query the offset in a file
instead of the logical address. Pass the file offset from
submit_extent_page() to calc_bio_boundaries().

Also, calc_bio_boundaries() relies on the bio's operation flag, so move
the call site after setting it.

Fixes: 390ed29b817e ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 14605409 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: initial fsverity support

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7361b4ae 28-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove the dead comment in writepage_delalloc()

When btrfs_run_delalloc_range() failed, we will error out.

But there is a strange comment mentioning that
btrfs_run_delalloc_range() could have returned value >0 to indicate the
IO has already started.

Commit 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack
usage") introduced the comment, but unfortunately at that time, we were
already using @page_started to indicate that case, and still return 0.

Furthermore, even if that comment was right (which is not), we would
return -EIO if the IO had already started.

By all means the comment is incorrect, just remove the comment along
with the dead check.

Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
make sure we either return 0 or error, no positive return value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 21dda654 21-Jul-2021 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

btrfs: fix argument type of btrfs_bio_clone_partial()

The offset and can never be negative use unsigned int instead of int
type for them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 963e4db8 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: unify regular and subpage error paths in __extent_writepage()

[BUG]
When running btrfs/160 in a loop for subpage with experimental
compression support, it has a high chance to crash (~20%):

BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:238!
Internal error: Oops - BUG: 0 [#1] SMP
pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
Call trace:
__btrfs_add_ordered_extent+0x550/0x670 [btrfs]
btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
run_delalloc_nocow+0x81c/0x8fc [btrfs]
btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
writepage_delalloc+0xc0/0x1ac [btrfs]
__extent_writepage+0xf4/0x370 [btrfs]
extent_write_cache_pages+0x288/0x4f4 [btrfs]
extent_writepages+0x58/0xe0 [btrfs]
btrfs_writepages+0x1c/0x30 [btrfs]
do_writepages+0x60/0x110
__filemap_fdatawrite_range+0x108/0x170
filemap_fdatawrite_range+0x20/0x30
btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
__btrfs_write_out_cache+0x34c/0x480 [btrfs]
btrfs_write_out_cache+0x144/0x220 [btrfs]
btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
btrfs_sync_fs+0x64/0x1cc [btrfs]
sync_fs_one_sb+0x3c/0x50
iterate_supers+0xcc/0x1d4
ksys_sync+0x6c/0xd0
__arm64_sys_sync+0x1c/0x30
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0x4c/0xd4
do_el0_svc+0x30/0x9c
el0_svc+0x2c/0x54
el0_sync_handler+0x1a8/0x1b0
el0_sync+0x198/0x1c0
---[ end trace 336f67369ae6e0af ]---

[CAUSE]
For subpage case, we can have multiple sectors inside a page, this makes
it possible for __extent_writepage() to have part of its page submitted
before returning.

In btrfs/160, we are using dm-dust to emulate write error, this means
for certain pages, we could have everything running fine, but at the end
of __extent_writepage(), one of the submitted bios fails due to dm-dust.

Then the page is marked Error, and we change @ret from 0 to -EIO.

This makes the caller extent_write_cache_pages() to error out, without
submitting the remaining pages.

Furthermore, since we're erroring out for free space cache, it doesn't
really care about the error and will update the inode and retry the
writeback.

Then we re-run the delalloc range, and will try to insert the same
delalloc range while previous delalloc range is still hanging there,
triggering the above error.

[FIX]
The proper fix is to handle errors from __extent_writepage() properly,
by ending the remaining ordered extent.

But that fix needs the following changes:

- Know at exactly which sector the error happened
Currently __extent_writepage_io() works for the full page, can't
return at which sector we hit the error.

- Grab the ordered extent covering the failed sector

As a hotfix for subpage case, here we unify the error paths in
__extent_writepage().

In fact, the "if (PageError(page))" branch never get executed if @ret is
still 0 for non-subpage cases.

As for non-subpage case, we never submit current page in
__extent_writepage(), but only add current page into bio.
The bio can only get submitted in next page.

Thus we never get PageError() set due to IO failure, thus when we hit
the branch, @ret is never 0.

By simply removing that @ret assignment, we let subpage case ignore the
IO failure, thus only error out for fatal errors just like regular
sectorsize.

So that IO error won't be treated as fatal error not trigger the hanging
OE problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e0eefe07 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: allow submit_extent_page() to do bio split

Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.

But this behavior can't handle subpage case at all.

For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.

This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.

The proper way to handle it is:

- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}

/* Add as many bytes into the bio */
added = btrfs_bio_add_page();

if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cc1d0d93 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix writeback which does not have ordered extent

[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30

[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().

Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.

This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.

In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.

But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.

This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c37a793 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: reset this_bio_flag to avoid inheriting old flags

In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
compressed extent.

This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
sector for one page, thus @this_bio_flag will only be set at most once.

But for subpage case, after hitting a compressed extent, @this_bio_flag
will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular
extent.

This will lead to various read errors, and causing new ASSERT() in
incoming subpage patches, which adds more strict check in
btrfs_submit_compressed_read().

Fix it by declaring @this_bio_flag inside the main loop and reset its
value for each iteration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 25c1252a 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered

The uptodate parameter should be bool, change the type.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a129ffb8 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove unused start and end parameters from btrfs_run_delalloc_range()

Since commit d75855b4518b ("btrfs: Remove
extent_io_ops::writepage_start_hook") removes the writepage_start_hook()
and adds btrfs_writepage_cow_fixup() function, there is no need to
follow the old hook parameters.

Remove the @start and @end hook, since currently the fixup check is full
page check, it doesn't need @start and @end hook.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a80d1c6 28-Jun-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: remove max_zone_append_size logic

There used to be a patch in the original series for zoned support which
limited the extent size to max_zone_append_size, but this patch has been
dropped somewhere around v9.

We've decided to go the opposite direction, instead of limiting extents
in the first place we split them before submission to comply with the
device's limits.

Remove the related code, btrfs_fs_info::max_zone_append_size and
btrfs_zoned_device_info::max_zone_append_size.

This also removes the workaround for dm-crypt introduced in
1d68128c107a ("btrfs: zoned: fail mount if the device does not support
zone append") because the fix has been merged as f34ee1dce642 ("dm
crypt: Fix zoned block device support").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d078efa 07-Jun-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix a rare race between metadata endio and eb freeing

[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.

No other reproducer so far.

The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().

[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:

T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |

This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.

But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.

Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.

[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.

Now btrfs_subpage::readers will be a member shared by both metadata and
data.

For metadata path, we don't do the page unlock as metadata only relies
on extent locking.

At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.

So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.

By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.

The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.

Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c5ef5c6c 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make __extent_writepage_io() only submit dirty range for subpage

__extent_writepage_io() function originally just iterates through all
the extent maps of a page, and submits any regular extents.

This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
need to submit the only sector contained in the page.

But for subpage case, one dirty page can contain several clean sectors
with at least one dirty sector.

If __extent_writepage_io() still submit all regular extent maps, it can
submit data which is already written to disk.
And since such already written data won't have corresponding ordered
extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().

Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submit the dirty range other than the
full extent.

Since we're also here, also modify the following calls to be subpage
compatible:

- SetPageError()
- end_page_writeback()

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d2a91064 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make btrfs_set_range_writeback() subpage compatible

Function btrfs_set_range_writeback() currently just sets the page
writeback unconditionally.

Change it to call the subpage helper so that we can handle both cases
well.

Since the subpage helpers needs btrfs_fs_info, also change the parameter
to accept btrfs_inode.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a33a8e9a 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: update locked page dirty/writeback/error bits in __process_pages_contig

When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
bit is updated, but dirty/writeback/error bits are all skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
All for their error handling branches.

For those call sites, since we skip the locked page for
dirty/error/writeback bit update, the locked page will still have its
subpage dirty bit remaining.

Normally it's the call sites which locked the page to handle the locked
page, but it won't hurt if we also do the update.

Especially there are already other call sites doing the same thing by
manually passing NULL as locked_page.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b945a463 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make page Ordered bit to be subpage compatible

This involves the following modification:

- Ordered extent creation
This is done in process_one_page(), now PAGE_SET_ORDERED will call
subpage helper to do the work.

- endio functions
This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

- btrfs_cleanup_ordered_extents()
Use the subpage page helper, and add an extra branch to exit if the
locked page have covered the full range.

Now the usage of page Ordered flag for ordered extent accounting is fully
subpage compatible.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1e1de387 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make process_one_page() to handle subpage locking

Introduce a new data inodes specific subpage member, writers, to record
how many sectors are under page lock for delalloc writing.

This member acts pretty much the same as readers, except it's only for
delalloc writes.

This is important for delalloc code to trace which page can really be
freed, as we have cases like run_delalloc_nocow() where we may exit
processing nocow range inside a page, but need to exit to do cow half
way.
In that case, we need a way to determine if we can really unlock a full
page.

With the new btrfs_subpage::writers, there is a new requirement:
- Page locked by process_one_page() must be unlocked by
process_one_page()
There are still tons of call sites manually lock and unlock a page,
without updating btrfs_subpage::writers.
So if we lock a page through process_one_page() then it must be
unlocked by process_one_page() to keep btrfs_subpage::writers
consistent.

This will be handled in next patch.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9047e317 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make end_bio_extent_writepage() to be subpage compatible

Now in end_bio_extent_writepage(), the only subpage incompatible code is
the end_page_writeback().

Just call the subpage helpers.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e38992be 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status

For __process_pages_contig() and process_one_page(), to handle subpage
we only need to pass bytenr in and call subpage helpers to handle
dirty/error/writeback status.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 321a02db 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: only require sector size alignment for end_bio_extent_writepage()

Just like read page, for subpage support we only require sector size
alignment.

So change the error message condition to only require sector alignment.

This should not affect existing code, as for regular sectorsize ==
PAGE_SIZE case, we are still requiring page alignment.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ed8f13bf 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: refactor page status update into process_one_page()

In __process_pages_contig() we update page status according to page_ops.

That update process is a bunch of 'if' branches, which lie inside
two loops, this makes it pretty hard to expand for later subpage
operations.

So this patch will extract these operations into its own function,
process_one_pages().

Also since we're refactoring __process_pages_contig(), also move the new
helper and __process_pages_contig() before the first caller of them, to
remove the forward declaration.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 98af9ab1 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: pass bytenr directly to __process_pages_contig()

As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.

So change the parameter and all callers to pass bytenr in.

With the modification, here we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive return value, meaning @processed_end
should be the last byte we processed.

If parameter @start is 0, and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().

So here for @processed_end, we use two different return value patterns.
If we have locked any page, @processed_end will be the last byte of
locked page.
Or it will be @start otherwise.

This change will impact lock_delalloc_pages(), so it needs to check
@processed_end to only unlock the range if we have locked any.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f57ad937 07-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename PagePrivate2 to PageOrdered inside btrfs

Inside btrfs we use Private2 page status to indicate we have an ordered
extent with pending IO for the sector.

But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
And with extra comment about the bit added, so reader who is still
uncertain about the page Ordered status, will find the comment pretty
easily.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38a39ac7 08-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()

There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().

It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.

Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.

By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.

Also, to cooperate with such modification, replace @page parameter for
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes page_index info, the existing start/len should be
enough for most usage.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fa04c165 26-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: make subpage metadata write path call its own endio functions

For subpage metadata, we're reusing two functions for subpage metadata
write:

- end_bio_extent_buffer_writepage()
- write_one_eb()

But the truth is, for subpage we just call
end_bio_subpage_eb_writepage() without using any bit in
end_bio_extent_buffer_writepage().

For write_one_eb(), it's pretty similar, but with a small part of code
reused.

There is really no need to pollute the existing code path if we're not
really using most of them.

So this patch will do the following change to separate the subpage
metadata write path from regular write path by:

- Use end_bio_subpage_eb_writepage() directly as endio in
write_one_subpage_eb()
- Directly call write_one_subpage_eb() in submit_eb_subpage()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 390ed29b 14-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: refactor submit_extent_page() to make bio and its flag tracing easier

There is a lot of code inside extent_io.c needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".

Such strange parameters are here for bio assembly.

For example, we have such inode page layout:

0 4K 8K 12K
|<-- Extent A-->|<- EB->|

Then what we do is:

- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio to bio_ret,
Add page [0, 4K) to *bio_ret.

- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.

- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.

This means we need to record both the bio and its bio_flag, but we
record them manually using those strange parameter list, other than
encapsulating them into their own structure.

So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.

Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.

This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundary is
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.

The following functions/structures are affected:

- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)

- end_write_bio()
- flush_write_bio()
Just change how bio is fetched

- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.

- calc_bio_boundaries()
New function

- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
bio.

- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 24880be5 21-Sep-2020 David Sterba <dsterba@suse.com>

btrfs: clean up header members offsets in write helpers

Move header offsetof() to the expression that calculates the address so
it's part of get_eb_offset_in_page where the 2nd parameter is the member
offset.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e7ff9e6b 18-May-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: factor out zoned device lookup

To be able to construct a zone append bio we need to look up the
btrfs_device. The code doing the chunk map lookup to get the device is
present in btrfs_submit_compressed_write and submit_extent_page.

Factor out the lookup calls into a helper and use it in the submission
paths.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1245835d 02-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove io_failure_record::in_validation

The io_failure_record::in_validation was introduced to handle failed bio
which cross several sectors. In such case, we still need to verify
which sectors are corrupted.

But since we've changed the way how we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for extra
validation any more.

This patch will cleanup all io_failure_record::in_validation related
code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 150e4b05 02-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: submit read time repair only for each corrupted sector

Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.

By this, each read repair can be easily handled without the need to
verify which sector is corrupted.

This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.

So this patch will:

- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.

- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:

* Good sector
We need to release the page and extent, set the range uptodate.

* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.

* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.

Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.

- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.

- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.

This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dc56219f 08-Apr-2021 Goldwyn Rodrigues <rgoldwyn@suse.de>

btrfs: correct try_lock_extent() usage in read_extent_buffer_subpage()

try_lock_extent() returns 1 on success or 0 for failure and not an error
code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
zero indicating subpage extent read success.

Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
extent.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e380adfc 18-May-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: pass start block to btrfs_use_zone_append

btrfs_use_zone_append only needs the passed in extent_map's block_start
member, so there's no need to pass in the full extent map.

This also enables the use of btrfs_use_zone_append in places where we only
have a start byte but no extent_map.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 15c7745c 06-Apr-2021 Boris Burkov <boris@bur.io>

btrfs: return whole extents in fiemap

`xfs_io -c 'fiemap <off> <len>' <file>`

can give surprising results on btrfs that differ from xfs.

btrfs prints out extents trimmed to fit the user input. If the user's
fiemap request has an offset, then rather than returning each whole
extent which intersects that range, we also trim the start extent to not
have start < off.

Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
that returning the whole extent is expected.

Some cases which all yield the same fiemap in xfs, but not btrfs:
dd if=/dev/zero of=$f bs=4k count=1
sudo xfs_io -c 'fiemap 0 1024' $f
0: [0..7]: 26624..26631
sudo xfs_io -c 'fiemap 2048 1024' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 2048 4096' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 3584 512' $f
0: [7..7]: 26631..26631
sudo xfs_io -c 'fiemap 4091 5' $f
0: [7..6]: 26631..26630

I believe this is a consequence of the logic for merging contiguous
extents represented by separate extent items. That logic needs to track
the last offset as it loops through the extent items, which happens to
pick up the start offset on the first iteration, and trim off the
beginning of the full extent. To fix it, start `off` at 0 rather than
`start` so that we keep the iteration/merging intact without cutting off
the start of the extent.

after the fix, all the above commands give:

0: [0..7]: 26624..26631

The merging logic is exercised by fstest generic/483, and I have written
a new fstest for checking we don't have backwards or zero-length fiemaps
for cases like those above.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# d048b9c2 04-May-2021 Ira Weiny <ira.weiny@intel.com>

btrfs: use memzero_page() instead of open coded kmap pattern

There are many places where kmap/memset/kunmap patterns occur.

Use the newly lifted memzero_page() to eliminate direct uses of kmap and
leverage the new core functions use of kmap_local_page().

The development of this patch was aided by the following coccinelle
script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memset/kunmap pattern and replace with memset*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate. Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// Then the memset pattern
//
@ memset_rule1 @
expression page, V, L, Off;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memset(ptr, 0, L);
+memzero_page(page, 0, L);
|
-memset(ptr + Off, 0, L);
+memzero_page(page, Off, L);
|
-memset(ptr, V, L);
+memset_page(page, V, 0, L);
|
-memset(ptr + Off, V, L);
+memset_page(page, V, Off, L);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memset_rule1
@
identifier memset_rule1.ptr;
type VP, VP1;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;

//
// Catch all
//
@ memset_rule2 @
expression page;
identifier ptr;
expression GenTo, GenSize, GenValue;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memset/memcpy
// The follow are catch alls which need to be evaluated by hand.
//
-memset(GenTo, 0, GenSize);
+memzero_pageExtra(page, GenTo, GenSize);
|
-memset(GenTo, GenValue, GenSize);
+memset_pageExtra(page, GenValue, GenTo, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memset_rule2
@
identifier memset_rule2.ptr;
type VP, VP1;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;

// </smpl>

Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e9306ad4 24-Feb-2021 Qu Wenruo <wqu@suse.com>

btrfs: more graceful errors/warnings on 32bit systems when reaching limits

Btrfs uses internally mapped u64 address space for all its metadata.
Due to the page cache limit on 32bit systems, btrfs can't access
metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K
page size while 256T for 64K page size.

Users can have a filesystem which doesn't have metadata beyond the
boundary at mount time, but later balance can cause it to create
metadata beyond the boundary.

And modification to MM layer is unrealistic just for such minor use
case. We can't do more than to prevent mounting such filesystem or warn
early when the numbers are still within the limits.

To address such problem, this patch will introduce the following checks:

- Mount time rejection
This will reject any fs which has metadata chunk at or beyond the
boundary.

- Mount time early warning
If there is any metadata chunk beyond 5/8th of the boundary, we do an
early warning and hope the end user will see it.

- Runtime extent buffer rejection
If we're going to allocate an extent buffer at or beyond the boundary,
reject such request with EOVERFLOW.
This is definitely going to cause problems like transaction abort, but
we have no better ways.

- Runtime extent buffer early warning
If an extent buffer beyond 5/8th of the max file size is allocated, do
an early warning.

Above error/warning message will only be printed once for each fs to
reduce dmesg flood.

If the mount is rejected, the filesystem will be mountable only on a
64bit host.

Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/
Reported-by: Erik Jensen <erikjensen@rkjnsn.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c4aec299 05-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

The new function, submit_eb_subpage(), will submit all the dirty extent
buffers in the page.

The major difference between submit_eb_page() and submit_eb_subpage()
is:
- How to grab extent buffer
Now we use find_extent_buffer_nospinlock() other than using
page::private.

All other different handling is already done in functions like
lock_extent_buffer_for_io() and write_one_eb().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f3156df9 05-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: make lock_extent_buffer_for_io() to be subpage compatible

For subpage metadata, we don't use page locking at all. So just skip
the page locking part for subpage. The rest of the function can be
reused.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 35b6ddfa 05-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: introduce write_one_subpage_eb() function

The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.

The major differences between the new write_one_subpage_eb() and
write_one_eb() is:

- No page locking
When entering write_one_subpage_eb() the page is no longer locked.
We only lock the page for its status update, and unlock immediately.
Now we completely rely on extent io tree locking.

- Extra bitmap update along with page status update
Now page dirty and writeback is controlled by
btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
They both follow the schema that any sector is dirty/writeback, then
the full page gets dirty/writeback.

- When to update the nr_written number
Now we take a shortcut, if we have cleared the last dirty bit of the
page, we update nr_written.
This is not completely perfect, but should emulate the old behavior
well enough.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2f3186d8 05-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: introduce end_bio_subpage_eb_writepage() function

The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.

The major differences involved are:

- How to grab extent buffer
Now page::private is a pointer to btrfs_subpage, we can no longer grab
extent buffer directly.
Thus we need to use the bv_offset to locate the extent buffer manually
and iterate through the whole range.

- Use btrfs_subpage_end_writeback() caller
This helper will handle the subpage writeback for us.

Since this function is executed under endio context, when grabbing
extent buffers it can't grab eb->refs_lock as that lock is not designed
to be grabbed under hardirq context.

So here introduce a helper, find_extent_buffer_nolock(), for such
situation, and convert find_extent_buffer() to use that helper.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 32c0a6bc 21-Mar-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: add and use readahead_batch_length

Implement readahead_batch_length() to determine the number of bytes in
the current batch of readahead pages and use it in btrfs. Also use the
readahead_pos to get the offset.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a2c6075 25-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible

Current set_btree_ioerr() only accepts @page parameter and grabs extent
buffer from page::private. This works fine for sector size == PAGE_SIZE
case, but not for subpage case.

Add an extra parameter, @eb, for callers to pass extent buffer to this
function, so that subpage code can reuse this function.

And also add subpage special handling to update
btrfs_subpage::error_bitmap.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d27797e 25-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: make set/clear_extent_buffer_dirty() subpage compatible

For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.

For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.

This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.

Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.

But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.

[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:

T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH

The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b8f95771 25-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: support page uptodate assertions in subpage mode

There are quite some assert checks on page uptodate in extent buffer
write accessors. They ensure the destination page is already uptodate.

This is fine for regular sector size case, but not for subpage case, as
for subpage we only mark the page uptodate if the page contains no hole
and all its extent buffers are uptodate.

So instead of checking PageUptodate(), for subpage case we check the
uptodate bitmap of btrfs_subpage structure.

To make the check more elegant, introduce a helper,
assert_eb_page_uptodate() to do the check for both subpage and regular
sector size cases.

The following functions are involved:

- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer()
- extent_buffer_test_bit()
- extent_buffer_bitmap_set()
- extent_buffer_bitmap_clear()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1e5eb3d6 25-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: make alloc_extent_buffer() check subpage dirty bitmap

In alloc_extent_buffer(), we make sure that the newly allocated page is
never dirty.

This is fine for sector size == PAGE_SIZE case, but for subpage it's
possible that one extent buffer in the page is dirty, thus the whole
page is marked dirty, and could cause false alert.

To support subpage, call btrfs_page_test_dirty() to handle both cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cea62800 16-Mar-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: remove duplicated in_range() macro

The in_range() macro is defined twice in btrfs' source, once in ctree.h
and once in misc.h.

Remove the definition in ctree.h and include misc.h in the files depending
on it.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5e295768 03-Mar-2021 Goldwyn Rodrigues <rgoldwyn@suse.de>

btrfs: remove mirror argument from btrfs_csum_verify_data()

The parameter mirror is not used and does not make sense for checksum
verification of the given bio.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d9bb77d5 14-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix wild pointer access during metadata read failure

[BUG]
When running fstests for btrfs subpage read-write test, it has a very
high chance to crash at generic/475 with the following stack:

BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10
Unable to handle kernel paging request at virtual address ffff80001157e7c0
CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G WC 5.12.0-rc2-custom+ #5
Hardware name: Khadas VIM3 (DT)
Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
pc : queued_spin_lock_slowpath+0x1a0/0x390
lr : do_raw_spin_lock+0xc4/0x11c
Call trace:
queued_spin_lock_slowpath+0x1a0/0x390
_raw_spin_lock+0x68/0x84
btree_readahead_hook+0x38/0xc0 [btrfs]
end_bio_extent_readpage+0x504/0x5f4 [btrfs]
bio_endio+0x170/0x1a4
end_workqueue_fn+0x3c/0x60 [btrfs]
btrfs_work_helper+0x1b0/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827)

[CAUSE]
In end_bio_extent_readpage(), if we hit an error during read, we will
handle the error differently for data and metadata.
For data we queue a repair, while for metadata, we record the error and
let the caller choose what to do.

But the code is still using page->private to grab extent buffer, which
no longer points to extent buffer for subpage metadata pages.

Thus this wild pointer access leads to above crash.

[FIX]
Introduce a helper, find_extent_buffer_readpage(), to grab extent
buffer.

The difference against find_extent_buffer_nospinlock() is:

- Also handles regular sectorsize == PAGE_SIZE case
- No extent buffer refs increase/decrease
As extent buffer under IO must have non-zero refs, so this is safe

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d2dcc8ed 08-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: fix wrong offset to zero out range beyond i_size

[BUG]
The test generic/091 fails , with the following output:

fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
mapped writes DISABLED
Seed set to 1
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
skipping zero size read
truncating to largest ever: 0xe400
copying to largest ever: 0x1f400
cloning to largest ever: 0x70000
cloning to largest ever: 0x77000
fallocating to largest ever: 0x7a120
Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<<
...

[CAUSE]
In commit c28ea613fafa ("btrfs: subpage: fix the false data csum mismatch error")
end_bio_extent_readpage() changes to only zero the range inside the bvec
for incoming subpage support.

But that commit is using incorrect offset to calculate the start.

For subpage, we can have a case that the whole bvec is beyond isize,
thus we need to calculate the correct offset.

But the offending commit is using @end (bvec end), other than @start
(bvec start) to calculate the start offset.

This means, we only zero the last byte of the bvec, not from the isize.
This stupid bug makes the range beyond isize is not properly zeroed, and
failed above test.

[FIX]
Use correct @start to calculate the range start.

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: c28ea613fafa ("btrfs: subpage: fix the false data csum mismatch error")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a8affc03 10-Mar-2021 Christoph Hellwig <hch@lst.de>

block: rename BIO_MAX_PAGES to BIO_MAX_VECS

Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c28ea613 01-Mar-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix the false data csum mismatch error

[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).

With some extra debug info added, we have the following traces:

0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0

[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.

But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.

This means for subpage if we have an inode whose content looks like below:

0 16K 32K 48K 64K
|///////| |///////| |

|//| = data needs to be read from disk
| | = hole

And i_size is 64K initially.

Then the following race can happen:

T1 | T2
--------------------------------+--------------------------------
btrfs_do_readpage() |
|- isize = 64K; |
| At this time, the isize is |
| 64K |
| |
|- submit_extent_page() |
| submit previous assembled bio|
| assemble bio for [0, 16K) |
| |
|- submit_extent_page() |
submit read bio for [0, 16K) |
assemble read bio for |
[32K, 48K) |
|
| btrfs_setsize()
| |- i_size_write(, 16K);
| Now i_size is only 16K
end_io() for [0K, 16K) |
|- end_bio_extent_readpage() |
|- btrfs_verify_data_csum() |
| No csum error |
|- i_size = 16K; |
|- zero_user_segment(16K, |
PAGE_SIZE); |
!!! We zeroed range |
!!! [32K, 48K) |
| end_io for [32K, 48K)
| |- end_bio_extent_readpage()
| |- btrfs_verify_data_csum()
| ! CSUM MISMATCH !
| ! As the range is zeroed now !

[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.

The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7ef5287 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: relocate block group to repair IO failure in zoned filesystems

When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.

We can consider three methods to repair an IO failure in zoned filesystems:

(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group

Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.

Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.

Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0bc09ca1 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: serialize metadata IO

We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d8e3fb10 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: use ZONE_APPEND write for zoned mode

Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.

Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.

Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cacb2cea 04-Feb-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: check if bio spans across an ordered extent

To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.

Ensure that constructing bio does not span more than an ordered extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e1326f03 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: use bio_add_zone_append_page

A zoned device has its own hardware restrictions e.g. max_zone_append_size
when using REQ_OP_ZONE_APPEND. To follow these restrictions, use
bio_add_zone_append_page() instead of bio_add_page(). We need target device
to use bio_add_zone_append_page(), so this commit reads the chunk
information to cache the target device to btrfs_io_bio(bio)->device.

Caching only the target device is sufficient here as zoned filesystems
only supports the single profile at the moment. Once more profiles will be
supported btrfs_io_bio can hold an extent_map to be able to check for the
restrictions of all devices the btrfs_bio will be mapped to.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 953651eb 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: factor out helper adding a page to bio

Factor out adding a page to a bio from submit_extent_page(). The page
is added only when bio_flags are the same, contiguous and the added page
fits in the same stripe as pages in the bio.

Condition checks are reordered to allow early return to avoid possibly
heavy btrfs_bio_fits_in_stripe() calling.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d3575156 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: redirty released extent buffers

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2c4d8cb7 28-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: explain page locking and readahead in read_extent_buffer_pages()

In read_extent_buffer_pages(), if we failed to lock the page atomically,
we just exit with return value 0.

This is counter-intuitive, as normally if we can't lock what we need, we
would return something like EAGAIN.

But that return hides under (wait == WAIT_NONE) branch, which only gets
triggered for readahead.

And for readahead, if we failed to lock the page, it means the extent
buffer is either being read by other thread, or has been read and is
under modification. Either way the eb will or has been cached, thus
readahead has no need to wait for it.

Add comment on this counter-intuitive behavior.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 92082d40 01-Feb-2021 Qu Wenruo <wqu@suse.com>

btrfs: integrate page status update for data read path into begin/end_page_read

In btrfs data page read path, the page status update are handled in two
different locations:

btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}

end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}

This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.

But for subpage, there is more work to be done in page status update:

- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.

- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.

To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:

- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.

- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.

The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.

This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.

Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 32443de3 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: introduce btrfs_subpage for data inodes

To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:

- We have page->mapping set
page->mapping->host is used to grab btrfs_fs_info, thus we can only
call this function after page is mapped to an inode.

One call site attaches pages to inode manually, thus we have to modify
the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
Since memory allocation can fail, we have to do extra error handling.
Calling set_page_extent_mapped() as soon as possible can simply the
error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.

Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).

So we will stick to btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4325cb22 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: support subpage in endio_readpage_update_page_status()

To handle subpage status update, add the following:

- Use btrfs_page_*() subpage-aware helpers to update page status
Now we can handle both cases well.

- No page unlock for subpage metadata
Since subpage metadata doesn't utilize page locking at all, skip it.
For subpage data locking, it's handled in later commits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4012daf7 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: introduce read_extent_buffer_subpage()

Introduce a helper, read_extent_buffer_subpage(), to do the subpage
extent buffer read.

The difference between regular and subpage routines are:

- No page locking
Here we completely rely on extent locking.
Page locking can reduce the concurrency greatly, as if we lock one
page to read one extent buffer, all the other extent buffers in the
same page will have to wait.

- Extent uptodate condition
Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE check,
We also need to check btrfs_subpage::uptodate_bitmap.

- No page iteration
Just one page, no need to loop, this greatly simplified the subpage
routine.

This patch only implements the bio submit part, no endio support yet.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d1e86e3f 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: support subpage in try_release_extent_buffer()

Unlike the original try_release_extent_buffer(),
try_release_subpage_extent_buffer() will iterate through all the ebs in
the page, and try to release each.

We can release the full page only after there's no private attached,
which means all ebs of that page have been released as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 92d83e94 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: support subpage in btrfs_clone_extent_buffer

For btrfs_clone_extent_buffer(), it's mostly the same code of
__alloc_dummy_extent_buffer(), except it has extra page copy.

So to make it subpage compatible, we only need to:

- Call set_extent_buffer_uptodate() instead of SetPageUptodate()
This will set correct uptodate bit for subpage and regular sector size
cases.

Since we're calling set_extent_buffer_uptodate() which will also set
EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit
either.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 251f2acc 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: support subpage in set/clear_extent_buffer_uptodate()

To support subpage in set_extent_buffer_uptodate and
clear_extent_buffer_uptodate we only need to use the subpage-aware
helpers to update the page bits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 09bc1f0f 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: attach private to dummy extent buffer pages

There are locations where we allocate dummy extent buffers for temporary
usage, like in tree_mod_log_rewind() or get_old_root().

These dummy extent buffers will be handled by the same eb accessors, and
if they don't have page::private subpage eb accessors could fail.

To address such problems, make __alloc_dummy_extent_buffer() attach
page private for dummy extent buffers too.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8ff8466d 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: support subpage for extent buffer page release

In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.

Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.

For subpage case, handle detaching page private.

For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.

For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.

But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:

Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.

T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |

Then we have a metadata eb whose page has no private bit.

To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.

In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.

The section is marked by:

- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81982210 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: make grab_extent_buffer_from_page() handle subpage case

For subpage case, grab_extent_buffer() can't really get an extent buffer
just from btrfs_subpage.

We have radix tree lock protecting us from inserting the same eb into
the tree. Thus we don't really need to do the extra hassle, just let
alloc_extent_buffer() handle the existing eb in radix tree.

Now if two ebs are being allocated as the same time, one will fail with
-EEIXST when inserting into the radix tree.

So for grab_extent_buffer(), just always return NULL for subpage case.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 760f991f 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: make attach_extent_buffer_page() handle subpage case

For subpage case, we need to allocate additional memory for each
metadata page.

So we need to:

- Allow attach_extent_buffer_page() to return int to indicate allocation
failure

- Allow manually pre-allocate subpage memory for alloc_extent_buffer()
As we don't want to use GFP_ATOMIC under spinlock, we introduce
btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
purpose.
(The simple wrap for btrfs_free_subpage() is for later convert to
kmem_cache. Already internally tested without problem)

- Preallocate btrfs_subpage structure for alloc_extent_buffer()
We don't want to call memory allocation with spinlock held, so
do preallocation before we acquire mapping->private_lock.

- Handle subpage and regular case differently in
attach_extent_buffer_page()
For regular case, no change, just do the usual thing.
For subpage case, allocate new memory or use the preallocated memory.

For future subpage metadata, we will make use of radix tree to grab
extent buffer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 62c053fb 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support

For the incoming subpage support, UNMAPPED extent buffer will have
different behavior in btrfs_release_extent_buffer().

This means we need to set UNMAPPED bit early before calling
btrfs_release_extent_buffer().

Currently there is only one caller which relies on
btrfs_release_extent_buffer() in its error path while set UNMAPPED bit
late:
- btrfs_clone_extent_buffer()

Make it subpage compatible by setting the UNMAPPED bit early, since
we're here, also move the UPTODATE bit early.

There is another caller, __alloc_dummy_extent_buffer(), setting
UNMAPPED bit late, but that function clean up the allocated page
manually, thus no need for any modification.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6869b0a8 26-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK

PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
__process_pages_contig(), to let the function know to clear page dirty
bit and then set page writeback.

However page writeback and dirty bits are conflicting (at least for
sector size == PAGE_SIZE case), this means these two have to be always
updated together.

This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
PAGE_START_WRITEBACK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2e626e56 24-Jan-2021 Nigel Christian <nigel.l.christian@gmail.com>

btrfs: remove repeated word in struct member comment

Comment for processed extent end of range has an unnecessary "in",
remove it.

Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3bed2da1 22-Jan-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: fix parameter description for functions in extent_io.c

This makes the file W=1 clean and fixes the following warnings:

fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages'
fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c0f0a9e7 05-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: introduce helper to grab an existing extent buffer from a page

This patch will extract the code to grab an extent buffer from a page
into a helper, grab_extent_buffer_from_page().

This reduces one indent level, and provides the work place for later
expansion for subapge support.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6bc5636a 05-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: refactor __extent_writepage_io() to improve readability

The refactoring involves the following modifications:

- iosize alignment
In fact we don't really need to manually do alignment at all.
All extent maps should already be aligned, thus basic ASSERT() check
would be enough.

- redundant variables
We have extra variable like blocksize/pg_offset/end.
They are all unnecessary.

@blocksize can be replaced by sectorsize size directly, and it's only
used to verify the em start/size is aligned.

@pg_offset can be easily calculated using @cur and page_offset(page).

@end is just assigned from @page_end and never modified, use
"start + PAGE_SIZE - 1" directly and remove @page_end.

- remove some BUG_ON()s
The BUG_ON()s are for extent map, which we have tree-checker to check
on-disk extent data item and runtime check.
ASSERT() should be enough.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0c64c33c 05-Jan-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename parameter offset to disk_bytenr in submit_extent_page

The parameter offset is confusing, it's supposed to be the disk bytenr
of metadata/data. Rename it to disk_bytenr and update the comment.

Also rename each offset passed to submit_extent_page() as @disk_bytenr
so they're consistent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 29b665cc 03-Jan-2021 Su Yue <l@damenly.su>

btrfs: prevent NULL pointer dereference in extent_io_tree_panic

Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.

Pass tree->fs_info as it's known to be valid in all cases.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 884b07d0 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors

To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.

Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, we must use eb->start to determine the real offset
to read/write for extent buffer accessors.

This patch introduces two helpers to do this:

- get_eb_page_index()
This is to calculate the index to access extent_buffer::pages.
It's just a simple wrapper around "start >> PAGE_SHIFT".

For sectorsize == PAGE_SIZE case, nothing is changed.
For sectorsize < PAGE_SIZE case, we always get index as 0, and
the existing page shift also works.

- get_eb_offset_in_page()
This is to calculate the offset to access extent_buffer::pages.
This needs to take extent_buffer::start into consideration.

For sectorsize == PAGE_SIZE case, extent_buffer::start is always
aligned to PAGE_SIZE, thus adding extent_buffer::start to
offset_in_page() won't change the result.
For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
us the correct offset to access.

This patch will touch the following parts to cover all extent buffer
accessors:

- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1aaac38c 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: don't allow tree block to cross page boundary for subpage support

As a preparation for subpage sector size support (allowing filesystem
with sector size smaller than page size to be mounted) if the sector
size is smaller than page size, we don't allow tree block to be read if
it crosses 64K(*) boundary.

The 64K is selected because:

- we are only going to support 64K page size for subpage for now
- 64K is also the maximum supported node size

This ensures that tree blocks are always contained in one page for a
system with 64K page size, which can greatly simplify the handling.

Otherwise we would have to do complex multi-page handling of tree
blocks. Currently there is no way to create such tree blocks.

In kernel we have avoided such tree blocks allocation even on 4K page
size, as it can lead to RAID56 stripe scrubbing.

While btrfs-progs have fixed its chunk allocator since 2016 for convert,
and has extra checks to do the same behavior as the kernel.

Just add such graceful checks in case of an ancient filesystem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# deb67895 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: calculate inline extent buffer page size based on page size

Btrfs only support 64K as maximum node size, thus for 4K page system, we
would have at most 16 pages for one extent buffer.

For a system using 64K page size, we would really have just one page.

While we always use 16 pages for extent_buffer::pages, this means for
systems using 64K pages, we are wasting memory for 15 page pointers
which will never be used.

Calculate the array size based on page size and the node size maximum.

- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page

Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f91e0d0c 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: factor out btree page submission code to a helper

In btree_write_cache_pages() we have a btree page submission routine
buried deeply in a nested loop.

This patch will extract that part of code into a helper function,
submit_eb_page(), to do the same work.

Since submit_eb_page() now can return >0 for successful extent
buffer submission, remove the "ASSERT(ret <= 0);" line.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7ffd27e3 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: pass bio_offset to check_data_csum() directly

Parameter icsum for check_data_csum() is a little hard to understand.
So is the phy_offset for btrfs_verify_data_csum().

Both parameters are calculated values for csum lookup.

Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.

Since we are here, also make the bio_offset parameter and some related
variables to be u32 (unsigned int).
As bio size is limited by its bi_size, which is unsigned int, and has
extra size limit check during various bio operations.
Thus we are ensured that bio_offset won't overflow u32.

Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1201b58b 26-Nov-2020 David Sterba <dsterba@suse.com>

btrfs: drop casts of bio bi_sector

Since commit 72deb455b5ec ("block: remove CONFIG_LBDAF") (5.2) the
sector_t type is u64 on all arches and configs so we don't need to
typecast it. It used to be unsigned long and the result of sector size
shifts were not guaranteed to fit in the type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb22e9c4 13-Nov-2020 Qu Wenruo <wqu@suse.com>

btrfs: use detach_page_private() in alloc_extent_buffer()

In alloc_extent_buffer(), after we got a page from btree inode, we check
if that page has private pointer attached.

If attached, we check if the existing extent buffer has proper refs.
If not (the eb is being freed), we will detach that private eb pointer.

The point here is, we are detaching that eb pointer by calling:
- ClearPagePrivate()
- put_page()

The put_page() here is especially confusing, as it's decreasing the ref
from attach_page_private(). Without knowing that, it looks like the
put_page() is for the find_or_create_page() call, confusing the reader.

Since we're always modifying page private with attach_page_private() and
detach_page_private(), the only open-coded detach_page_private() here is
really confusing.

Fix it by calling detach_page_private().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 829ddec9 13-Nov-2020 Qu Wenruo <wqu@suse.com>

btrfs: only clear EXTENT_LOCK bit in extent_invalidatepage

extent_invalidatepage() will try to clear all possible bits since it's
calling clear_extent_bit() with delete == 1.

This is currently fine, since for btree io tree, it only utilizes
EXTENT_LOCK bit. But this could be a problem for later subpage support,
which will utilize extra io tree bit to represent additional info.

This patch will just convert that clear_extent_bit() to
unlock_extent_cached().

For current code since only EXTENT_LOCKED bit is utilized, this doesn't
change the behavior, but provides a much cleaner basis for incoming
subpage support.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8e1dc982 12-Nov-2020 Qu Wenruo <wqu@suse.com>

btrfs: remove unused parameter phy_offset from btrfs_validate_metadata_buffer

Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.

But for metadata, it's completely useless as metadata stores their own
csum in its header, so we can remove it.

Note: parameters @start and @end, they are not utilized at all for
current sectorsize == PAGE_SIZE case, as we can grab eb directly from
page.

But those two parameters are very important for later subpage support,
thus @start/@len are not touched here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f97e27e9 13-Nov-2020 Qu Wenruo <wqu@suse.com>

btrfs: use fixed width int type for extent_state::state

Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e09caaf9 13-Nov-2020 Qu Wenruo <wqu@suse.com>

btrfs: introduce helper to handle page status update in end_bio_extent_readpage()

Introduce a new helper to handle update page status in
end_bio_extent_readpage(). This will be later used for subpage support
where the page status update can be more complex than now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 94e8c95c 13-Nov-2020 Qu Wenruo <wqu@suse.com>

btrfs: add structure to keep track of extent range in end_bio_extent_readpage

In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.

Hidden behind the strange dance is, it's just calling
endio_readpage_release_extent() on each bvec range.

Here is an example to explain the original work flow:

Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)

end_bio_extent_extent_readpage() entered
|- extent_start = 0;
|- extent_end = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- update = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_1M */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + 4K;
| |- end = SZ_1M + 4K - 1;
| |- update = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
update);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */

As the above flow shows, the existing code in end_bio_extent_readpage()
is accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.

However current behavior has something not really considered:

- The inode can change
For bio, its pages don't need to have contiguous page_offset.
This means, even pages from different inodes can be packed into one
bio.

- bvec cross page boundary
There is a feature called multi-page bvec, where bvec->bv_len can go
beyond bvec->bv_page boundary.

- Poor readability

This patch will address the problem:

- Introduce a proper structure, processed_extent, to record processed
extent range

- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()

- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.

- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cab5e72 05-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: merge __set_extent_bit and set_extent_bit

There are only 2 direct calls to set_extent_bit outside of extent-io -
in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
thin wrappers around __set_extent_bit. This adds unnecessary indirection
and just makes it more annoying when looking at the various extent bit
manipulation functions. This patch renames __set_extent_bit to
set_extent_bit effectively removing a level of indirection. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat and remove __must_check ]
Signed-off-by: David Sterba <dsterba@suse.com>


# a55463c9 06-Nov-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: remove extent_buffer::recursed

It is unused everywhere now, it can be removed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2766ff61 04-Nov-2020 Filipe Manana <fdmanana@suse.com>

btrfs: update the number of bytes used by an inode atomically

There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.

In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:

https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

The cases where this can happen are the following:

-> Case 1

If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).

This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.

If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).

Example reproducer:

$ cat reproducer-1.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got

while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT

xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)

# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!

for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done

kill $loop_pid &> /dev/null
wait

umount $DEV

$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

-> Case 2

If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.

This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.

Example reproducer:

$ cat reproducer-2.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got

while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT

touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)

# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar

kill $loop_pid &> /dev/null
wait $loop_pid
done

umount $DEV

$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

-> Case 3

Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.

The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.

Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.

Example reproducer:

$ cat reproducer-3.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT

extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))

# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null

expected=$(stat -c %b $MNT/foo)

# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done

umount $DEV

$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

So fix this by:

1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;

2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;

3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.

An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e114c545 05-Nov-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: set the lockdep class for extent buffers on creation

Both Filipe and Fedora QA recently hit the following lockdep splat:

WARNING: possible recursive locking detected
5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted
--------------------------------------------
rsync/2610 is trying to acquire lock:
ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

but task is already holding lock:
ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&eb->lock);
lock(&eb->lock);

*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by rsync/2610:
#0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
#1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

stack backtrace:
CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x8b/0xb0
__lock_acquire.cold+0x12d/0x2a4
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
lock_acquire+0xc8/0x400
? btrfs_tree_read_lock_atomic+0x34/0x140
? read_block_for_search.isra.0+0xdd/0x320
_raw_read_lock+0x3d/0xa0
? btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_search_slot+0x616/0x9a0
btrfs_lookup_dir_item+0x6c/0xb0
btrfs_lookup_dentry+0xa8/0x520
? lockdep_init_map_waits+0x4c/0x210
btrfs_lookup+0xe/0x30
__lookup_slow+0x10f/0x1e0
walk_component+0x11b/0x190
path_lookupat+0x72/0x1c0
filename_lookup+0x97/0x180
? strncpy_from_user+0x96/0x1e0
? getname_flags.part.0+0x45/0x1a0
vfs_statx+0x64/0x100
? lockdep_hardirqs_on_prepare+0xff/0x180
? _raw_spin_unlock_irqrestore+0x41/0x50
__do_sys_newlstat+0x26/0x40
? lockdep_hardirqs_on_prepare+0xff/0x180
? syscall_enter_from_user_mode+0x27/0x80
? syscall_enter_from_user_mode+0x27/0x80
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9

I have also seen a report of lockdep complaining about the lock class
that was looked up being the same as the lock class on the lock we were
using, but I can't find the report.

These are problems that occur because we do not have the lockdep class
set on the extent buffer until _after_ we read the eb in properly. This
is problematic for concurrent readers, because we will create the extent
buffer, lock it, and then attempt to read the extent buffer.

If a second thread comes in and tries to do a search down the same path
they'll get the above lockdep splat because the class isn't set properly
on the extent buffer.

There was a good reason for this, we generally didn't know the real
owner of the eb until we read it, specifically in refcounted roots.

However now all refcounted roots have the same class name, so we no
longer need to worry about this. For non-refcounted trees we know
which root we're on based on the parent.

Fix this by setting the lockdep class on the eb at creation time instead
of read time. This will fix the splat and the weirdness where the class
changes in the middle of locking the block.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3fbaf258 05-Nov-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: pass the owner_root and level to alloc_extent_buffer

Now that we've plumbed all of the callers to have the owner root and the
level, plumb it down into alloc_extent_buffer().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bfb484d9 05-Nov-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: cleanup extent buffer readahead

We're going to pass around more information when we allocate extent
buffers, in order to make that cleaner how we do readahead. Most of the
callers have the parent node that we're getting our blockptr from, with
the sole exception of relocation which simply has the bytenr it wants to
read.

Add a helper that takes the current arguments that we need (bytenr and
gen), and add another helper for simply reading the slot out of a node.
In followup patches the helper that takes all the extra arguments will
be expanded, and the simpler helper won't need to have it's arguments
adjusted.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 478ef886 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: make buffer_radix take sector size units

For subpage sector size support, one page can contain multiple tree
blocks. The entries cannot be based on page size and index must be
derived from the sectorsize. No change for page size == sector size.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d01e247 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: assert page mapping lock in attach_extent_buffer_page

When calling attach_extent_buffer_page(), either we're attaching
anonymous pages, called from btrfs_clone_extent_buffer(),
or we're attaching btree inode pages, called from alloc_extent_buffer().

For the latter case, we should hold page->mapping->private_lock to avoid
parallel changes to page->private.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b9729ce0 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: locking: rip out path->leave_spinning

We no longer distinguish between blocking and spinning, so rip out all
this code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 223486c2 02-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: switch cached fs_info::csum_size from u16 to u32

The fs_info value is 32bit, switch also the local u16 variables. This
leads to a better assembly code generated due to movzwl.

This simple change will shave some bytes on x86_64 and release config:

text data bss dec hex filename
1090000 17980 14912 1122892 11224c pre/btrfs.ko
1089794 17980 14912 1122686 11217e post/btrfs.ko

DELTA: -206

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55fc29be 29-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: use cached value of fs_info::csum_size everywhere

btrfs_get_16 shows up in the system performance profiles (helper to read
16bit values from on-disk structures). This is partially because of the
checksum size that's frequently read along with data reads/writes, other
u16 uses are from item size or directory entries.

Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 265fdfa6 01-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits

The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e940e9a7 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: rename page_size to io_size in submit_extent_page

The variable @page_size in submit_extent_page() is not related to page
size.

It can already be smaller than PAGE_SIZE, so rename it to io_size to
reduce confusion, this is especially important for later subpage
support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8b8bbd46 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: only require sector size alignment for page read

If we're reading partial page, btrfs will warn about this as read/write
is always done in sector size, which now equals page size.

But for the upcoming subpage read-only support, our data read is only
aligned to sectorsize, which can be smaller than page size.

Thus here we change the warning condition to check it against
sectorsize, the behavior is not changed for regular sectorsize ==
PAGE_SIZE case, and won't report error for subpage read.

Also, pass the proper start/end with bv_offset for check_data_csum() to
handle.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 12e3360f 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: rename pages_locked in process_pages_contig()

Function process_pages_contig() does not only handle page locking but
also other operations. Rename the local variable pages_locked to
pages_processed to reduce confusion.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3f6bb4ae 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: sink the failed_start parameter to set_extent_bit

The @failed_start parameter is only paired with @exclusive_bits, and
those parameters are only used for EXTENT_LOCKED bit, which have their
own wrappers lock_extent_bits().

Thus for regular set_extent_bit() calls, the failed_start makes no
sense, just sink the parameter.

Also, since @failed_start and @exclusive_bits are used in pairs, add
an assert to make it obvious.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 03509b78 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: update the comment for find_first_extent_bit

The pitfall here is, if the parameter @bits has multiple bits set, we
will return the first range which just has one of the specified bits
set.

This is a little tricky if we want an exact match. Anyway, update the
comment to make that clear.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a3efb2f0 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: fix the comment on lock_extent_buffer_for_io

The return value of that function is completely wrong.

That function only returns 0 if the extent buffer doesn't need to be
submitted. The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".

And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.

While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.

The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.

So the comment is completely wrong.

Since we're here, also change the description a little. The write bio
flushing won't be visible to the caller, thus it's not an major feature.
In the main description, only describe the locking part to make the
point more clear.

For reference, added in commit 2e3c25136adf ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 196d59ab 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: switch extent buffer tree lock to rw_semaphore

Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree. For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon. At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.

However now the opposite is the case. There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order. This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases. The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.

The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.

So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything. This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions. The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree. However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):

dbench 200 baseline
Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms

dbench 200 with patch
Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms

Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there's not enough
tasks to really stress the locking

fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16

baseline
Average Files/sec: 160166.7
p50 Files/sec: 165832
p90 Files/sec: 123886
p99 Files/sec: 123495

real 3m26.527s
user 2m19.223s
sys 48m21.856s

patched
Average Files/sec: 164135.7
p50 Files/sec: 171095
p90 Files/sec: 122889
p99 Files/sec: 113819

real 3m29.660s
user 2m19.990s
sys 44m12.259s

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 949b3273 15-Sep-2020 Goldwyn Rodrigues <rgoldwyn@suse.de>

btrfs: use iosize while reading compressed pages

While using compression, a submitted bio is mapped with a compressed bio
which performs the read from disk, decompresses and returns uncompressed
data to original bio. The original bio must reflect the uncompressed
size (iosize) of the I/O to be performed, or else the page just gets the
decompressed I/O length of data (disk_io_size). The compressed bio
checks the extent map and gets the correct length while performing the
I/O from disk.

This came up in subpage work when only compressed length of the original
bio was filled in the page. This worked correctly for pagesize ==
sectorsize because both compressed and uncompressed data are at pagesize
boundaries, and would end up filling the requested page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 905eb88b 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove struct extent_io_ops

It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.

Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1b36294a 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: call submit_bio_hook directly for metadata pages

No need to go through a function pointer indirection simply call
submit_bio_hook directly by exporting and renaming the helper to
btrfs_submit_metadata_bio. This makes the code more readable and should
result in somewhat faster code due to no longer paying the price for
specualtive attack mitigations that come with indirect function calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 908930f3 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: stop calling submit_bio_hook for data inodes

Instead export and rename the function to btrfs_submit_data_bio and
call it directly in submit_one_bio. This avoids paying the cost for
speculative attacks mitigations and improves code readability.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# be17b3af 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: don't opencode is_data_inode in end_bio_extent_readpage

Use the is_data_inode helper.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cd053744 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: call submit_bio_hook directly in submit_one_bio

BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
- ordinary data inodes (including the freespace inode) and the btree
inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
never be called from submit_one_bio so just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9a446d6a 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: replace readpage_end_io_hook with direct calls

Don't call readpage_end_io_hook for the btree inode. Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.

This is an improvement in 2 directions:

1. We aren't paying the penalty of indirect calls in a post-speculation
attacks world.

2. The function is now named more explicitly so it's obvious what's
going on

This is in preparation to removing struct extent_io_ops altogether.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0f208812 14-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: open code extent_read_full_page to its sole caller

This makes reading the code a tad easier by decreasing the level of
indirection by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fd513000 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: sink mirror_num argument in __do_readpage

It's always set to 0 by the 2 callers so move it inside __do_readpage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6f15af60 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: sink read_flags argument into extent_read_full_page

It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 003c286a 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: sink mirror_num argument in extent_read_full_page

It's always set to 0 from the sole caller - btrfs_readpage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c1be9c1a 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: promote extent_read_full_page to btrfs_readpage

Now that btrfs_readpage is the only caller of extent_read_full_page the
latter can be open coded in the former. Use the occassion to rename
__extent_read_full_page to extent_read_full_page. To facillitate this
change submit_one_bio has to be exported as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 72cffee4 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove mirror_num argument from extent_read_full_page

It's called only from btrfs_readpage which always passes 0 so just sink
the argument into extent_read_full_page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1a5ee1e6 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove btrfs_get_extent indirection from __do_readpage

Now that this function is only responsible for reading data pages it's
no longer necessary to pass get_extent_t parameter across several
layers of functions. This patch removes this parameter from multiple
functions: __get_extent_map/__do_readpage/__extent_read_full_page/
extent_read_full_page and simply calls btrfs_get_extent directly in
__get_extent_map.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0420177c 13-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: simplify metadata pages reading

Metadata pages currently use __do_readpage to read metadata pages,
unfortunately this function is also used to deal with ordinary data
pages. This makes the metadata pages reading code to go through multiple
hoops in order to adhere to __do_readpage invariants. Most of these are
necessary for data pages which could be compressed. For metadata it's
enough to simply build a bio and submit it.

To this effect simply call submit_extent_page directly from
read_extent_buffer_pages which is the only callpath used to populate
extent_buffers with data. This in turn enables further cleanups.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# facee0a0 31-Aug-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make extent_fiemap take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f1bbde8d 31-Aug-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make get_extent_skip_holes take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6fee248d 31-Aug-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: convert btrfs_inode_sectorsize to take btrfs_inode

It's counterintuitive to have a function named btrfs_inode_xxx which
takes a generic inode. Also move the function to btrfs_inode.h so that
it has access to the definition of struct btrfs_inode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 329ced79 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed

Nested locking with lockdep and everything else refers to lock hierarchy
within the same lock map. This is how we indicate the same locks for
different objects are ok to take in a specific order, for our use case
that would be to take the lock on a leaf and then take a lock on an
adjacent leaf.

What ->lock_nested _actually_ refers to is if we happen to already be
holding the write lock on the extent buffer and we're allowing a read
lock to be taken on that extent buffer, which is recursion. Rename this
so we don't get confused when we switch to a rwsem and have to start
using the _nested helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f98b6215 19-Aug-2020 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: do extra check for extent buffer read write functions

Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:

- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.

- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.

- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().

- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.

To address above problems, this patch creates a new common function to
check such access, check_eb_range().

- Add overflow check
This function checks start, start + len against eb->len and overflow
check.

- Unified checks

- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.

- Exit ASAP if check fails
No more possible memory corruption.

- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 260db43c 04-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

btrfs: delete duplicated words + other fixes in comments

Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a48b73ec 10-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: fix potential deadlock in the search ioctl

With the conversion of the tree locks to rwsem I got the following
lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
------------------------------------------------------
compsize/11122 is trying to acquire lock:
ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90

but task is already holding lock:
ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (btrfs-fs-00){++++}-{3:3}:
down_write_nested+0x3b/0x70
__btrfs_tree_lock+0x24/0x120
btrfs_search_slot+0x756/0x990
btrfs_lookup_inode+0x3a/0xb4
__btrfs_update_delayed_inode+0x93/0x270
btrfs_async_run_delayed_root+0x168/0x230
btrfs_work_helper+0xd4/0x570
process_one_work+0x2ad/0x5f0
worker_thread+0x3a/0x3d0
kthread+0x133/0x150
ret_from_fork+0x1f/0x30

-> #1 (&delayed_node->mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_delayed_update_inode+0x50/0x440
btrfs_update_inode+0x8a/0xf0
btrfs_dirty_inode+0x5b/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&mm->mmap_lock#2){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
&mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(btrfs-fs-00);
lock(&delayed_node->mutex);
lock(btrfs-fs-00);
lock(&mm->mmap_lock#2);

*** DEADLOCK ***

1 lock held by compsize/11122:
#0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

stack backtrace:
CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __might_fault+0x3e/0x90
? find_held_lock+0x72/0x90
__might_fault+0x68/0x90
? __might_fault+0x3e/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
? btrfs_search_forward+0x2a6/0x360
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
? __do_sys_newfstat+0x5a/0x70
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5e548b32 21-Jul-2020 Filipe Manana <fdmanana@suse.com>

btrfs: do not set the full sync flag on the inode during page release

When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.

This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.

We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.

This patch is part of a patchset that consists of 3 patches, which have
the following subjects:

1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release

Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.

The following script, using fio, was used to perform the tests:

$ cat test-fsync.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"

if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi

NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3

cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF

echo "Using config:"
echo
cat /tmp/fio-job.ini
echo

mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT

The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.

The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).

*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec

After patchset:

WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)

*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec

After patchset:

WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)

*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec

After patchset:

WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)

*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec

After patchset:

WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)

*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec

After patchset:

WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)

*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec

After patchset:

WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)

*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************

Before patchset:

WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec

After patchset:

WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fbc2bd7e 21-Jul-2020 Filipe Manana <fdmanana@suse.com>

btrfs: release old extent maps during page release

When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we never release an extent
map that is in the list of modified extents. This is to prevent races with
a concurrent fsync using the fast path, which could lead to not logging an
extent created in the current transaction.

However we can safely remove an extent map created in a past transaction
that is still in the list of modified extents (because no one fsynced yet
the inode after that transaction got commited), because such extents are
skipped during an fsync as it is pointless to log them. This change does
that.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d6448e6 21-Jul-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between page release and a fast fsync

When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:

1) A page is dirtied for some inode I;

2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;

3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;

4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;

5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:

6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;

7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;

8) The fsync finishes;

9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().

So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.

Fixes: ff44c6e36dc9dc ("Btrfs: do not hold the write_lock on the extent tree while logging")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fbabd4a3 21-Jul-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases

Eric reported seeing this message while running generic/475

BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted

Full stack trace:

BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS: Transaction aborted (error -117)
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
Call Trace:
? finish_wait+0x90/0x90
? __mutex_unlock_slowpath+0x45/0x2a0
? lock_acquire+0xa3/0x440
? lockref_put_or_lock+0x9/0x30
? dput+0x20/0x4a0
? dput+0x20/0x4a0
? do_raw_spin_unlock+0x4b/0xc0
? _raw_spin_unlock+0x1f/0x30
btrfs_sync_file+0x335/0x490 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x50/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6a6ef1b6e3
Code: Bad RIP value.
RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace af146e0e38433456 ]---
BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted

This ret came from btrfs_write_marked_extents(). If we get an aborted
transaction via EIO before, we'll see it in btree_write_cache_pages()
and return EUCLEAN, which gets printed as "Filesystem corrupted".

Except we shouldn't be returning EUCLEAN here, we need to be returning
EROFS because EUCLEAN is reserved for actual corruption, not IO errors.

We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
elsewhere, but we want to use EROFS for this particular case. The
original transaction abort has the real error code for why we ended up
with an aborted transaction, all subsequent actions just need to return
EROFS because they may not have a trans handle and have no idea about
the original cause of the abort.

After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
stacktrace will not be dumped either.

Reported-by: Eric Sandeen <esandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add full test stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>


# b69d1ee9 16-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove done label in writepage_delalloc

Since there is not common cleanup run after the label it makes it
somewhat redundant.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3526302f 02-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: streamline btrfs_get_io_failure_record logic

Make the function directly return a pointer to a failure record and
adjust callers to handle it. Also refactor the logic inside so that
the case which allocates the failure record for the first time is not
handled in an 'if' arm, saving us a level of indentation. Finally make
the function static as it's not used outside of extent_io.c .

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2279a270 02-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make get_state_failrec return failrec directly

Only failure that get_state_failrec can get is if there is no failure
for the given address. There is no reason why the function should return
a status code and use a separate parameter for returning the actual
failure rec (if one is found). Simplify it by making the return type
a pointer and return ERR_PTR value in case of errors.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cd4c0bf94 05-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make writepage_delalloc take btrfs_inode

Only find_lock_delalloc_range uses vfs_inode so let's take the
btrfs_inode as a parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d4580fe2 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make __extent_writepage_io take btrfs_inode

It has only a single use for a generic vfs inode vs 3 for btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 98456b9c 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_run_delalloc_range take btrfs_inode

All children now take btrfs_inode so convert it to taking it as a
parameter as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bab16e21 23-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: don't use UAPI types for fiemap callback

The fiemap callback is not part of UAPI interface and the prototypes
don't have the __u64 types either.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ad7ff17b 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make extent_clear_unlock_delalloc take btrfs_inode

It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
interface will allow seamless conversion of its callers.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5909ca11 19-Jul-2020 Robbie Ko <robbieko@synology.com>

btrfs: fix page leaks after failure to lock page for delalloc

When locking pages for delalloc, we check if it's dirty and mapping still
matches. If it does not match, we need to return -EAGAIN and release all
pages. Only the current page was put though, iterate over all the
remaining pages too.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6bf9cd2e 17-Jun-2020 Boris Burkov <boris@bur.io>

btrfs: fix fatal extent_buffer readahead vs releasepage race

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.

Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f47eb54 08-May-2020 Paul E. McKenney <paulmck@kernel.org>

fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls

Very large I/Os can cause the following RCU CPU stall warning:

RIP: 0010:rb_prev+0x8/0x50
Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c =
89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48
RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13
RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0
RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630
RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238
R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001
R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0
__lookup_extent_mapping+0xa0/0x110
try_release_extent_mapping+0xdc/0x220
btrfs_releasepage+0x45/0x70
shrink_page_list+0xa39/0xb30
shrink_inactive_list+0x18f/0x3b0
shrink_lruvec+0x38e/0x6b0
shrink_node+0x14d/0x690
do_try_to_free_pages+0xc6/0x3e0
try_to_free_mem_cgroup_pages+0xe6/0x1e0
reclaim_high.constprop.73+0x87/0xc0
mem_cgroup_handle_over_high+0x66/0x150
exit_to_usermode_loop+0x82/0xd0
do_syscall_64+0xd4/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9

On a PREEMPT=n kernel, the try_release_extent_mapping() function's
"while" loop might run for a very long time on a large I/O. This commit
therefore adds a cond_resched() to this loop, providing RCU any needed
quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d1b89bc0 01-Jun-2020 Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

btrfs: use attach/detach_page_private

Since the new pair function is introduced, we can call them to clean the
code in btrfs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: http://lkml.kernel.org/r/20200517214718.468-4-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba206a02 01-Jun-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: convert from readpages to readahead

Implement the new readahead method in btrfs using the new
readahead_page_batch() function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-18-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c60ac0ff 29-Apr-2020 David Sterba <dsterba@suse.com>

btrfs: drop unnecessary offset_in_page in extent buffer helpers

Helpers that iterate over extent buffer pages set up several variables,
one of them is finding out offset of the extent buffer start within a
page. Right now we have extent buffers aligned to page sizes so this is
effectively storing zero. This makes the code harder the follow and can
be simplified.

The same change is done in all the helpers:

* remove: size_t start_offset = offset_in_page(eb->start);
* simplify code using start_offset

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b48966a 28-Apr-2020 David Sterba <dsterba@suse.com>

btrfs: constify extent_buffer in the API functions

There are many helpers around extent buffers, found in extent_io.h and
ctree.h. Most of them can be converted to take constified eb as there
are no changes to the extent buffer structure itself but rather the
pages.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# db3756c8 29-Apr-2020 David Sterba <dsterba@suse.com>

btrfs: remove unused map_private_extent_buffer

All uses of map_private_extent_buffer have been replaced by more
effective way. The set/get helpers have their own bounds checker.
The function name was confusing since the non-private helper was removed
in a65917156e34 ("Btrfs: stop using highmem for extent_buffers") many
years ago.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 77d5d689 16-Apr-2020 Omar Sandoval <osandov@fb.com>

btrfs: unify buffered and direct I/O read repair

Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fd9d6670 16-Apr-2020 Omar Sandoval <osandov@fb.com>

btrfs: simplify direct I/O read repair

Direct I/O read repair was originally implemented in commit 8b110e393c5a
("Btrfs: implement repair function when direct read fails"). This
implementation is unnecessarily complicated. There is major code
duplication between __btrfs_subio_endio_read() (checks checksums and
handles I/O errors for files with checksums),
__btrfs_correct_data_nocsum() (handles I/O errors for files without
checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
for retries of files with checksums), and btrfs_retry_endio_nocsum()
(handles I/O errors for retries of files without checksum). If it sounds
like these should be one function, that's because they should.
Additionally, these functions are very hard to follow due to their
excessive use of goto.

This commit replaces the original implementation. After the previous
commit getting rid of orig_bio, we can reuse the same endio callback for
repair I/O and the original I/O, we just need to track the file offset
and original iterator in the repair bio. We can also unify the handling
of files with and without checksums and simplify the control flow. We
also no longer have to wait for each repair I/O to complete one by one.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ce06d3ec 16-Apr-2020 Omar Sandoval <osandov@fb.com>

btrfs: make btrfs_check_repairable() static

Since its introduction in commit 2fe6303e7cd0 ("Btrfs: split
bio_readpage_error into several functions"), btrfs_check_repairable()
has only been used from extent_io.c where it is defined.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f337bd74 16-Apr-2020 Omar Sandoval <osandov@fb.com>

btrfs: don't do repair validation for checksum errors

The purpose of the validation step is to distinguish between good and
bad sectors in a failed multi-sector read. If a multi-sector read
succeeded but some of those sectors had checksum errors, we don't need
to validate anything; we know the sectors with bad checksums need to be
repaired.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c7333972 16-Apr-2020 Omar Sandoval <osandov@fb.com>

btrfs: look at full bi_io_vec for repair decision

Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.

This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:

failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
if (failed_bio_pages > 1)
do validation

However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.

Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).

Fixes: 8a2ee44a371c ("btrfs: look at bi_size for repair decisions")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8c38938c 14-Feb-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: move the root freeing stuff into btrfs_put_root

There are a few different ways to free roots, either you allocated them
yourself and you just do

free_extent_buffer(root->node);
free_extent_buffer(root->commit_node);
btrfs_put_root(root);

Which is the pattern for log roots. Or for snapshots/subvolumes that
are being dropped you simply call btrfs_free_fs_root() which does all
the cleanup for you.

Unify this all into btrfs_put_root(), so that we don't free up things
associated with the root until the last reference is dropped. This
makes the root freeing code much more significant.

The only caveat is at close_ctree() time we have to free the extent
buffers for all of our main roots (extent_root, chunk_root, etc) because
we have to drop the btree_inode and we'll run into issues if we hold
onto those nodes until ->kill_sb() time. This will be addressed in the
future when we kill the btree_inode.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3fd63727 14-Feb-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: make the extent buffer leak check per fs info

I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b3ff8f1d 11-Feb-2020 Qu Wenruo <wqu@suse.com>

btrfs: Don't submit any btree write bio if the fs has errors

[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.

BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922

CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[CAUSE]
The fuzzed image has a completely screwd up extent tree:

leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5

Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.

In short, the bug is caused by:

- Existing tree block gets allocated to log tree
This got its generation bumped.

- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.

- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.

- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.

- Filesystem unmount
All pages are assumed to be are clean, destroying all workqueue, then
call iput(btree_inode).
But offending tree block is still dirty, which triggers writeback, and
causes use-after-free bug.

The detailed sequence looks like this:

- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_iot_tree.

- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred by extent
tree.

- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages

- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.

- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() get called.

eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages

- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.

eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_iot_tree.

- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.

eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.

At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.

- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is not tracked by neither of them, it's
still dirty.

eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.

- Filesystem unmount
Since all cleanup is assumed to be done, all workqueus are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause use-after-free.

This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.

[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error. This is the last safe net, just in case other cleanup
haven't caught catch it.

Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5ce48d0f 23-Feb-2020 Jules Irenge <jbi.octave@gmail.com>

btrfs: Add missing lock annotation for release_extent_buffer()

Sparse reports a warning at release_extent_buffer()
warning: context imbalance in release_extent_buffer() - unexpected unlock

The root cause is the missing annotation at release_extent_buffer()
Add the missing __releases(&eb->refs_lock) annotation

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55ffaabe 13-Feb-2020 Filipe Manana <fdmanana@suse.com>

Btrfs: avoid unnecessary splits when setting bits on an extent io tree

When attempting to set bits on a range of an exent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.

This happens specifically at the following case at __set_extent_bit():

$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;

So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.

The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.

This change just skips the splitting of an extent state record when it
already has all the bits the we need to set.

Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.

This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:

"btrfs: introduce the inode->file_extent_tree"

This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end setting up an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.

Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:

1) A buffered write against file range [512KiB, 768KiB) is made. At this
point delalloc was not flushed yet;

2) Deduplication from some other inode into this inode's range
[128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
to be called, from btrfs_insert_clone_extent(), to mark the range
[128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;

3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
end up at __set_extent_bit(). In the first iteration of that function's
loop we end up in the following branch:

$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
(...)
1089 goto search_again;

This splits the state record into two, one for range [0, 128KiB) and
another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
bit set. Then we jump to the 'search_again' label, where we unlock the
the spinlock protecting the extent io tree before jumping to the
'again' label to perform the next iteration;

4) In the meanwhile, delalloc is flushed, the ordered extent for the range
[512KiB, 768KiB) is created and when it completes, at
btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
with a value of 0 for its 'new_size' argument;

5) Before the deduplication task currently at __set_extent_bit() moves to
the next iteration, the task finishing the ordered extent calls
find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
and gets 'start' set to 0 and 'end' set to 128KiB - because at this
moment the io tree has two extent state records, one representing the
range [0, 128KiB) and another representing the range [128KiB, 1MiB),
both with EXTENT_DIRTY set. Then we set 'isize' to:

isize = min(isize, end + 1)
= min(1MiB, 128KiB - 1 + 1)
= 128KiB

Then we set the inode's disk_i_size to 128KiB (isize).

After a clean unmount of the filesystem and mounting it again, we have
the file with a size of 128KiB, and effectively lost all the data it
had before in the range from 128KiB to 1MiB.

This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f657a31c 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: sink argument tree to __do_readpage

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b6660e80 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: sink arugment tree to contiguous_readpages

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d44fea7 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: sink argument tree to __extent_read_full_page

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 71ad38b4 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: sink argument tree to extent_read_full_page

The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b272ae22 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range

The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ae6957eb 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: add assertions for tree == inode->io_tree to extent IO helpers

Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0ceb34bf 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: drop argument tree from submit_extent_page

Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 45b08405 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: remove extent_page_data::tree

All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 41a2ee75 17-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce per-inode file extent tree

In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.

Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5ab58055 21-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: drop the -EBUSY case in __extent_writepage_io

Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
do not need this -EBUSY case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5750c375 27-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Correctly handle empty trees in find_first_clear_extent_bit

Raviu reported that running his regular fs_trim segfaulted with the
following backtrace:

[ 237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
[ 237.525984] ------------[ cut here ]------------
[ 237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
[ 237.525992] invalid opcode: 0000 [#1] SMP PTI
[ 237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G U OE 5.4.14-8-vanilla #1
[ 237.526001] Hardware name: ASUSTeK COMPUTER INC.
[ 237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
[ 237.526079] Call Trace:
[ 237.526120] find_first_clear_extent_bit+0x13d/0x150 [btrfs]
[ 237.526148] btrfs_trim_fs+0x211/0x3f0 [btrfs]
[ 237.526184] btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
[ 237.526219] btrfs_ioctl+0x129a/0x2ed0 [btrfs]
[ 237.526227] ? filemap_map_pages+0x190/0x3d0
[ 237.526232] ? do_filp_open+0xaf/0x110
[ 237.526238] ? _copy_to_user+0x22/0x30
[ 237.526242] ? cp_new_stat+0x150/0x180
[ 237.526247] ? do_vfs_ioctl+0xa4/0x640
[ 237.526278] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 237.526283] do_vfs_ioctl+0xa4/0x640
[ 237.526288] ? __do_sys_newfstat+0x3c/0x60
[ 237.526292] ksys_ioctl+0x70/0x80
[ 237.526297] __x64_sys_ioctl+0x16/0x20
[ 237.526303] do_syscall_64+0x5a/0x1c0
[ 237.526310] entry_SYSCALL_64_after_hwframe+0x49/0xbe

That was due to btrfs_fs_device::aloc_tree being empty. Initially I
thought this wasn't possible and as a percaution have put the assert in
find_first_clear_extent_bit. Turns out this is indeed possible and could
happen when a file system with SINGLE data/metadata profile has a 2nd
device added. Until balance is run or a new chunk is allocated on this
device it will be completely empty.

In this case find_first_clear_extent_bit should return the full range
[0, -1ULL] and let the caller handle this i.e for trim the end will be
capped at the size of actual device.

Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
Fixes: 45bfcfc168f8 ("btrfs: Implement find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 42ffb0bf 23-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: flush write bio if we loop in extent_write_cache_pages

There exists a deadlock with range_cyclic that has existed forever. If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out. The task traces are as follows

PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
#0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
#1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
#2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
#3 [ffffc900297bb708] __lock_page at ffffffff811f145b
#4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
#5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
#6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
#7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd

PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
"aio-dio-invalid"
#0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
#1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
#2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
#3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
#4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
#5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
#6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
#7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032

I used drgn to find the respective pages we were stuck on

page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874

As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148. aio-dio-invalid has 7680, and the kworker epd looks like the
following

crash> struct extent_page_data ffffc900297bbbb0
struct extent_page_data {
bio = 0xffff889f747ed830,
tree = 0xffff889eed6ba448,
extent_locked = 0,
sync_io = 0
}

Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).

Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on

bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
for i in range(0, bio.bi_vcnt.value_()):
bv = bio.bi_io_vec[i]
if bv.bv_page.value_() == 0xffffea00fbfc7500:
print("FOUND IT")

which validated what I suspected.

The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.

Fixes: b293f02e1423 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 556755a8 03-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: fix improper setting of scanned for range cyclic write cache pages

We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing. It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.

We unconditionally were setting scanned = 1; the first time we found any
pages in the inode. This isn't actually what we want, we want it to be
set if we've scanned the entire file. For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.

Fix this by not setting scanned = 1 if we find pages. The rules for
setting scanned should be

1) !range_cyclic. In this case we have a specified range to write out.
2) range_cyclic && index == 0. In this case we've started at the
beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
the file without satisfying our nr_to_write.

This patch fixes both of our writepages implementations to make sure
these rules hold true. This fixed our over zealous CG OOMs in
production.

Fixes: d1310b2e0cd9 ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>


# c8b04030 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: simplify compressed/inline check in __extent_writepage_io()

Commit 7087a9d8db88 ("btrfs: Remove
extent_io_ops::writepage_end_io_hook") left this logic in a confusing
state. Simplify it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 39b07b5d 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: drop create parameter to btrfs_get_extent()

We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.

While we're here, let's document btrfs_get_extent().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f95d713b 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: remove redundant i_size check in __extent_writepage_io()

In __extent_writepage_io(), we check whether

i_size <= page_offset(page).

Note that if i_size < page_offset(page), then
i_size >> PAGE_SHIFT < page->index.

If i_size == page_offset(page), then
i_size >> PAGE_SHIFT == page->index && offset_in_page(i_size) == 0.

__extent_writepage() already has a check for these cases that
returns without calling __extent_writepage_io():

end_index = i_size >> PAGE_SHIFT
pg_offset = offset_in_page(i_size);
if (page->index > end_index ||
(page->index == end_index && !pg_offset)) {
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
unlock_page(page);
return 0;
}

Get rid of the one in __extent_writepage_io(), which was obsoleted in
211c17f51f46 ("Fix corners in writepage and btrfs_truncate_page").

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 169d2c87 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: remove trivial goto label in __extent_writepage()

Since 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack
usage"), done_unlocked is simply a return 0. Get rid of it.
Mid-statement block returns don seem to make the code less readable here.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb70d222 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: remove unnecessary pg_offset assignments in __extent_writepage()

We're initializing pg_offset to 0, setting it immediately, then
reassigning it to 0 again after. The former became unnecessary in
211c17f51f46 ("Fix corners in writepage and btrfs_truncate_page"). The
latter is a leftover that should've been removed in 40f765805f08
("Btrfs: split up __extent_writepage to lower stack usage"). Remove
both.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b6293c82 03-Dec-2019 Dan Carpenter <dan.carpenter@oracle.com>

btrfs: return error pointer from alloc_test_extent_buffer

Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.

Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fa17ed06 03-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: drop bdev argument from submit_extent_page

After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# a019e9e1 30-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: remove extent_map::bdev

We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.

Signed-off-by: David Sterba <dsterba@suse.com>


# 1a418027 30-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: drop bio_set_dev where not needed

bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.

Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.

In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.

Where do we know which bdev to set:

* for regular IO: submit_stripe_bio that's called by btrfs_map_bio

* repair IO: repair_io_failure, read or write from specific device

* super block write (using buffer_heads but uses raw bdev) and barriers

* scrub: this does not use all regular IO paths as it needs to reach all
copies, verify and fixup eventually, and for that all bdev management
is independent

* raid56: rbio_add_io_page, for the RMW write

* integrity-checker: does it's own low-level block tracking

Signed-off-by: David Sterba <dsterba@suse.com>


# 429aebc0 18-Nov-2019 David Sterba <dsterba@suse.com>

btrfs: get bdev directly from fs_devices in submit_extent_page

This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio

wbc_init_bio
bio_associate_blkg_from_css
dereference bdev->bi_disk->queue

The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.

Signed-off-by: David Sterba <dsterba@suse.com>


# 57e5ffeb 29-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: sink write_flags to __extent_writepage_io

__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7bddf1e 03-Oct-2019 Tejun Heo <tj@kernel.org>

btrfs: Avoid getting stuck during cyclic writebacks

During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over. However,
instead of current index + 1, it gets to to the current index itself.

Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.

1. There is a single file which has accumulated enough dirty pages to
trigger balance_dirty_pages() and the writer appending to the file
with a series of short writes.

2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.

3. Writeback kicks in and the cursor is on the last page of the dirty
file. Writeback is started or skipped if already in progress. As
it's EOF, extent_write_cache_pages() returns and the cursor is set
to done_index which is pointing to the last page.

4. Writeback is done. Nothing happens till balance_dirty_pages
finishes, at which point we go back to #1.

This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system. We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.

Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed. However, this concern seems bogus.

On each page, the code either:

* Skips and moves onto the next page.

* Fails issue and sets done_index to index + 1.

* Successfully issues and continue to the next page if budget allows
and not EOF.

IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page. Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related. Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.

Fix it by always setting done_index past the current page being
processed.

Note that this problem exists in other writepages too.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dbb70bec 10-Jul-2019 Chris Mason <clm@fb.com>

Btrfs: extent_write_locked_range() should attach inode->i_wb

extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ec39f769 10-Jul-2019 Chris Mason <clm@fb.com>

Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios

Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.

This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.

For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.

For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1d53c9e6 10-Jul-2019 Chris Mason <clm@fb.com>

Btrfs: only associate the locked page with one async_chunk struct

The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.

writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().

The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.

But, this is racey. If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:

compress_file_range(chunkA)
submit_compress_extents(chunkA)
submit compressed bios(chunkA)
put_page(locked_page)

compress_file_range(chunkB)
...

Or:

async_cow_submit
submit_compressed_extents <--- falls back to buffered writeout
cow_file_range
extent_clear_unlock_delalloc
__process_pages_contig
put_page(locked_pages)

async_cow_submit

The end result is that chunkA is completed and cleaned up before chunkB
even starts processing. This means we can free locked_page() and reuse
it elsewhere. If we get really lucky, it'll have the same page->index
in its new home as it did before.

While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on chunkB->locked_page.

Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. What can happen, it can go
through the fixup worker and the whole COW machinery again:

in submit_compressed_extents():
while (async extents) {
...
cow_file_range
if (!page_started ...)
extent_write_locked_range
else if (...)
unlock_page
continue;

This hasn't been observed in practice but is still possible.

With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: percpu_counter_add_batch+0x11/0x70
PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 16 PID: 2172 Comm: rm Not tainted
RIP: 0010:percpu_counter_add_batch+0x11/0x70
RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
account_page_cleaned+0x15b/0x1f0
__cancel_dirty_page+0x146/0x200
truncate_cleanup_page+0x92/0xb0
truncate_inode_pages_range+0x202/0x7d0
btrfs_evict_inode+0x92/0x5a0
evict+0xc1/0x190
do_unlinkat+0x176/0x280
do_syscall_64+0x63/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

The fix here is to make asyc_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.

Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>


# b3f167aa 23-Sep-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move the failrec tree stuff into extent-io-tree.h

This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 083e75e7 23-Sep-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: export find_delalloc_range

This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9c7d3a54 23-Sep-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move extent_io_tree defs to their own header

extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6f0d04f8 23-Sep-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: separate out the extent io init function

We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33ca832f 23-Sep-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: separate out the extent leak code

We check both extent buffer and extent state leaks in the same function,
separate these two functions out so we can move them around.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0607eb1d 11-Sep-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix missing error return if writeback for extent buffer never started

If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb5b64f1 13-Sep-2019 Dennis Zhou <dennis@kernel.org>

btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer

Before, if a eb failed to write out, we would end up triggering a
BUG_ON(). As of f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in
flush_write_bio() one level up"), we no longer BUG_ON(), so we should
make life consistent and add back the unwritten bytes to
dirty_metadata_bytes.

Fixes: f4340622e022 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>


# 18dfa711 11-Sep-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix unwritten extent buffers and hangs on future writeback attempts

The lock_extent_buffer_io() returns 1 to the caller to tell it everything
went fine and the callers needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.

When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).

Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such hang is reported with a trace like the following:

[49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
[49887.347059] Not tainted 5.2.13-gentoo #2
[49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49887.347062] btrfs-transacti D 0 1752 2 0x80004000
[49887.347064] Call Trace:
[49887.347069] ? __schedule+0x265/0x830
[49887.347071] ? bit_wait+0x50/0x50
[49887.347072] ? bit_wait+0x50/0x50
[49887.347074] schedule+0x24/0x90
[49887.347075] io_schedule+0x3c/0x60
[49887.347077] bit_wait_io+0x8/0x50
[49887.347079] __wait_on_bit+0x6c/0x80
[49887.347081] ? __lock_release.isra.29+0x155/0x2d0
[49887.347083] out_of_line_wait_on_bit+0x7b/0x80
[49887.347084] ? var_wake_function+0x20/0x20
[49887.347087] lock_extent_buffer_for_io+0x28c/0x390
[49887.347089] btree_write_cache_pages+0x18e/0x340
[49887.347091] do_writepages+0x29/0xb0
[49887.347093] ? kmem_cache_free+0x132/0x160
[49887.347095] ? convert_extent_bit+0x544/0x680
[49887.347097] filemap_fdatawrite_range+0x70/0x90
[49887.347099] btrfs_write_marked_extents+0x53/0x120
[49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
[49887.347102] btrfs_commit_transaction+0x6bb/0x990
[49887.347103] ? start_transaction+0x33e/0x500
[49887.347105] transaction_kthread+0x139/0x15c

So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).

This is a regression introduced in the 5.2 kernel.

Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e182163d 15-Aug-2019 Omar Sandoval <osandov@fb.com>

btrfs: stop clearing EXTENT_DIRTY in inode I/O tree

Since commit fee187d9d9dd ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 74e9194a 17-Jul-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove delalloc_end argument from extent_clear_unlock_delalloc

It was added in ba8b04c1d4ad ("btrfs: extend btrfs_set_extent_delalloc
and its friends to support in-band dedupe and subpage size patchset") as
a preparatory patch for in-band and subapge block size patchsets.
However neither of those are likely to be merged anytime soon and the
code has diverged significantly from the last public post of either
of those patchsets.

It's unlikely either of the patchests are going to use those preparatory
steps so just remove the variables. Since cow_file_range also took
delalloc_end to pass it to extent_clear_unlock_delalloc remove the
parameter from that function as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 34e51a5e 27-Jun-2019 Tejun Heo <tj@kernel.org>

blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()

wbc_account_io() does a very specific job - try to see which cgroup is
actually dirtying an inode and transfer its ownership to the majority
dirtier if needed. The name is too generic and confusing. Let's
rename it to something more specific.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e02d48ea 05-Jul-2019 Colin Ian King <colin.king@canonical.com>

btrfs: fix memory leak of path on error return path

Currently if the allocation of roots or tmp_ulist fails the error handling
does not free up the allocation of path causing a memory leak. Fix this and
other similar leaks by moving the call of btrfs_free_path from label out
to label out_free_ulist.

Kudos to David Sterba for spotting the issue in my original fix and suggesting
the correct way to fix the leak and Anand Jain for spotting a double free
issue.

Addresses-Coverity: ("Resource leak")
Fixes: 5911c8fe05c5 ("btrfs: fiemap: preallocate ulists for btrfs_check_shared")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9978059b 21-Jun-2019 Goldwyn Rodrigues <rgoldwyn@suse.de>

btrfs: Evaluate io_tree in find_lock_delalloc_range()

Simplification. No point passing the tree variable when it can be
evaluated from inode. The tests now use the io_tree from btrfs_inode as
opposed to creating one.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e749af44 18-Jun-2019 David Sterba <dsterba@suse.com>

btrfs: lift bio_set_dev from bio allocation helpers

The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.

Signed-off-by: David Sterba <dsterba@suse.com>


# 2792237d 18-Jun-2019 David Sterba <dsterba@suse.com>

btrfs: use common helpers for extent IO state insertion messages

Print the error messages using the helpers that also print the
filesystem identification.

Signed-off-by: David Sterba <dsterba@suse.com>


# 00801ae4 02-May-2019 David Sterba <dsterba@suse.com>

btrfs: switch extent_buffer write_locks from atomic to int

The write_locks is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f3dc24c5 02-May-2019 David Sterba <dsterba@suse.com>

btrfs: switch extent_buffer spinning_writers from atomic to int

The spinning_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Signed-off-by: David Sterba <dsterba@suse.com>


# 06297d8c 02-May-2019 David Sterba <dsterba@suse.com>

btrfs: switch extent_buffer blocking_writers from atomic to int

The blocking_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8666e638 05-Jun-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Document __etree_search

The function has a lot of return values and specific conventions making
it cumbersome to understand what's returned. Have a go at documenting
its parameters and return values.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1eaebb34 03-Jun-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Don't trim returned range based on input value in find_first_clear_extent_bit

Currently find_first_clear_extent_bit always returns a range whose
starting value is >= passed 'start'. This implicit trimming behavior is
somewhat subtle and an implementation detail.

Instead, this patch modifies the function such that now it always
returns the range which contains passed 'start' and has the given bits
unset. This range could either be due to presence of existing records
which contains 'start' but have the bits unset or because there are no
records that contain the given starting offset.

This patch also adds test cases which cover find_first_clear_extent_bit
since they were missing up until now.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 23d31bd4 07-May-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Use newly introduced btrfs_lock_and_flush_ordered_range

There several functions which open code
btrfs_lock_and_flush_ordered_range, just replace them with a call to the
function. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1200b51f 09-May-2019 Qu Wenruo <wqu@suse.com>

btrfs: remove the incorrect comment on RO fs when btrfs_run_delalloc_range() fails

At the context of btrfs_run_delalloc_range(), we haven't started/joined
a transaction, thus even something went wrong, we can't and won't abort
transaction, thus no way to make the fs RO.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5911c8fe 15-May-2019 David Sterba <dsterba@suse.com>

btrfs: fiemap: preallocate ulists for btrfs_check_shared

btrfs_check_shared looks up parents of a given extent and uses ulists
for that. These are allocated and freed repeatedly. Preallocation in the
caller will avoid the overhead and also allow us to use the GFP_KERNEL
as it is happens before the extent locks are taken.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b070cfe 25-Apr-2019 Christoph Hellwig <hch@lst.de>

block: remove the i argument to bio_for_each_segment_all

We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5c5aff98 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter fs_info from emit_last_fiemap_cache

Signed-off-by: David Sterba <dsterba@suse.com>


# 50489a57 10-Apr-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove bio_offset argument from submit_bio_hook

None of the implementers of the submit_bio_hook use the bio_offset
parameter, simply remove it. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2ccfbc6 10-Apr-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove 'tree' argument from read_extent_buffer_pages

This function always uses the btree inode's io_tree. Stop taking the
tree as a function argument and instead access it internally from
read_extent_buffer_pages. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 45bfcfc1 27-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Implement find_first_clear_extent_bit

This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ca73656 27-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Implement set_extent_bits_nowait

It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 41e7acd3 25-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Rename and export clear_btree_io_tree

This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8f881e8c 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in leaf_data_end

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 0ab02063 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in write_one_eb

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 20a1fbf9 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in repair_eb_io_failure

We can read fs_info from extent buffer and can drop it from the
parameters. As all callsites are updated, add the btrfs_ prefix as the
function is exported.

Signed-off-by: David Sterba <dsterba@suse.com>


# 9df76fb5 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in lock_extent_buffer_for_io

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 290342f6 25-Mar-2019 Arnd Bergmann <arnd@arndb.de>

btrfs: use BUG() instead of BUG_ON(1)

BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:

fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
BUG_ON(1);
^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
# define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
max_chunk_size);
^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
typeof(y) unique_y = (y); \
^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
BUG_ON(1);
^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
u64 max_chunk_size;
^
= 0

Change it to BUG() so clang can see that this code path can never
continue.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# d4eb671a 21-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: remove stale definition of BUFFER_LRU_MAX

Long time ago (2008), the extent buffers were organized in a LRU list
and switched to rb-tree in 6af118ce51b52ced ("Btrfs: Index extent
buffers in an rbtree"). There was one stale macro definition left.

Signed-off-by: David Sterba <dsterba@suse.com>


# a2a72fbd 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Handle errors better in extent_writepages()

We can only get <=0 from extent_write_cache_pages, add an ASSERT() for
it just in case.

Then instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.

If there is no error so far, then call flush_write_bio() and return the
result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2e3c2513 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()

This function needs some extra checks on locked pages and eb. For error
handling we need to unlock locked pages and the eb.

There is a rare >0 return value branch, where all pages get locked
while write bio is not flushed.

Thankfully it's handled by the only caller, btree_write_cache_pages(),
as later write_one_eb() call will trigger submit_one_bio(). So there
shouldn't be any problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 02c6db4f 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Handle errors better in extent_write_locked_range()

We can only get @ret <= 0. Add an ASSERT() for it just in case.

Then, instead of submitting the write bio even we got some error, check
the return value first.

If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.

If there is no error so far, then call flush_write_bio() and return the
result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e06808be 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Kill dead condition in extent_write_cache_pages()

Since __extent_writepage() will no longer return >0 value,
(ret == AOP_WRITEPAGE_ACTIVATE) will never be true.

Kill that dead branch.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b952eea 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Handle errors better in btree_write_cache_pages()

In btree_write_cache_pages(), we can only get @ret <= 0.
Add an ASSERT() for it just in case.

Then instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.

If there is no error so far, then call flush_write_bio() and return the
result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3065976b 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Handle errors better in extent_write_full_page()

Since now flush_write_bio() could return error, kill the BUG_ON() first.
Then don't call flush_write_bio() unconditionally, instead we check the
return value from __extent_writepage() first.

If __extent_writepage() fails, we do cleanup, and return error without
submitting the possible corrupted or half-baked bio.

If __extent_writepage() successes, then we call flush_write_bio() and
return the result.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f4340622 20-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up

We have a BUG_ON() in flush_write_bio() to handle the return value of
submit_one_bio().

Move the BUG_ON() one level up to all its callers.

This patch will introduce temporary variable, @flush_ret to keep code
change minimal in this patch. That variable will be cleaned up when
enhancing the error handling later.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d51f51bb 18-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused -EIO assignment in end_bio_extent_readpage

In case we hit the error case for a metadata buffer in
end_bio_extent_readpage then 'ret' won't really be checked before it's
written again to. This means the -EIO in this case will never be
checked, just remove it.

Fixes-coverity-id: 1442513
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e65ef21e 11-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Exploit the fact that pages passed to extent_readpages are always contiguous

Currently extent_readpages (called from btrfs_readpages) will always
call __extent_readpages which tries to create contiguous range of pages
and call __do_contiguous_readpages when such contiguous range is
created.

It turns out this is unnecessary due to the fact that generic MM code
always calls filesystem's ->readpages callback (btrfs_readpages in
this case) with already contiguous pages. Armed with this knowledge it's
possible to simplify extent_readpages by eliminating the call to
__extent_readpages and directly calling contiguous_readpages.

The only edge case that needs to be handled is when
add_to_page_cache_lru fails. This is easy as all that is needed is to
submit whatever is the number of pages successfully added to the lru.
This can happen when the page is already in the range, so it does not
need to be read again, and we can't do anything else in case of other
errors.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ed1b4ed7 24-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: switch extent_buffer::lock_nested to bool

The member is tracking simple status of the lock, we can use bool for
that and make some room for further space reduction in the structure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# c79adfc0 24-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: use assertion helpers for extent buffer write lock counters

Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::write_locks become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c9c799a 24-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: use assertion helpers for extent buffer read lock counters

Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::read_locks become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# afd495a8 24-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: use assertion helpers for spinning readers

Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_readers become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 843ccf9f 24-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: use assertion helpers for spinning writers

Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_writers become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8882679e 14-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove EXTENT_IOBITS

This flag just became synonymous to EXTENT_LOCKED, so just remove it and
used EXTENT_LOCKED directly. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4e586ca3 14-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove EXTENT_WRITEBACK

This flag was introduced in a52d9a8033c4 ("Btrfs: Extent based page
cache code.") and subsequently it's usage effectively was removed by
1edbb734b4e0 ("Btrfs: reduce CPU usage in the extent_state tree") and
f2a97a9dbd86 ("btrfs: remove all unused functions"). Just remove it,
no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a1d19847 28-Feb-2019 Qu Wenruo <wqu@suse.com>

btrfs: tracepoints: Add trace events for extent_io_tree

Although btrfs heavily relies on extent_io_tree, we don't really have
any good trace events for them.

This patch will add the folowing trace events:
- trace_btrfs_set_extent_bit()
- trace_btrfs_clear_extent_bit()
- trace_btrfs_convert_extent_bit()

Since selftests could create temporary extent_io_tree without fs_info,
modify TP_fast_assign_fsid() to accept NULL as fs_info. NULL fs_info
will lead to all zero fsid.

The output would be:
btrfs_set_extent_bit: <FDID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22040576 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22044672 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22048768 len=4096 set_bits=LOCKED
btrfs_clear_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=16384 clear_bits=LOCKED
^^^ Extent buffer 22036480 read from disk, the locking progress

btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30425088 len=16384 set_bits=DIRTY
btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30441472 len=16384 set_bits=DIRTY
^^^ 2 new tree blocks allocated in one transaction

btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30523392 len=16384 set_bits=DIRTY
btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30556160 len=16384 set_bits=DIRTY
^^^ 2 old tree blocks get pinned down

There is one point which need attention:
1) Those trace events can be pretty heavy:
The following workload would generate over 400 trace events.

mkfs.btrfs -f $dev
start_trace
mount $dev $mnt -o enospc_debug
sync
touch $mnt/file1
touch $mnt/file2
touch $mnt/file3
xfs_io -f -c "pwrite 0 16k" $mnt/file4
umount $mnt
end_trace

It's not recommended to use them in real world environment.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename enums ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 43eb5f29 28-Feb-2019 Qu Wenruo <wqu@suse.com>

btrfs: Introduce extent_io_tree::owner to distinguish different io_trees

Btrfs has the following different extent_io_trees used:

- fs_info::free_extents[2]
- btrfs_inode::io_tree - for both normal inodes and the btree inode
- btrfs_inode::io_failure_tree
- btrfs_transaction::dirty_pages
- btrfs_root::dirty_log_pages

If we want to trace changes in those trees, it will be pretty hard to
distinguish them.

Instead of using hard-to-read pointer address, this patch will introduce
a new member extent_io_tree::owner to track the owner.

This modification needs all the callers of extent_io_tree_init() to
accept a new parameter @owner.

This patch provides the basis for later trace events.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c258d6e3 28-Feb-2019 Qu Wenruo <wqu@suse.com>

btrfs: Introduce fs_info to extent_io_tree

This patch will add a new member fs_info to extent_io_tree.

This provides the basis for later trace events to distinguish the output
between different btrfs filesystems. While this increases the size of
the structure, we want to know the source of the trace events and
passing the fs_info as an argument to all contexts is not possible.

The selftests are now allowed to set it to NULL as they don't use the
tracepoints.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8e928218 14-Feb-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix corruption reading shared and compressed extents after hole punching

In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:

#!/bin/bash

mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc

# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync

echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"

# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync

echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"

# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync

echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"

echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"

umount /dev/sdc

When running the script we get the following output:

Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar

This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared
extents").

Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.

A test case for fstests follows soon.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bb58eb9e 24-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent_io: Kill the forward declaration of flush_write_bio

There is no need to forward declare flush_write_bio(), as it only
depends on submit_one_bio(). Both of them are pretty small, just move
them to kill the forward declaration.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 352646c7 30-Jan-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Fix grossly misleading argument names in extent io search

The variables and function parameters of __etree_search which pertain to
prev/next are grossly misnamed. Namely, prev_ret holds the next state
and not the previous. Similarly, next_ret actually holds the previous
extent state relating to the offset we are interested in. Fix this by
renaming the variables as well as switching the arguments order. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba8f5206 30-Jan-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove EXTENT_FIRST_DELALLOC bit

With the refactoring introduced in 8b62f87bad9c ("Btrfs: reworki
outstanding_extents") this flag became unused. Remove it and renumber
the following flags accordingly. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ab47a8d 12-Dec-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused arguments from btrfs_get_extent_fiemap

This function is a simple wrapper over btrfs_get_extent that returns
either:

a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.

To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6dc4f100 15-Feb-2019 Ming Lei <ming.lei@redhat.com>

block: allow bio_for_each_segment_all() to iterate over multi-page bvec

This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c3a7ce73 15-Feb-2019 Ming Lei <ming.lei@redhat.com>

btrfs: use mp_bvec_last_segment to get bio's last page

Preparing for supporting multi-page bvec.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8a2ee44a 15-Feb-2019 Christoph Hellwig <hch@lst.de>

btrfs: look at bi_size for repair decisions

bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O. But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying. Use bi_size for that and shift by
PAGE_SHIFT. This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f86196ea 03-Jan-2019 Nikolay Borisov <nborisov@suse.com>

fs: don't open code lru_to_page()

Multiple filesystems open code lru_to_page(). Rectify this by moving
the macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.

No functional changes.

Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52042d8e 27-Nov-2018 Andrea Gelmini <andrea.gelmini@gelma.net>

btrfs: Fix typos in comments and strings

The typos accumulate over time so once in a while time they get fixed in
a large patch.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 61ed3a14 29-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Refactor main loop in extent_readpages

extent_readpages processes all pages in the readlist in batches of 16,
this is implemented by a single for loop but thanks to an if condition
the loop does 2 things based on whether we've filled the batch or not.
Additionally due to the structure of the code there is an additional
check which deals with partial batches.

Streamline all of this by explicitly using two loops. The outter one is
used to process all pages while the inner one just fills in the batch
of 16 (currently). Due to this new structure the code guarantees that
all pages are processed in the loop hence the code to deal with any
leftovers is eliminated.

This also enable the compiler to inline __extent_readpages:

./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for

add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
Function old new delta
extent_readpages 476 1136 +660
__extent_readpages 820 - -820
Total: Before=44315, After=44155, chg -0.36%

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7073017a 05-Dec-2018 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: use offset_in_page instead of open-coding it

Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.

So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3522e903 28-Nov-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: remove always true if branch in find_delalloc_range

The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# da12fe54 27-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Refactor btrfs_merge_bio_hook

This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cc2c39d6 28-Nov-2018 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: don't initialize 'offset' in map_private_extent_buffer()

In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.

But later on it is overwritten with either the offset from the extent
buffer's start or 0.

So get rid of the initial initialization.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 78e62c02 22-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::readpage_io_failed_hook

For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.

This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b3a0dd50 22-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: replace btrfs_io_bio::end_io with a simple helper

The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.

This shrinks struct btrfs_io_bio by 8 bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# ce9f967f 19-Nov-2018 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: use EXPORT_FOR_TESTS for conditionally exported functions

Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.

Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.

As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9bfd61d9 26-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Replace BUG_ON with ASSERT in find_lock_delalloc_range

lock_delalloc_pages should only return 2 values - 0 in case of success
and -EAGAIN if the range of pages to be locked should be shrunk due to
some of gone. Manual inspections confirms that this is indeed the case
since __process_pages_contig is where lock_delalloc_pages gets its
return value. The latter always returns 0 or -EAGAIN so the invariant
holds. No functional changes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 917aacec 26-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Sink find_lock_delalloc_range's 'max_bytes' argument

All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3cd24c69 01-Nov-2018 Ethan Lien <ethanlien@synology.com>

btrfs: use tagged writepage to mitigate livelock of snapshot

Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.

This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.

We do a simple snapshot speed test on a Intel D-1531 box:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 1m58sec
patched: 6.54sec

This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.

For a multi writers, random write test:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 15.83sec
patched: 10.35sec

The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c629732d 08-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered

This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8cc0237a 08-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_page_data argument from writepage_delalloc

The only remaining use of the 'epd' argument in writepage_delalloc is
to reference the extent_io_tree which was set in extent_writepages. Since
it is guaranteed that page->mapping of any page passed to
writepage_delalloc (and __extent_writepage as the sole caller) to be
equal to that passed in extent_writepages we can directly get the
io_tree via the already passed inode (which is also taken from
page->mapping->host). No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7789a55a 08-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Move epd::extent_locked check to writepage_delalloc's caller

If epd::extent_locked is set then writepage_delalloc terminates. Make
this a bit more apparent in the caller by simply bubbling the check up.
This enables to remove epd as an argument to writepage_delalloc in a
future patch. No functional change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 46cc775e 15-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Adjust loop in free_extent_buffer

The loop construct in free_extent_buffer was added in
242e18c7c1a8 ("Btrfs: reduce lock contention on extent buffer locks")
as means of reducing the times the eb lock is taken, the non-last ref
count is decremented and lock is released. As the special handling
of UNMAPPED extent buffers was removed now there is only one decrement
op which is happening for EXTENT_BUFFER_UNMAPPED case.

This commit modifies the loop condition so that in case of UNMAPPED
buffers the eb's lock is taken only if we are 100% sure the eb is going
to be freed by the current executor of the code. Additionally, remove
superfluous ref count ops in btrfs test.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9cfc8ba7 15-Aug-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeing

Now that the whole of btrfs code has been audited for eb reference count
management it's time to remove the hunk in free_extent_buffer that
essentially considered the condition

"eb->ref == 2 && EXTENT_BUFFER_DUMMY"

to equal "eb->ref = 1". Also remove the last location
which takes an extra reference count in alloc_test_extent_buffer.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# abbb55f4 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::split_extent_hook callback

This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c848198 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::merge_extent_hook callback

This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a36bb5f9 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::clear_bit_hook callback

This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e06a1fc9 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::set_bit_hook extent_io callback

This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.

This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 65a680f6 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::check_extent_io_range callback

This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7087a9d8 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::writepage_end_io_hook

This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d75855b4 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::writepage_start_hook

This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5eaad97a 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::fill_delalloc

This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 0a943c65 04-Dec-2017 Matthew Wilcox <willy@infradead.org>

btrfs: Convert page cache to XArray

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>


# 10bbd235 05-Dec-2017 Matthew Wilcox <willy@infradead.org>

pagevec: Use xa_mark_t

Removes sparse warnings.

Signed-off-by: Matthew Wilcox <willy@infradead.org>


# 9c36396c 17-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: tests: add separate stub for find_lock_delalloc_range

The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# abb57ef3 13-Sep-2018 Liu Bo <bo.liu@linux.alibaba.com>

Btrfs: skip set_page_dirty if eb pages are already dirty

As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.

Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 51995c39 13-Sep-2018 Liu Bo <bo.liu@linux.alibaba.com>

Btrfs: assert page dirty bit on extent buffer pages

Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9688e9a9 22-Aug-2018 Liu Bo <bo.liu@linux.alibaba.com>

Btrfs: use next_state in find_first_extent_bit

As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5e9d3982 17-Aug-2018 Jens Axboe <axboe@kernel.dk>

btrfs: readpages() should submit IO as read-ahead

a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.

Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5cdc84bf 18-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: drop extent_io_ops::set_range_writeback callback

The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 00032d38 18-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: drop extent_io_ops::merge_bio_hook callback

The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 05912a3c 18-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: drop extent_io_ops::tree_fs_info callback

All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b0132a3b 27-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_UNMAPPED

EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 07e21c4d 27-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Document locking requirement via lockdep_assert_held

Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55ac0139 19-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_release_extent_buffer_page

The function used to release one page (and always the first one), but
not anymore since a50924e3a4d7fccb0ecfbd4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page"). Update the name and comment.

Signed-off-by: David Sterba <dsterba@suse.com>


# d64766fd 27-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Refactor loop in btrfs_release_extent_buffer_page

The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.

The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97acee6532b36dd6e99 ("Btrfs: set page->private to the eb").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b16d011e 04-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Reword dodgy comments in alloc_extent_buffer

Commit eb14ab8ed24a ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 28187ae5 04-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Simplify page unlocking in alloc_extent_buffer

Current version of the page unlocking code was added in
727011e07cbd ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe6e ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").

However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ebcc3263 29-Jun-2018 David Sterba <dsterba@suse.com>

btrfs: open-code bio_set_op_attrs

The helper is trivial and marked as deprecated.

Signed-off-by: David Sterba <dsterba@suse.com>


# cc5e31a4 01-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: switch types to int when counting eb pages

The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 65ad0104 29-Jun-2018 David Sterba <dsterba@suse.com>

btrfs: pass only eb to num_extent_pages

Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd3599a0 11-Jul-2018 Filipe Manana <fdmanana@suse.com>

Btrfs: fix file data corruption after cloning a range and fsync

When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).

The following example, turned into a test for fstests, reproduces the
issue:

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt

$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar

$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar

<power fail>

$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure

In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.

Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).

CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f0986318 20-Jun-2018 Filipe Manana <fdmanana@suse.com>

Btrfs: fix physical offset reported by fiemap for inline extents

Commit 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when
fm_extent_count is zero") introduced a regression where we no longer
report 0 as the physical offset for inline extents (and other extents
with a special block_start value). This is because it always sets the
variable used to report the physical offset ("disko") as em->block_start
plus some offset, and em->block_start has the value 18446744073709551614
((u64) -2) for inline extents.

This made the btrfs test 004 (from fstests) often fail, for example, for
a file with an inline extent we have the following items in the subvolume
tree:

item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160
generation 25 transid 38 size 1525 nbytes 1525
block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x2(none)
atime 1529342058.461891730 (2018-06-18 18:14:18)
ctime 1529342058.461891730 (2018-06-18 18:14:18)
mtime 1529342058.461891730 (2018-06-18 18:14:18)
otime 1529342055.869892885 (2018-06-18 18:14:15)
item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13
index 25 namelen 3 name: fc7
item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546
generation 38 type 0 (inline)
inline extent data size 1525 ram_bytes 1525 compression 0 (none)

Then when test 004 invoked fiemap against the file it got a non-zero
physical offset:

$ filefrag -v /mnt/p0/d4/d7/fc7
Filesystem type is: 9123683e
File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 4095: 18446744073709551614.. 4093: 4096: last,not_aligned,inline,eof
/mnt/p0/d4/d7/fc7: 1 extent found

This resulted in the test failing like this:

btrfs/004 49s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2016-08-23 10:17:35.027012095 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad 2018-06-18 18:15:02.385872155 +0100
@@ -1,3 +1,10 @@
QA output created by 004
*** test backref walking
-*** done
+./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer expression expected
+ERROR: 7.55578637259143e+22 is not a valid numeric value.
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -s 65536 -P 7.55578637259143e+22 /home/fdmanana/btrfs-tests/scratch_1
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004

The large number in scientific notation reported as an invalid numeric
value is the result from the filter passed to perl which multiplies the
physical offset by the block size reported by fiemap.

So fix this by ensuring the physical offset is always set to 0 when we
are processing an extent with a special block_start value.

Fixes: 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9d311e11 07-May-2018 Robbie Ko <robbieko@synology.com>

Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero

[BUG]
fm_mapped_extents is not correct when fm_extent_count is 0
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1

When user space wants to get the number of file extents,
set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.

In the above example, fiemap will return with fm_mapped_extents set to 4,
but it should be 1 since there's only one entry in the output.

[REASON]
The problem seems to be that disko is only set if
fieinfo->fi_extents_max is set. And this member is initialized, in the
generic ioctl_fiemap function, to the value of used-passed
fm_extent_count. So when the user passes 0 then fi_extent_max is also
set to zero and this causes btrfs to not initialize disko at all.
Eventually this leads emit_fiemap_extent being called with a bogus
'phys' argument preventing proper fiemap entries merging.

[FIX]
Move the disko initialization earlier in extent_fiemap making it
independent of user-passed arguments, allowing emit_fiemap_extent to
properly handle consecutive extent entries.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8ac9f7c1 20-May-2018 Kent Overstreet <kent.overstreet@gmail.com>

btrfs: convert to bioset_init()/mempool_init()

Convert btrfs to embedded bio sets.

Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ae225a8 19-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove tree argument from extent_writepages

It can be directly referenced from the passed address_space so do that.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2a3ff0ad 19-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove redundant tree argument from extent_readpages

This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 29c68b2d 19-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove map argument from try_release_extent_state

It's not used in the function so just remove it. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 477a30ba 19-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Sink extent_tree arguments in try_release_extent_mapping

This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6faa8f47 18-Apr-2018 Howard McLauchlan <hmclauchlan@fb.com>

btrfs: clean up le_bitmap_{set, clear}()

le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.

Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c1d7c514 03-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: replace GPL boilerplate by SPDX -- sources

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Signed-off-by: David Sterba <dsterba@suse.com>


# b93b0163 10-Apr-2018 Matthew Wilcox <willy@infradead.org>

page cache: use xa_lock

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 57599c7e 01-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: lift errors from add_extent_changeset to the callers

The missing error handling in add_extent_changeset was hidden, so make
it at least visible in the callers.

Signed-off-by: David Sterba <dsterba@suse.com>


# b8b3d625 12-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: document more parameters of submit_extent_page

Signed-off-by: David Sterba <dsterba@suse.com>


# 0c8508a6 12-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: cleanup merging conditions in submit_extent_page

The merge call was factored out to a separate helper but it's a trivial
one and arguably we can opencode it and cache the value.

Signed-off-by: David Sterba <dsterba@suse.com>


# 8eec8296 06-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: remove redundant variable in __do_readpage

The value of page_end is only stored to end, no other use.

Signed-off-by: David Sterba <dsterba@suse.com>


# 5c2b1fd7 06-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: assume that bio_ret is always valid in submit_extent_page

All callers pass a valid pointer so we can drop the redundant checks.
The call to submit_one_bio never happend and can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>


# e67c718b 19-Feb-2018 David Sterba <dsterba@suse.com>

btrfs: add more __cold annotations

The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>


# ba020491 12-Feb-2018 Anand Jain <Anand.Jain@oracle.com>

btrfs: extent_buffer_uptodate() make it static and inline

extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.

Before:
text data bss dec hex filename
1131257 82898 18992 1233147 12d0fb fs/btrfs/btrfs.ko

After:
text data bss dec hex filename
1131090 82898 18992 1232980 12d054 fs/btrfs/btrfs.ko

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# af2679e4 25-Jan-2018 Liu Bo <bo.li.liu@oracle.com>

Btrfs: enhance leak debug checker for extent state and extent buffer

This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e43bbe5e 12-Dec-2017 David Sterba <dsterba@suse.com>

btrfs: sink unlock_extent parameter gfp_flags

All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba <dsterba@suse.com>


# d810a4be 07-Dec-2017 David Sterba <dsterba@suse.com>

btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC

There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.

Signed-off-by: David Sterba <dsterba@suse.com>


# ffc9c8dd 13-Dec-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove redundant bio_get/bio_set pair from submit_one_bio

The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# aab6e9ed 30-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: unify extent_page_data type passed as void

Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.

Signed-off-by: David Sterba <dsterba@suse.com>


# 935db853 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink writepage parameter to extent_write_cache_pages

The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.

Signed-off-by: David Sterba <dsterba@suse.com>


# 25b860e0 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink flush_fn to extent_write_cache_pages

All callers pass the same value flush_write_bio.

Signed-off-by: David Sterba <dsterba@suse.com>


# e2932ee0 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: merge two flush_write_bio helpers

flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.

Signed-off-by: David Sterba <dsterba@suse.com>


# 0a9b0e53 08-Dec-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: sink extent_write_full_page tree argument

The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5e3ee236 08-Dec-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: sink extent_write_locked_range tree parameter

This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ebbede42 03-Dec-2017 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE

Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 4a2d25cd 23-Nov-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove redundant FLAG_VACANCY

Commit 9036c10208e1 ("Btrfs: update hole handling v2") added the
FLAG_VACANCY to denote holes, however there was already a consistent way
of flagging extents which represent hole - ->block_start =
EXTENT_MAP_HOLE. And also the only place where this flag is checked is
in the fiemap code, but the block_start value is also checked and every
other place in the filesystem detects holes by using block_start
value's. So remove the extra flag. This survived a full xfstest run.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6af49dbd 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to read_extent_buffer_pages

All callers pass btree_get_extent, which needs to be exported.

Signed-off-by: David Sterba <dsterba@suse.com>


# 4ef77695 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to __do_contiguous_readpages

All callers pass btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>


# e4d17ef5 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to __extent_readpages

All callers pass btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>


# 0932584b 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to extent_readpages

There's only one caller that passes btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>


# e3350e16 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to get_extent_skip_holes

All callers pass btrfs_get_extent_fiemap and get_extent_skip_holes
itself is used only as a fiemap helper.

Signed-off-by: David Sterba <dsterba@suse.com>


# 2135fb9b 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to extent_fiemap

All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.

Signed-off-by: David Sterba <dsterba@suse.com>


# 3c98c62f 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: drop get_extent from extent_page_data

Previous patches cleaned up all places where
extent_page_data::get_extent was set and it was btrfs_get_extent all the
time, so we can simply call that instead.

This also reduces size of extent_page_data by 8 bytes which has positive
effect on stack consumption on various functions on the write out path.

Signed-off-by: David Sterba <dsterba@suse.com>


# deac642d 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to extent_write_full_page

There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>


# 916b9298 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to extent_write_locked_range

There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>


# 43317599 22-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink get_extent parameter to extent_writepages

There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>


# ae0f1625 31-Oct-2017 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to clear_extent_bit

All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.

Signed-off-by: David Sterba <dsterba@suse.com>


# 66b0c887 31-Oct-2017 David Sterba <dsterba@suse.com>

btrfs: prepare to drop gfp mask parameter from clear_extent_bit

Use __clear_extent_bit directly in case we want to pass unknown
gfp flags. Otherwise all clear_extent_bit callers use GFP_NOFS, so we
can sink them to the function and reduce argument count, at the cost
that __clear_extent_bit has to be exported.

Signed-off-by: David Sterba <dsterba@suse.com>


# d3fac6ba 24-Oct-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove redundant mirror_num arg

The following callpath is always invoked with mirror_num set to 0, so
let's remove it as an argument and directly pass 0 to __do_redpage. No
functional change.

extent_readpages
__extent_readpages
__do_contiguous_readpages
__do_readpage

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a0b60d72 18-Dec-2017 Ming Lei <ming.lei@redhat.com>

btrfs: avoid access to .bi_vcnt directly

BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
longer valid once we start enabling multipage bvecs.
correct once we start to enable multipage bvec.

Use bio_nr_pages() to do that instead.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c45a8f2d 18-Dec-2017 Ming Lei <ming.lei@redhat.com>

fs: convert to bio_last_bvec_all()

This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1751e8a6 27-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"

SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86679820 15-Nov-2017 Mel Gorman <mgorman@techsingularity.net>

mm, pagevec: remove cold parameter for pagevecs

Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 67fd707f 15-Nov-2017 Jan Kara <jack@suse.cz>

mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()

All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4006f437 15-Nov-2017 Jan Kara <jack@suse.cz>

btrfs: use pagevec_lookup_range_tag()

We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f82b7359 23-Oct-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: add write_flags for compression bio

Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.

It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.

This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6273b7f8 04-Oct-2017 David Sterba <dsterba@suse.com>

btrfs: get rid of sector_t and use u64 offset in submit_extent_page

The use of sector_t in the callchain of submit_extent_page is not
necessary. Switch to u64 and rename the variable and use byte units
instead of 512b, ie. dropping the >> 9 shifts and avoiding the
con(tro)versions of sector_t.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6c5a4e2c 04-Oct-2017 David Sterba <dsterba@suse.com>

btrfs: rename page offset parameter in submit_extent_page

We're going to remove sector_t and will use 'offset', so this patch
frees the name.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 18fdc679 13-Sep-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: remove bio_flags which indicates a meta block of log-tree

Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.

We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.

Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.

1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.

2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.

However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.

This removes the bio_flags related stuff for writing log-tree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2d8ce70a 03-Oct-2017 Goffredo Baroncelli <kreijack@inwind.it>

btrfs: avoid overflow when sector_t is 32 bit

Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5f14efd3 23-Aug-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: do not reset bio->bi_ops while writing bio

flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().

This remove this unnecessary bio->bi_opf setting.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ff40adf7 24-Aug-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: use the new helper wbc_to_write_flags

This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.

Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 74d46992 23-Aug-2017 Christoph Hellwig <hch@lst.de>

block: replace bi_bdev with a gendisk pointer and partitions index

This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f716abd5 09-Aug-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix out of bounds array access while reading extent buffer

There is a corner case that slips through the checkers in functions
reading extent buffer, ie.

if (start < eb->len) and (start + len > eb->len),
then

a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,

b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).

The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.

It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref". And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.

This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4b81ba48 06-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: merge REQ_OP and REQ_ flags to one parameter in submit_extent_page

The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.

Signed-off-by: David Sterba <dsterba@suse.com>


# f1c77c55 06-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: cleanup types storing REQ_*

Unify types of local variables and parameters that store various
REQ_* values to unsigned int.

Signed-off-by: David Sterba <dsterba@suse.com>


# e4ff5fb5 19-Jul-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused parameters from volume.c functions

This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.

btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3abeb
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit

btrfs_is_parity_mirror's mirror_num - same as above

chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c4b ("Btrfs:
devid subset filter") and never used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bb739cf0 28-Jun-2017 Edmund Nadolski <enadolski@suse.com>

btrfs: btrfs_check_shared should manage its own transaction

Commit afce772e87c3 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cbb1f45 28-Jun-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: struct-funcs, constify readers

We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.

No impact on code, but serves as documentation that a buffer is intended
not to be modified.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc98a42c 17-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)

Firstly by applying the following with coccinelle's spatch:

@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)

@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)

to remove left over excess bracketage and finally by applying:

@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>


# c3cfb656 13-Jul-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix unexpected return value of bio_readpage_error

With blk_status_t conversion (that are now present in master),
bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set
%ret if it runs without problems.

This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
patch is applicable to both codebases with and without blk_status_t.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e8f5b395 13-Jul-2017 David Sterba <dsterba@suse.com>

btrfs: btrfs_create_repair_bio never fails, skip error handling

As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.

Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c09abff8 13-Jul-2017 David Sterba <dsterba@suse.com>

btrfs: cloned bios must not be iterated by bio_for_each_segment_all

We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration. Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions. This patch adds assertions to all remaining
bio_for_each_segment_all cases.

[1] https://patchwork.kernel.org/patch/9838535/

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 848c23b7 21-Jun-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Remove false alert when fiemap range is smaller than on-disk extent

Commit 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.

However such warning doesn't take the following case into consideration:

0 4K 8K
|<---- fiemap range --->|
|<----------- On-disk extent ------------------>|

In this case, the whole 0~8K is cached, and since it's larger than
fiemap range, it break the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, and caught by the
final fiemap extent sanity check, causing kernel warning.

This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have cached
fiemap extent.

Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Fixes: 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e6959b93 27-Jun-2017 Jens Axboe <axboe@kernel.dk>

btrfs: add support for passing in write hints for buffered writes

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 104b4e51 20-Jun-2017 Nikolay Borisov <nborisov@suse.com>

percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch

Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable. Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.

This patch doesn't cause any functional changes.

tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>


# c5e4c3d7 12-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to btrfs_io_bio_alloc

We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 184f999e 12-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: add helper to initialize the non-bio part of btrfs_io_bio

We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c821e7f3 02-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: pass bytes to btrfs_bio_alloc

Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f2179a5 02-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: remove redundant parameters from btrfs_bio_alloc

All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.

submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.

Signed-off-by: David Sterba <dsterba@suse.com>


# 8b6c1d56 02-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to btrfs_bio_clone

All callers pass GFP_NOFS.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e4f56903 02-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: btrfs_io_bio_alloc never fails, skip error handling

Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0c4dd97c 02-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: btrfs_bio_alloc never fails, skip error handling

Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6e707bcd 02-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: bioset allocations will never fail, adapt our helpers

Christoph pointed out that bio allocations backed by a bioset will never
fail. As we always use a bioset for all bio allocations, we can skip
the error handling. This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.

CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d9ec8c4 29-May-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET

Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>


# e477094f 16-May-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial

We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 17347cec 15-May-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: change how we iterate bios in endio

Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio. Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2f8e9140 15-May-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: new helper btrfs_bio_clone_partial

This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 015c1bd9 04-Apr-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: use bio_clone_fast to clone our bio

For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.

Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it. This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7870d082 05-May-2017 Josef Bacik <josef@toxicpanda.com>

Btrfs: don't pass the inode through clean_io_failure

Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6ec656bc 05-May-2017 Josef Bacik <josef@toxicpanda.com>

btrfs: remove inode argument from repair_io_failure

Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c6100a4b 05-May-2017 Josef Bacik <josef@toxicpanda.com>

Btrfs: replace tree->mapping with tree->private_data

For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 011067b0 17-Jun-2017 NeilBrown <neilb@suse.com>

blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e4cbee9 03-Jun-2017 Christoph Hellwig <hch@lst.de>

block: switch bios to blk_status_t

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# bff5baf8 09-May-2017 Colin Ian King <colin.king@canonical.com>

btrfs: fix incorrect error return ret being passed to mapping_set_error

The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.

Detected by CoverityScan, CID#1414312 ("Logically dead code")

Fixes: 5dca6eea91653e ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4751832d 06-Apr-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: fiemap: Cache and merge fiemap extent before submit it to user

[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
| | Found on-disk (ino, EXTENT_DATA, 0)
| |- add_extent_mapping()
| |- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
| | Found on-disk (ino, EXTENT_DATA, 16k)
| |- add_extent_mapping()
| | |- try_merge_map()
| | Merge with previous em start=0 len=16k
| | resulting em start=0 len=32k
| |- Return (em->start=0, len=32K) << Merged result
|- Stripe off the unrelated range (0~16K) of return em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.

So by this method, we can merge all fiemap extents.

It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c725328c 29-Mar-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: enable repair during read for raid56 profile

Now that scrub can fix data errors with the help of parity for raid56
profile, repair during read is able to as well.

Although the mirror num in raid56 scenario has different meanings, i.e.
0 or 1: read data directly
> 1: do recover with parity,
it could be fit into how we repair bad block during read.

The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
device and position on it.

Cc: David Sterba <dsterba@suse.cz>
Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 592d92ee 14-Mar-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: create a helper for getting chunk map

We have similar code here and there, this merges them into a helper.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b7ac31b7 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

btrfs: convert extent_state.refs from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 490b54d6 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

btrfs: convert extent_map.refs from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9d0d1c8b 24-Mar-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: bring back repair during read

Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.

This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 49d4a334 06-Mar-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix regression in lock_delalloc_pages

The bug is a regression after commit
(da2c7009f6ca "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8fd "Btrfs: use helper to simplify lock/unlock pages").

So if the dirty pages which are under writeback got truncated partially
before we lock the dirty pages, we couldn't find all pages mapping to the
delalloc range, and the bug didn't return an error so it kept going on and
found that the delalloc range got truncated and got to unlock the dirty
pages, and then the ASSERT could caught the error, and showed

-----------------------------------------------------------------------------
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-----------------------------------------------------------------------------

This fixes the bug by returning the proper -EAGAIN.

Cc: David Sterba <dsterba@suse.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 20a7db8a 17-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: add dummy callback for readpage_io_failed and drop checks

Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.

Signed-off-by: David Sterba <dsterba@suse.com>


# 20c9801d 17-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: drop checks for mandatory extent_io_ops callbacks

We know that eadpage_end_io_hook, submit_bio_hook and merge_bio_hook are
always defined so we can drop the checks before we call them.

Signed-off-by: David Sterba <dsterba@suse.com>


# c3988d63 17-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: let writepage_end_io_hook return void

There's no error path in any of the instances, always return 0.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc4f21b1 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make get_extent_t take btrfs_inode

In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6fc0ef68 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_clear_bit_hook take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7ab7956e 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_free_io_failure_record take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b30cb441 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: make clean_io_failure take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9d4f7f8a 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: make repair_io_failure take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ac1f4ac 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: make free_io_failure take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a776c6fa 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_lookup_ordered_range take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4242b64a 10-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter from extent_write_cache_pages

The 'tree' was used to call locking hook that does not exist anymore.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d4b9496 10-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter from update_nr_written

The logic has been updated in "Btrfs: make mapping->writeback_index
point to the last written page" (a91326679f2a0a4c2) and page is not
needed anymore.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2df8bb4 10-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter from submit_extent_page

This used to hold number of maximum pages to allocate, but this is now
limited by BIO_MAX_PAGES. The local are now unused and removed as well.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 53d32359 13-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: embed extent_changeset::range_changed to the structure

We can embed range_changed to the extent changeset to address following
problems:

- no need to allocate ulist dynamically, we also get rid of the GFP_NOFS
for free
- fix lack of allocation failure checking in btrfs_qgroup_reserve_data

The stack consuption where extent_changeset is used slightly increases:

before: 16
after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40

Which is bearable.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fe01aa65 09-Feb-2017 Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>

Btrfs: add another missing end_page_writeback on submit_extent_page failure

If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.

__extent_writepage_io, that is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this will cause the hang at filemap_fdatawait_range,
because it waits the writeback bit to be cleared from the failed page.
So, we have to call end_page_writeback to clear the writeback bit.

For reproducing the hang, we inject a fault like

if (should_failtest()) { // I define should_failtest()
bio = NULL;
}
else {
bio = btrfs_bio_alloc(...);
}

in submit_extent_page.

We should also check whether page has the bit before end_page_writeback,
to avoid the conflict against the other end_page_writeback in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io,
but also in write_one_eb too, because it misses the check.

Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>


# 76c0021d 10-Feb-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: use helper to simplify lock/unlock pages

Since we have a helper to set page bits, let lock_delalloc_pages and
__unlock_for_delalloc use it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# da2c7009 10-Feb-2017 Liu Bo <bo.li.liu@oracle.com>

btrfs: teach __process_pages_contig about PAGE_LOCK operation

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changes to the helper separated from the following patch ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 873695b3 02-Feb-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: create helper for processing bits on contiguous pages

This introduces a new helper which can be used to process pages bits.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bcf93489 25-Jan-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: cleanup unused cached_state in __extent_writepage_io

@cached_state is no more required in __extent_writepage_io, also remove
the goto label.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f85b7379 20-Jan-2017 David Sterba <dsterba@suse.com>

btrfs: fix over-80 lines introduced by previous cleanups

This goes as a separate patch because fixing that inside the patches
caused too many many conflicts.

Signed-off-by: David Sterba <dsterba@suse.com>


# 4a0cc7ca 10-Jan-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_ino take a struct btrfs_inode

Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 1aceabf3 09-Jan-2017 Michal Hocko <mhocko@suse.com>

btrfs: drop gfp mask tweaking in try_release_extent_state

try_release_extent_state reduces the gfp mask to GFP_NOFS if it is
compatible. This is true for GFP_KERNEL as well. There is no real
reason to do that though. There is no new lock taken down the
the only consumer of the gfp mask which is
try_release_extent_state
clear_extent_bit
__clear_extent_bit
alloc_extent_state

So this seems just unnecessary and confusing.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3ba7ab22 09-Jan-2017 Michal Hocko <mhocko@suse.com>

btrfs: fix up misleading GFP_NOFS usage in btrfs_releasepage

b335b0034e25 ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator")
has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just
to prevent from giving an unappropriate gfp mask to the slab allocator
deeper down the callchain (in alloc_extent_state). This is wrong for
two reasons a) GFP_NOFS might be just too restrictive for the calling
context b) it is better to tweak the gfp mask down when it needs that.

So just remove the mask tweaking from btrfs_releasepage and move it
down to alloc_extent_state where it is needed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3a45bb20 09-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: remove root parameter from transaction commit/end routines

Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ff7e61e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: take an fs_info directly when the root is not used otherwise

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0b246afa 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: root->fs_info cleanup, add fs_info convenience variables

In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# da17066c 15-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: pull node/sector/stripe sizes out of root and into fs_info

We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 58e8012c 08-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: add optimized version of eb to eb copy

Using copy_extent_buffer is suitable for copying betwenn buffers from an
arbitrary offset and deals with page boundaries. This is not necessary
when doing a full extent_buffer-to-extent_buffer copy. We can utilize
the copy_page helper as well.

Signed-off-by: David Sterba <dsterba@suse.com>


# b159fa28 08-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: remove constant parameter to memset_extent_buffer and rename it

The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>


# fba1acf9 08-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: use specialized page copying helpers in btrfs_clone_extent_buffer

The copy_page is usually optimized and can be faster than memcpy.

Signed-off-by: David Sterba <dsterba@suse.com>


# f157bf76 09-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: introduce helpers for updating eb uuids

The fsid and chunk tree uuid are always located in the first page,
we don't need the to use write_extent_buffer.

Signed-off-by: David Sterba <dsterba@suse.com>


# cf8cddd3 27-Oct-2016 Christoph Hellwig <hch@lst.de>

btrfs: don't abuse REQ_OP_* flags for btrfs_map_block

btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations. But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags. This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 70fd7614 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block,fs: use REQ_* flags directly

Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9c894696 12-Oct-2016 Dan Carpenter <dan.carpenter@oracle.com>

Btrfs: remove some no-op casts

We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate. The cast doesn't
matter in this case, the code works as intended. It causes a static
checker warning though so let's remove it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2fe1d551 22-Sep-2016 Omar Sandoval <osandov@fb.com>

Btrfs: fix free space tree bitmaps on big-endian systems

In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
bitmaps in memory and copy them directly to/from the extent buffers with
{read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
granularity, which is equivalent to a little-endian bitmap. This means
that on big-endian systems, the in-memory bitmaps will be written to
disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
memory.

Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 851cd173 23-Sep-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: memset to avoid stale content in btree leaf

This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".

This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab8d0fc4 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert pr_* to btrfs_* where possible

For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 62e85577 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert printk(KERN_* to use pr_* calls

This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5d163e0e 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: unsplit printed strings

CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3eb548ee 14-Sep-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: memset to avoid stale content in btree node block

During updating btree, we could push items between sibling
nodes/leaves, for leaves data sections starts reversely from
the end of the block while for nodes we only have key pairs
which are stored one by one from the start of the block.

So we could do try to push key pairs from one node to the next
node right in the tree, and after that, we update the node's
nritems to reflect the correct end while leaving the stale
content in the node. One may intentionally corrupt the fs
image and access the stale content by bumping the nritems and
causes various crashes.

This takes the in-memory @nritems as the correct one and
gets to memset the unused part of a btree node.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8436ea91 02-Sep-2016 Josef Bacik <jbacik@fb.com>

Btrfs: kill the start argument to read_extent_buffer_pages

Nobody uses this, it makes no sense to do partial reads of extent buffers.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# afcdd129 02-Sep-2016 Josef Bacik <jbacik@fb.com>

Btrfs: add a flags field to btrfs_fs_info

We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba8b04c1 19-Jul-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset

Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.

This should reduce conflict of both patchset and the effort to rebase
them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2571e739 03-Aug-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix memory leak in reading btree blocks

So we can read a btree block via readahead or intentional read,
and we can end up with a memory leak when something happens as
follows,
1) readahead starts to read block A but does not wait for read
completion,
2) btree_readpage_end_io_hook finds that block A is corrupted,
and it needs to clear all block A's pages' uptodate bit.
3) meanwhile an intentional read kicks in and checks block A's
pages' uptodate to decide which page needs to be read.
4) when some pages have the uptodate bit during 3)'s check so
3) doesn't count them for eb->io_pages, but they are later
cleared by 2) so we has to readpage on the page, we get
the wrong eb->io_pages which results in a memory leak of
this block.

This fixes the problem by firstly getting all pages's locking and
then checking pages' uptodate bit.

t1(readahead) t2(readahead endio) t3(the following read)
read_extent_buffer_pages end_bio_extent_readpage
for pg in eb: for page 0,1,2 in eb:
if pg is uptodate: btree_readpage_end_io_hook(pg)
num_reads++ if uptodate:
eb->io_pages = num_reads SetPageUptodate(pg) _______________
for pg in eb: for page 3 in eb: read_extent_buffer_pages
if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb:
__extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate:
clear_extent_buffer_uptodate(eb) num_reads++
for pg in eb: eb->io_pages = num_reads
ClearPageUptodate(page) _______________
for pg in eb:
if pg is NOT uptodate:
__extent_read_full_page(pg)

So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# afce772e 12-Jun-2016 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: fix check_shared for fiemap ioctl

Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.

First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b571bc606 30-Jul-2016 Shaun Tancheff <shaun@tancheff.com>

Fixup direct bi_rw modifiers

bi_rw should be using bio_set_op_attrs to set bi_rw.

Signed-off-by: Shaun Tancheff <shaun@tancheff.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 20bd723e 26-Jul-2016 Paolo Valente <paolo.valente@linaro.org>

block: add missing group association in bio-cloning functions

When a bio is cloned, the newly created bio must be associated with
the same blkcg as the original bio (if BLK_CGROUP is enabled). If
this operation is not performed, then the new bio is not associated
with any group, and the group of the current task is returned when
the group of the bio is requested.

Depending on the cloning frequency, this may cause a large
percentage of the bios belonging to a given group to be treated
as if belonging to other groups (in most cases as if belonging to
the root group). The expected group isolation may thereby be broken.

This commit adds the missing association in bio-cloning functions.

Fixes: da2f0f74cf7d ("Btrfs: add support for blkio controllers")
Cc: stable@vger.kernel.org # v4.3+

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 8a5c743e 26-Jul-2016 Michal Hocko <mhocko@suse.com>

mm, memcg: use consistent gfp flags during readahead

Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs. This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well. The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# baf863b9 11-Jul-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix eb memory leak due to readpage failure

eb->io_pages is set in read_extent_buffer_pages().

In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.

When this eb's page (couldn't be the 1st page) fails to add itself to bio
due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio,
and ends up with a memory leak eventually.

This lets __do_readpage propagate errors to callers and adds the
'atomic_dec(&eb->io_pages)'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6f034ece 22-Jun-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: cleanup BUG_ON in merge_bio

One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(),
thus this aims to stop anyone to panic the whole system by using
their btrfs.

Since the error in merge_bio can only come from __btrfs_map_block()
when chunk tree mapping has something insane and __btrfs_map_block()
has already had printed the reason, we can just return errors in
merge_bio.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fba4b697 23-Jun-2016 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Fix slab accounting flags

BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.

To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 415b35a5 17-Jun-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix error handling in map_private_extent_buffer

map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
than pagesize,
2. when it detects something insane.

The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.

Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c871b0f2 06-Jun-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: check if extent buffer is aligned to sectorsize

Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81a75f67 05-Jun-2016 Mike Christie <mchristi@redhat.com>

btrfs: use bio fields for op and flags

The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1f7ad75b 05-Jun-2016 Mike Christie <mchristi@redhat.com>

btrfs: have submit_one_bio users use bio op accessors

This patch has btrfs's submit_one_bio users set the bio op using
bio_set_op_attrs and get the op using bio_op.

The next patches will continue to convert btrfs,
so submit_bio_hook and merge_bio_hook
related code will be modified to take only the bio. I did
not do it in this patch to try and keep it smaller.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 4e49ea4a 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block/fs/drivers: remove rw argument from submit_bio

This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>


# b9ef22de 01-Jun-2016 Feifei Xu <xufeifei@linux.vnet.ibm.com>

Btrfs: self-tests: Support non-4k page size

self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b5de8d0d 27-May-2016 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between device replace and read repair

While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".

So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>


# 01327610 19-May-2016 Nicholas D Steeves <nsteeves@gmail.com>

btrfs: fix string and comment grammatical issues and typos

Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2d324f59 17-May-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix unexpected return value of fiemap

btrfs's fiemap is supposed to return 0 on success and return < 0 on
error. however, ret becomes 1 after looking up the last file extent:

btrfs_lookup_file_extent ->
btrfs_search_slot(..., ins_len=0, cow=0)

and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.

This may confuse applications using ioctl(FIEL_IOC_FIEMAP).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e1860a77 09-May-2016 David Sterba <dsterba@suse.com>

btrfs: GFP_NOFS does not GFP_HIGHMEM

Masking HIGHMEM out of NOFS does not make sense.

Signed-off-by: David Sterba <dsterba@suse.com>


# 58409edd 04-May-2016 David Sterba <dsterba@suse.com>

btrfs: kill unused writepage_io_hook callback

It seems to be long time unused, since 2008 and
6885f308b5570 ("Btrfs: Misc 2.6.25 updates").

Propagating the removal touches some code but has no functional effect.

Signed-off-by: David Sterba <dsterba@suse.com>


# 210aa277 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to convert_extent_bit

Single caller passes GFP_NOFS. We can get rid of the
gfpflags_allow_blocking checks as NOFS can block but does not recurse to
filesystem through reclaim.

Signed-off-by: David Sterba <dsterba@suse.com>


# 059f791c 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: make state preallocation more speculative in __set_extent_bit

Similar to __clear_extent_bit, do not fail if the state preallocation
fails as we might not need it. One less BUG_ON.

Signed-off-by: David Sterba <dsterba@suse.com>


# 03bf5387 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: untangle gotos a bit in convert_extent_bit

Signed-off-by: David Sterba <dsterba@suse.com>


# 7ab5cb2a 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: untangle gotos a bit in __clear_extent_bit

Signed-off-by: David Sterba <dsterba@suse.com>


# b5a4ba14 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: untangle gotos a bit in __set_extent_bit

Signed-off-by: David Sterba <dsterba@suse.com>


# 2c53b912 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to set_record_extent_bits

Single caller passes GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>


# f734c44a 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to clear_record_extent_bits

Callers pass GFP_NOFS. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>


# 91166212 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to clear_extent_bits

Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>


# ceeb0ae7 26-Apr-2016 David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to set_extent_bits

All callers pass GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>


# 894b36e3 07-Mar-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: cleanup error handling in extent_write_cached_pages

Now that we bail out immediately if ->writepage() returns an error,
we don't need an extra error to retain the error code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a9132667 07-Mar-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: make mapping->writeback_index point to the last written page

If sequential writer is writing in the middle of the page and it just redirties
the last written page by continuing from it.

In the above case this can end up with seeking back to that firstly redirtied
page after writing all the pages at the end of file because btrfs updates
mapping->writeback_index to 1 past the current one.

For non-cow filesystems, the cost is only about extra seek, while for cow
filesystems such as btrfs, it means unnecessary fragments.

To avoid it, we just need to continue writeback from the last written page.

This also updates btrfs to behave like what write_cache_pages() does, ie, bail
out immediately if there is an error in writepage().

<Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html>

Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ea1754a0 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage

Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f827ba9a 22-Feb-2016 Arnd Bergmann <arnd@arndb.de>

btrfs: avoid uninitialized variable warning

With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:

fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196ae719 ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>


# 5598e900 29-Jan-2016 Kinglong Mee <kinglongmee@gmail.com>

btrfs: drop null testing before destroy functions

Cleanup.

kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 47dc196a 11-Feb-2016 David Sterba <dsterba@suse.com>

btrfs: use proper type for failrec in extent_state

We use the private member of extent_state to store the failrec and play
pointless pointer games.

Signed-off-by: David Sterba <dsterba@suse.com>


# 7f042a83 27-Jan-2016 Filipe Manana <fdmanana@suse.com>

Btrfs: remove no longer used function extent_read_full_page_nolock()

Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".

Signed-off-by: Filipe Manana <fdmanana@suse.com>


# dbfdb6d1 21-Jan-2016 Chandan Rajendra <chandan@linux.vnet.ibm.com>

Btrfs: Search for all ordered extents that could span across a page

In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ee22184b 14-Dec-2015 Byongho Lee <bhlee.kernel@gmail.com>

Btrfs: use linux/sizes.h to represent constants

We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0f331229 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: add extent buffer bitmap sanity tests

Sanity test the extent buffer bitmap operations (test, set, and clear)
against the equivalent standard kernel operations.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 3e1e8bb7 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: add extent buffer bitmap operations

These are going to be used for the free space tree bitmap items.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 35de6db2 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make set_range_writeback return void

Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.

Signed-off-by: David Sterba <dsterba@suse.com>


# f6311572 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make extent_range_redirty_for_io return void

Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.

Signed-off-by: David Sterba <dsterba@suse.com>


# bd1fa4f0 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make extent_range_clear_dirty_for_io return void

Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.

Signed-off-by: David Sterba <dsterba@suse.com>


# b5227c07 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make end_extent_writepage return void

Does not return any errors, nor anything from the callgraph. The branch
in end_bio_extent_writepage has been skipped since
5fd02043553b ("Btrfs: finish ordered extents in their own thread").

Signed-off-by: David Sterba <dsterba@suse.com>


# a9d93e17 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make extent_clear_unlock_delalloc return void

Does not return any errors, nor anything from the callgraph.

Signed-off-by: David Sterba <dsterba@suse.com>


# 69ba3927 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make clear_extent_buffer_uptodate return void

Does not return any errors, nor anything from the callgraph.

Signed-off-by: David Sterba <dsterba@suse.com>


# 09c25a8c 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make set_extent_buffer_uptodate return void

Does not return any errors, nor anything from the callgraph.

Signed-off-by: David Sterba <dsterba@suse.com>


# cd716d8f 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make lock_extent static inline

One call less reduces stack usage, code slightly reduced as well.

Signed-off-by: David Sterba <dsterba@suse.com>


# ff13db41 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: drop unused parameter from lock_extent_bits

We've always passed 0. Stack usage will slightly decrease.

Signed-off-by: David Sterba <dsterba@suse.com>


# e83b1d91 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make clear_extent_bit helpers static inline

The funcions just wrap the clear_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
decreases:

text data bss dec hex filename
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko.before
939651 43670 23144 1006465 f5b81 fs/btrfs/btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.com>


# c6317955 03-Dec-2015 David Sterba <dsterba@suse.com>

btrfs: make set_extent_bit helpers static inline

The funcions just wrap the set_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
increases:

text data bss dec hex filename
938427 43670 23144 1005241 f56b9 fs/btrfs/btrfs.ko.before
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko

Signed-off-by: David Sterba <dsterba@suse.com>


# d0164adc 06-Nov-2015 Mel Gorman <mgorman@techsingularity.net>

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fefdc557 12-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: extent_io: Introduce new function clear_record_extent_bits()

Introduce new function clear_record_extent_bits(), which will clear bits
for given range and record the details about which ranges are cleared
and how many bytes in total it changes.

This provides the basis for later qgroup reserve codes.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d38ed27f 12-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: extent_io: Introduce new function set_record_extent_bits

Introduce new function set_record_extent_bits(), which will not only set
given bits, but also record how many bytes are changed, and detailed
range info.

This is quite important for later qgroup reserve framework.
The number of bytes will be used to do qgroup reserve, and detailed
range info will be used to cleanup for EQUOT case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5e6ecb36 13-Oct-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix double range unlock of hole region when reading page

If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.

Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.

Fix this by leaving the unlock exclusively to the caller.

Signed-off-by: Filipe Manana <fdmanana@suse.com>


# f14d104d 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: switch more printks to our helpers

Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.

Signed-off-by: David Sterba <dsterba@suse.com>


# 94647322 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: switch message printers to ratelimited variants

Signed-off-by: David Sterba <dsterba@suse.com>


# b14af3b4 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: switch message printers to ratelimited _in_rcu variants

Signed-off-by: David Sterba <dsterba@suse.com>


# 808f80b4 28-Sep-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: update fix for read corruption of compressed and shared extents

My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().

The following test case for fstests reproduces the issue:

seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15

_cleanup()
{
rm -f $tmp.*
}

# get standard environment, filters and checks
. ./common/rc
. ./common/filter

# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner

rm -f $seqres.full

test_clone_and_read_compressed_extent()
{
local mount_opts=$1

_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts

# Create our test file with a single extent of 64Kb that is going to
# be compressed no matter which compression algo is used (zlib/lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
$SCRATCH_MNT/foo | _filter_xfs_io

# Now clone the compressed extent into an adjacent file offset.
$CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo

echo "File digest before unmount:"
md5sum $SCRATCH_MNT/foo | _filter_scratch

# Remount the fs or clear the page cache to trigger the bug in
# btrfs. Because the extent has an uncompressed length that is a
# multiple of 16 pages, all the pages belonging to the second range
# of the file (64K to 128K), which points to the same extent as the
# first range (0K to 64K), had their contents full of zeroes instead
# of the byte 0xaa. This was a bug exclusively in the read path of
# compressed extents, the correct data was stored on disk, btrfs
# just failed to fill in the pages correctly.
_scratch_remount

echo "File digest after remount:"
# Must match the digest we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
}

echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"

_scratch_unmount

echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"

status=0
exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>


# 005efedf 14-Sep-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix read corruption of compressed and shared extents

If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.

Consider the following example:

File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K

[extent X, compressed length = 4K uncompressed length = 16K]

If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).

So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.

The following test case for fstests triggers the issue:

seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15

_cleanup()
{
rm -f $tmp.*
}

# get standard environment, filters and checks
. ./common/rc
. ./common/filter

# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner

rm -f $seqres.full

test_clone_and_read_compressed_extent()
{
local mount_opts=$1

_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts

# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io

# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo

# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io

$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar

echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch

# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount

echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}

echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"

_scratch_unmount

echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"

status=0
exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>


# 3a9508b0 21-Aug-2015 Chris Mason <clm@fb.com>

btrfs: fix compile when block cgroups are not enabled

bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them. It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.

The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.

Signed-off-by: Chris Mason <clm@fb.com>


# d1b5c567 19-Aug-2015 Michal Hocko <mhocko@suse.com>

btrfs: Prevent from early transaction abort

Btrfs relies on GFP_NOFS allocation when committing the transaction but
this allocation context is rather weak wrt. reclaim capabilities. The
page allocator currently tries hard to not fail these allocations if
they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
currently but there is an attempt to move away from the default no-fail
behavior and allow these allocation to fail more eagerly. And this would
lead to a pre-mature transaction abort as follows:

[ 55.328093] Call Trace:
[ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] ------------[ cut here ]------------
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by being explicit about the no-fail behavior of this allocation
path and use __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# b54ffb73 19-May-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: remove bio_get_nr_vecs()

We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# da2f0f74 02-Jul-2015 Chris Mason <clm@fb.com>

Btrfs: add support for blkio controllers

This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.

Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.

Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.

The end result is able to throttle nicely on single disk filesystems. A
little more work is required for multi-device filesystems.

Signed-off-by: Chris Mason <clm@fb.com>


# 4246a0b6 20-Jul-2015 Christoph Hellwig <hch@lst.de>

block: add a bi_error field to struct bio

Currently we have two different ways to signal an I/O error on a BIO:

(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0d2b2372 19-May-2015 Josef Bacik <jbacik@fb.com>

Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap

We should be doing this, it's weird we hadn't been doing this.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 0f31871f 14-May-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: wake up extent state waiters on unlock through clear_extent_bits

When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
through free_io_failure(), we weren't waking up any tasks waiting for the
extent's state EXTENT_LOCKED bit, leading to an hang.

So make sure clear_extent_bits() ends up waking up any waiters if the
bit EXTENT_LOCKED is supplied by its callers.

Zygo Blaxell was experiencing such hangs at inode eviction time after
file unlinks. Thanks to him for a set of scripts to reproduce the issue.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# b25de9d6 24-Apr-2015 Christoph Hellwig <hch@lst.de>

block: remove BIO_EOPNOTSUPP

Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 062c19e9 23-Apr-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON

There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us it the following BUG_ON at
btrfs_release_extent_buffer_page:

BUG_ON(extent_buffer_under_io(eb))

The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.

Here follows a sequence of steps that leads to the BUG_ON.

CPU 0 CPU 1 CPU 2

path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X

btrfs_release_path(path)
unlocks X

reads eb X
X->refs incremented to 3
locks eb X
btrfs_del_items(X)
X becomes empty
clean_tree_block(X)
clear EXTENT_BUFFER_DIRTY from X
btrfs_del_leaf(X)
unlocks X
extent_buffer_get(X)
X->refs incremented to 4
btrfs_free_tree_block(X)
X's range is not pinned
X's range added to free
space cache
free_extent_buffer_stale(X)
lock X->refs_lock
set EXTENT_BUFFER_STALE on X
release_extent_buffer(X)
X->refs decremented to 3
unlocks X->refs_lock
btrfs_release_path()
unlocks X
free_extent_buffer(X)
X->refs becomes 2

__btrfs_cow_block(Y)
btrfs_alloc_tree_block()
btrfs_reserve_extent()
find_free_extent()
gets offset == X->start
btrfs_init_new_buffer(X->start)
btrfs_find_create_tree_block(X->start)
alloc_extent_buffer(X->start)
find_extent_buffer(X->start)
finds eb X in radix tree

free_extent_buffer(X)
lock X->refs_lock
test X->refs == 2
test bit EXTENT_BUFFER_STALE is set
test !extent_buffer_under_io(eb)

increments X->refs to 3
mark_extent_buffer_accessed(X)
check_buffer_tree_ref(X)
--> does nothing,
X->refs >= 2 and
EXTENT_BUFFER_TREE_REF
is set in X
clear EXTENT_BUFFER_STALE from X
locks X
btrfs_mark_buffer_dirty()
set_extent_buffer_dirty(X)
check_buffer_tree_ref(X)
--> does nothing, X->refs >= 2 and
EXTENT_BUFFER_TREE_REF is set
sets EXTENT_BUFFER_DIRTY on X

test and clear EXTENT_BUFFER_TREE_REF
decrements X->refs to 2
release_extent_buffer(X)
decrements X->refs to 1
unlock X->refs_lock

unlock X
free_extent_buffer(X)
lock X->refs_lock
release_extent_buffer(X)
decrements X->refs to 0
btrfs_release_extent_buffer_page(X)
BUG_ON(extent_buffer_under_io(X))
--> EXTENT_BUFFER_DIRTY set on X

Fix this by making find_extent buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A more clean alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.

A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:

[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>] [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008] [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008] [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008] [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008] [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008] [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008] [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008] [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008] [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008] [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008] [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008] [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008] [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0

I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):

while true; do
mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt
snapshot_cmd="btrfs subvolume snapshot -r /mnt"
snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
umount /mnt
done

Which usually triggers the BUG_ON within less than 24 hours:

[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>] [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031] [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031] [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031] [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031] [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031] [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031] [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031] [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054] [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054] [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054] [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054] [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054] [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054] [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054] [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054] [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054] [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054] [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 5d2361db 09-Feb-2015 Forrest Liu <forrestl@synology.com>

Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent

btrfs_release_extent_buffer_page() can't handle dummy extent that
allocated by btrfs_clone_extent_buffer() properly. That is because
reference count of pages that allocated by btrfs_clone_extent_buffer()
was 2, 1 by alloc_page(), and another by attach_extent_buffer_page().

Running following command repeatly can check this memory leak problem

btrfs inspect-internal inode-resolve 256 /mnt/btrfs

Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5ca64f45 24-Feb-2015 Omar Sandoval <osandov@osandov.com>

btrfs: fix race on ENOMEM in alloc_extent_buffer

Consider the following interleaving of overlapping calls to
alloc_extent_buffer:

Call 1:

- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages

Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
mark_extent_buffer_accessed on it

mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.

The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 26e726af 24-Mar-2015 Chengyu Song <csong84@gatech.edu>

btrfs: incorrect handling for fiemap_fill_next_extent return

fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# bcb7e449 16-Mar-2015 Josef Bacik <jbacik@fb.com>

Btrfs: just free dummy extent buffers

If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>


# dcab6a3b 11-Feb-2015 Josef Bacik <jbacik@fb.com>

Btrfs: account for large extents with enospc

On our gluster boxes we stream large tar balls of backups onto our fses. With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent. The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need. To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8d38633c 11-Feb-2015 Konstantin Khlebnikov <koct9i@gmail.com>

page_writeback: put account_page_redirty() after set_page_dirty()

Helper account_page_redirty() fixes dirty pages counter for redirtied
pages. This patch puts it after dirtying and prevents temporary
underflows of dirtied pages counters on zone/bdi and current->nr_dirtied.

Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 289454ad 05-Jan-2015 Naohiro Aota <naota@elisp.net>

btrfs: clear bio reference after submit_one_bio()

After submit_one_bio(), `bio' can go away. However submit_extent_page()
leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
It will cause invalid paging request when submit_extent_page() is called
next time.

I reproduced ENOMEM case with the following script (need
CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).

#!/bin/bash

dmesgout=dmesg.txt
start=100000
end=300000
step=1000

# btrfs options
device=/dev/vdb1
directory=/mnt/btrfs

# fault-injection options
percent=100
times=3

mkdir -p $directory || exit 1
mount -o compress $device $directory || exit 1

rm -f $directory/file || exit 1
dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1

for interval in `seq $start $step $end`; do
dmesg -C
echo 1 > /proc/sys/vm/drop_caches
sync
export FAILCMD_TYPE=fail_page_alloc
./failcmd.sh -p $percent -t $times -i $interval \
--ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
-- \
cat $directory/file > /dev/null
dmesg > ${dmesgout}
if grep -q BUG: ${dmesgout}; then
cat ${dmesgout}
exit 1
fi
done

umount $directory
exit 0

Signed-off-by: Naohiro Aota <naota@elisp.net>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 6e9606d2 20-Jan-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

Btrfs: add ref_count and free function for btrfs_bio

1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
forced data type checking in compile.

Changelog v1->v2:
Rename following by David Sterba's suggestion:
put_btrfs_bio() -> btrfs_put_bio()
get_btrfs_bio() -> btrfs_get_bio()
bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 9ee49a04 14-Jan-2015 David Sterba <dsterba@suse.cz>

btrfs: switch extent_state state to unsigned

Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.

The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.

struct extent_state {
u64 start; /* 0 8 */
u64 end; /* 8 8 */
struct rb_node rb_node; /* 16 24 */
wait_queue_head_t wq; /* 40 24 */
/* --- cacheline 1 boundary (64 bytes) --- */
atomic_t refs; /* 64 4 */

/* XXX 4 bytes hole, try to pack */

long unsigned int state; /* 72 8 */
u64 private; /* 80 8 */

/* size: 88, cachelines: 2, members: 7 */
/* sum members: 84, holes: 1, sum holes: 4 */
/* last cacheline: 24 bytes */
};

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 6e1103a6 25-Dec-2014 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>

btrfs: fix state->private cast on 32 bit machines

Suppress the following warning displayed on building 32bit (i686) kernel.

===============================================================================
...
CC [M] fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
failrec = (struct io_failure_record *)state->private;
...
===============================================================================

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Chris Mason <clm@fb.com>


# ce3e6984 14-Jun-2014 David Sterba <dsterba@suse.cz>

btrfs: sink parameter len to alloc_extent_buffer

Because we're using globally known nodesize. Do the same for the sanity
test function variant.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 3f556f78 14-Jun-2014 David Sterba <dsterba@suse.cz>

btrfs: unify extent buffer allocation api

Make the extent buffer allocation interface consistent. Cloned eb will
set a valid fs_info. For dummy eb, we can drop the length parameter and
set it from fs_info.

The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 23d79d81 14-Jun-2014 David Sterba <dsterba@suse.cz>

btrfs: use GFP_NOFS in __alloc_extent_buffer directly

Same mask from all callers.

Signed-off-by: David Sterba <dsterba@suse.cz>


# c7bc6319 03-Nov-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: avoid premature -ENOMEM in clear_extent_bit()

We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c8fd3de7 12-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early

We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e38e2ed7 12-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: make find_first_extent_bit be able to cache any state

Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).

This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 704de49d 06-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: set page and mapping error on compressed write failure

If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.

Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 656f30db 25-Sep-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: be aware of btree inode write errors to avoid fs corruption

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 81465028 23-Sep-2014 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix crash of btrfs_release_extent_buffer_page

This is actually inspired by Filipe's patch. When write_one_eb() fails on
submit_extent_page(), it'll give up writing this eb and mark it with
EXTENT_BUFFER_IOERR. So if it's not the last page that encounter the failure,
there are some left pages which remain DIRTY, and if a later COW on this eb
happens, ie. eb is COWed and freed, it'd run into BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));

This adds the missing clear_page_dirty_for_io() for the rest pages of eb.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 55e3bd2e 22-Sep-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: add missing end_page_writeback on submit_extent_page failure

If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# fb85fc9a 30-Jul-2014 David Sterba <dsterba@suse.cz>

btrfs: kill extent_buffer_page helper

It used to be more complex but now it's just a simple array access.

Signed-off-by: David Sterba <dsterba@suse.cz>


# a50924e3 30-Jul-2014 David Sterba <dsterba@suse.cz>

btrfs: drop constant param from btrfs_release_extent_buffer_page

All callers use the same value, simplify the function.

Signed-off-by: David Sterba <dsterba@suse.cz>


# f612496b 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: cleanup the read failure record after write or when the inode is freeing

After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
the corrupted data is corrected, so the failure record can be removed. And if
some errors happen on the mirrors, we also needn't worry about it because the
failure record will be recreated if we read the same place again.

Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8b110e39 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: implement repair function when direct read fails

This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
mirror.
- When the io on the mirror ends, we will insert the endio work into the
dedicated btrfs workqueue, not common read endio workqueue, because the
original endio work is still blocked in the btrfs endio workqueue, if we
insert the endio work of the io on the mirror into that workqueue, deadlock
would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1203b681 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: modify clean_io_failure and make it suit direct io

We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# ffdd2018 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: modify repair_io_failure and make it suit direct io

The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2fe6303e 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: split bio_readpage_error into several functions

The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 454ff3de 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: Cleanup unused variant and argument of IO failure handlers

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 6c387ab2 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix missing error handler if submiting re-read bio fails

We forgot to free failure record and bio after submitting re-read bio failed,
fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c1dc0896 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: do file data check by sub-bio's self

Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 23ea8e5a 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: load checksum data once when submitting a direct read io

The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# dc046b10 10-Sep-2014 Josef Bacik <jbacik@fb.com>

Btrfs: make fiemap not blow when you have lots of snapshots

We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared. This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you. So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have. This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a583c026 19-Aug-2014 Liu Bo <bo.li.liu@oracle.com>

Btrfs: cleanup the same name in end_bio_extent_readpage

We've defined a 'offset' out of bio_for_each_segment_all.

This is just a clean rename, no function changes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 27a3507d 06-Jul-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: reduce size of struct extent_state

The tree field of struct extent_state was only used to figure out if
an extent state was connected to an inode's io tree or not. For this
we can just use the rb_node field itself.

On a x86_64 system with this change the sizeof(struct extent_state) is
reduced from 96 bytes down to 88 bytes, meaning that with a page size
of 4096 bytes we can now store 46 extent states per page instead of 42.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 962a298f 04-Jun-2014 David Sterba <dsterba@suse.cz>

btrfs: kill the key type accessor helpers

btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 38c1c2e4 19-Aug-2014 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix crash on endio of reading corrupted block

The crash is

------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]

This is in fact a regression.

It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2c91943b 17-Jul-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Return right extent when fiemap gives unaligned offset and len.

When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.

The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 74316201 06-Jul-2014 NeilBrown <neilb@suse.de>

sched: Remove proliferation of wait_on_bit() action functions

The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.

Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.

All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3e2426bd 11-Jun-2014 Eric Sandeen <sandeen@redhat.com>

btrfs: fix use of uninit "ret" in end_extent_writepage()

If this condition in end_extent_writepage() is false:

if (tree->ops && tree->ops->writepage_end_io_hook)

we will then test an uninitialized "ret" at:

ret = ret < 0 ? ret : -EIO;

The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.

Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.

Signed-of-by: Eric Sandeen <sandeen@redhat.com>

Signed-off-by: Chris Mason <clm@fb.com>


# 550ac1d8 30-Jan-2014 Gerhard Heift <gerhard@heift.name>

btrfs: new function read_extent_buffer_to_user

This new function reads the content of an extent directly to user memory.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>


# 40f76580 21-May-2014 Chris Mason <clm@fb.com>

Btrfs: split up __extent_writepage to lower stack usage

__extent_writepage has two unrelated parts. First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.

This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.

Signed-off-by: Chris Mason <clm@fb.com>


# 0e378df1 19-May-2014 Chris Mason <clm@fb.com>

Btrfs: cut down stack usage in btree_write_cache_pages

This adds noinline_for_stack to two helpers used by
btree_write_cache_pages. It shaves us down from 424 bytes on the
stack to 280.

Signed-off-by: Chris Mason <clm@fb.com>


# 7d788742 21-May-2014 Chris Mason <clm@fb.com>

Btrfs: fix double free in find_lock_delalloc_range

We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org


# faa2dbf0 07-May-2014 Josef Bacik <jbacik@fb.com>

Btrfs: add sanity tests for new qgroup accounting code

This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5dca6eea 11-May-2014 Liu Bo <bo.li.liu@oracle.com>

Btrfs: mark mapping with error flag to report errors to userspace

According to commit 865ffef3797da2cac85b3354b5b6050dc9660978
(fs: fix fsync() error reporting),
it's not stable to just check error pages because pages can be
truncated or invalidated, we should also mark mapping with error
flag so that a later fsync can catch the error.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 61391d56 09-May-2014 Filipe Manana <fdmanana@gmail.com>

Btrfs: fix hang on error (such as ENOSPC) when writing extent pages

When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:

[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001
[ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)

It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.

This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:

mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt

fsstress -p 6 -d /mnt -n 100000 -x \
"btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
-f allocsp=0 \
-f bulkstat=0 \
-f bulkstat1=0 \
-f chown=0 \
-f creat=1 \
-f dread=0 \
-f dwrite=0 \
-f fallocate=1 \
-f fdatasync=0 \
-f fiemap=0 \
-f freesp=0 \
-f fsync=0 \
-f getattr=0 \
-f getdents=0 \
-f link=0 \
-f mkdir=0 \
-f mknod=0 \
-f punch=1 \
-f read=0 \
-f readlink=0 \
-f rename=0 \
-f resvsp=0 \
-f rmdir=0 \
-f setxattr=0 \
-f stat=0 \
-f symlink=0 \
-f sync=0 \
-f truncate=1 \
-f unlink=0 \
-f unresvsp=0 \
-f write=4

So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2457aec6 04-Jun-2014 Mel Gorman <mgorman@suse.de>

mm: non-atomically mark page accessed during page cache allocation where possible

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e857c58 17-Mar-2014 Peter Zijlstra <peterz@infradead.org>

arch: Mass conversion of smp_mb__*()

Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c50d3e71 31-Mar-2014 Filipe Manana <fdmanana@gmail.com>

Btrfs: more efficient io tree navigation on wait_extent_bit

If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a26e8c9f 28-Mar-2014 Josef Bacik <jbacik@fb.com>

Btrfs: don't clear uptodate if the eb is under IO

So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.

Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# f2071b21 12-Feb-2014 Filipe Manana <fdmanana@gmail.com>

Btrfs: more efficient split extent state insertion

When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# cbc0e928 25-Feb-2014 Filipe Manana <fdmanana@gmail.com>

Btrfs: remove unneeded field / smaller extent_map structure

We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.

This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64 bits system).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# efe120a0 20-Dec-2013 Frank Holton <fholton@gmail.com>

Btrfs: convert printk to btrfs_ and fix BTRFS prefix

Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# f28491e0 16-Dec-2013 Josef Bacik <jbacik@fb.com>

Btrfs: move the extent buffer radix tree into the fs_info

I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 34b41ace 13-Dec-2013 Josef Bacik <jbacik@fb.com>

Btrfs: use a bit to track if we're in the radix tree

For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers. With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal. This will give me a way to do that. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a5dee37d 13-Dec-2013 Josef Bacik <jbacik@fb.com>

Btrfs: deal with io_tree->mapping being NULL

I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 12cfbad9 26-Nov-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: more efficient extent state insertions

Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.

This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:

Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000
51.000 - 93.933: 693 ########
93.933 - 172.314: 938 ##########
172.314 - 315.408: 856 #########
315.408 - 576.646: 95 #
576.646 - 6415.830: 888 ##########
6415.830 - 11713.809: 1024 ###########
11713.809 - 21386.000: 4816 #####################################################

So traversing such trees can take some significant time that can
easily be avoided.

Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:

sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync

sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync

Before this change:

sequential writes: 69.28Mb/sec (average of 5 runs)
random writes: 4.14Mb/sec (average of 5 runs)

After this change:

sequential writes: 69.91Mb/sec (average of 5 runs)
random writes: 5.69Mb/sec (average of 5 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c42ac0bc 26-Nov-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: add missing extent state caching calls

When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 68ba990f 24-Nov-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: fix extent boundary check in bio_readpage_error

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 50892bac 04-Nov-2013 Valentina Giusti <valentina.giusti@microon.de>

btrfs: remove unused variables from extent_io.c

Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c170bbb4 24-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4f024f37 11-Oct-2013 Kent Overstreet <kmo@daterainc.com>

block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6


# 2c30c71b 07-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: Convert various code to bio_for_each_segment()

With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>


# 33879d45 23-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>


# 908960c6 03-Nov-2013 Ilya Dryomov <idryomov@gmail.com>

Btrfs: disable online raid-repair on ro mounts

This disables the "if needed, write the good copy back before the read
is completed" part of the read sequence for read-only mounts.

Cc: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 67871254 30-Oct-2013 Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>

btrfs: Fix checkpatch.pl warning of spacing issues

Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# fae7f21c 30-Oct-2013 Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>

btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)

Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 8b558c5f 16-Oct-2013 Zach Brown <zab@redhat.com>

btrfs: remove fs/btrfs/compat.h

fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink(). This doesn't belong in mainline.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 1877e1a7 16-Oct-2013 Zach Brown <zab@redhat.com>

btrfs: remove move_pages()

move_pages() has an inefficient backwards byte copy of regions of two
different pages. They're different pages so the regions won't overlap
and it could use memcpy().

At that point, though, move_pages() would be a slightly dimmer
re-implementation of copy_pages() that lacked the test for overlapping
page regions.

So remove move_pages() and just call copy_pages().

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 452c75c3 07-Oct-2013 Chandra Seetharaman <sekharan@us.ibm.com>

Btrfs: Simplify the logic in alloc_extent_buffer() for existing extent buffer case

alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().

Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.

While at it, this patch
- changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
with find_extent_buffer() to reduce redundancy.
- removes the unused argument 'len' to find_extent_buffer().

Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 294e30fe 08-Oct-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add tests for find_lock_delalloc_range

So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way. So this is obviously a
candidate for some testing. This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages. Then it addes a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to. With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# fe09e16c 21-Sep-2013 Liu Bo <bo.li.liu@oracle.com>

Btrfs: export btrfs space shared info to userspace

Similar to ocfs2, btrfs also supports that extents can be shared by
different inodes, and there are some userspace tools requesting
for this kind of 'space shared infomation'.[1]

ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs.

[1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 7bf811a5 07-Oct-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: limit delalloc pages outside of find_delalloc_range

Liu fixed part of this problem and unfortunately I steered him in slightly the
wrong direction and so didn't completely fix the problem. The problem is we
limit the size of the delalloc range we are looking for to max bytes and then we
try to lock that range. If we fail to lock the pages in that range we will
shrink the max bytes to a single page and re loop. However if our first page is
inside of the delalloc range then we will end up limiting the end of the range
to a period before our first page. This is illustrated below

[0 -------- delalloc range --------- 256mb]
[page]

So find_delalloc_range will return with delalloc_start as 0 and end as 128mb,
and then we will notice that delalloc_start < *start and adjust it up, but not
adjust delalloc_end up, so things go sideways. To fix this we need to not limit
the max bytes in find_delalloc_range, but in find_lock_delalloc_range and that
way we don't end up with this confusion. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# b208c2f7 19-Sep-2013 Darrick J. Wong <darrick.wong@oracle.com>

btrfs: Fix crash due to not allocating integrity data for a bioset

When btrfs creates a bioset, we must also allocate the integrity data pool.
Otherwise btrfs will crash when it tries to submit a bio to a checksumming
disk:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150
PGD 2305e4067 PUD 23063d067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: btrfs scsi_debug xfs ext4 jbd2 ext3 jbd mbcache
sch_fq_codel eeprom lpc_ich mfd_core nfsd exportfs auth_rpcgss af_packet
raid6_pq xor zlib_deflate libcrc32c [last unloaded: scsi_debug]
CPU: 1 PID: 4486 Comm: mount Not tainted 3.12.0-rc1-mcsum #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8802451c9720 ti: ffff880230698000 task.ti: ffff880230698000
RIP: 0010:[<ffffffff8111e28a>] [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150
RSP: 0018:ffff880230699688 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000005f8445
RDX: 0000000000000001 RSI: 0000000000000010 RDI: 0000000000000000
RBP: ffff8802306996f8 R08: 0000000000011200 R09: 0000000000000008
R10: 0000000000000020 R11: ffff88009d6e8000 R12: 0000000000011210
R13: 0000000000000030 R14: ffff8802306996b8 R15: ffff8802451c9720
FS: 00007f25b8a16800(0000) GS:ffff88024fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 0000000230576000 CR4: 00000000000007e0
Stack:
ffff8802451c9720 0000000000000002 ffffffff81a97100 0000000000281250
ffffffff81a96480 ffff88024fc99150 ffff880228d18200 0000000000000000
0000000000000000 0000000000000040 ffff880230e8c2e8 ffff8802459dc900
Call Trace:
[<ffffffff811b2208>] bio_integrity_alloc+0x48/0x1b0
[<ffffffff811b26fc>] bio_integrity_prep+0xac/0x360
[<ffffffff8111e298>] ? mempool_alloc+0x58/0x150
[<ffffffffa03e8041>] ? alloc_extent_state+0x31/0x110 [btrfs]
[<ffffffff81241579>] blk_queue_bio+0x1c9/0x460
[<ffffffff8123e58a>] generic_make_request+0xca/0x100
[<ffffffff8123e639>] submit_bio+0x79/0x160
[<ffffffffa03f865e>] btrfs_map_bio+0x48e/0x5b0 [btrfs]
[<ffffffffa03c821a>] btree_submit_bio_hook+0xda/0x110 [btrfs]
[<ffffffffa03e7eba>] submit_one_bio+0x6a/0xa0 [btrfs]
[<ffffffffa03ef450>] read_extent_buffer_pages+0x250/0x310 [btrfs]
[<ffffffff8125eef6>] ? __radix_tree_preload+0x66/0xf0
[<ffffffff8125f1c5>] ? radix_tree_insert+0x95/0x260
[<ffffffffa03c66f6>] btree_read_extent_buffer_pages.constprop.128+0xb6/0x120
[btrfs]
[<ffffffffa03c8c1a>] read_tree_block+0x3a/0x60 [btrfs]
[<ffffffffa03caefd>] open_ctree+0x139d/0x2030 [btrfs]
[<ffffffffa03a282a>] btrfs_mount+0x53a/0x7d0 [btrfs]
[<ffffffff8113ab0b>] ? pcpu_alloc+0x8eb/0x9f0
[<ffffffff81167305>] ? __kmalloc_track_caller+0x35/0x1e0
[<ffffffff81176ba0>] mount_fs+0x20/0xd0
[<ffffffff81191096>] vfs_kern_mount+0x76/0x120
[<ffffffff81193320>] do_mount+0x200/0xa40
[<ffffffff81135cdb>] ? strndup_user+0x5b/0x80
[<ffffffff81193bf0>] SyS_mount+0x90/0xe0
[<ffffffff8156d31d>] system_call_fastpath+0x1a/0x1f
Code: 4c 8d 75 a8 4c 89 6d e8 45 89 e0 4c 8d 6f 30 48 89 5d d8 41 83 e0 af 48
89 fb 49 83 c6 18 4c 89 7d f8 65 4c 8b 3c 25 c0 b8 00 00 <48> 8b 73 18 44 89 c7
44 89 45 98 ff 53 20 48 85 c0 48 89 c2 74
RIP [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150
RSP <ffff880230699688>
CR2: 0000000000000018
---[ end trace 7a96042017ed21e2 ]---

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 385fe0be 01-Oct-2013 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix crash of compressed writes

The crash[1] is found by xfstests/generic/208 with "-o compress",
it's not reproduced everytime, but it does panic.

The bug is quite interesting, it's actually introduced by a recent commit
(573aecafca1cf7a974231b759197a1aebcf39c2a,
Btrfs: actually limit the size of delalloc range).

Btrfs implements delay allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including find disk space and create
ordered extent and so on.
(4) submit the page A.

It runs well in normal cases, but if we're in a racy case, eg.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case, we need to fall back to search the state tree again with
a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).

The mentioned commit has a side effect, that is, in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we're in the case of (delalloc_end <= *start) and return with (found > 0).

This ends with not locking delalloc pages but making ->writepage still
process them, and the crash happens.

This fixes it by just thinking that we find nothing and returning to caller
as the caller knows how to deal with it properly.

[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
[...]
RIP: 0010:[<ffffffff810f5093>] [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
[ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620
[ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731] [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731] [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731] [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
[ 4934.248731] [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731] [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731] [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731] [<ffffffff810608f5>] kthread+0x8d/0x95
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731] RSP <ffff8801869f9c48>
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 573aecaf 30-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: actually limit the size of delalloc range

So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb. This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space. Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents. This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around. So fix
this by actually enforcing the limit. With this patch I'm no longer seeing
giant 1.5gb extents. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 778746b5 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: PAGE_CACHE_SIZE is already unsigned long

PAGE_CACHE_SIZE == PAGE_SIZE is "unsigned long" everywhere, so there's no
need to cast it to "unsigned long".

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# c1c9ff7c 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Remove superfluous casts from u64 to unsigned long long

u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 171170c1 14-Aug-2013 Sergei Trofimovich <slyfox@gentoo.org>

btrfs: mark some local function as 'static'

Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 35a3621b 14-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: get rid of sparse warnings

make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__

I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.

All these changes should be obviously safe.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 4b384318 06-Aug-2013 Mark Fasheh <mfasheh@suse.de>

btrfs: Introduce extent_read_full_page_nolock()

We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.

Partial credit for this patch goes to Gabriel de Perthuis <g2p.code@gmail.com>
as I have included a fix from him to the original patch which avoids a
deadlock on compressed extents.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9ec72677 07-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: stop using GFP_ATOMIC when allocating rewind ebs

There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# db7f3436 07-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: deal with enomem in the rewind path

We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# c2790a2e 29-Jul-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: cleanup arguments to extent_clear_unlock_delalloc

This patch removes the io_tree argument for extent_clear_unlock_delalloc since
we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree
operations from the page operations. This way we just pass in the extent bits
we want to clear and then pass in the operations we want done to the pages.
This is because I'm going to fix what extent bits we clear in some cases and
rather than add a bunch of new flags we'll just use the actual extent bits we
want to clear. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 125bac01 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: cache the extent map struct when reading several pages

When we read several pages at once, we needn't get the extent map object
every time we deal with a page, and we can cache the extent map object.
So, we can reduce the search time of the extent map, and besides that, we
also can reduce the lock contention of the extent map tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9974090b 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: batch the extent state operation when reading pages

In the past, we cached the checksum value in the extent state object, so we
had to split the extent state object by the block size, or we had no space
to keep this checksum value. But it increased the lock contention of the
extent state tree.

Now we removed this limit by caching the checksum into the bio object, so
it is unnecessary to do the extent state operations by the block size, we
can do it in batches, in this way, we can reduce the lock contention of
the extent state tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 883d0de4 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: batch the extent state operation in the end io handle of the read page

Before applying this patch, we set the uptodate flag and unlock the extent
by the page size, it is unnecessary, we can do it in batches, it can reduce
the lock contention of the extent state tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# facc8a22 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: don't cache the csum value into the extent state tree

Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.

Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# f2a09da9 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: add branch prediction hints in the read page end IO function

This patch add some branch prediction hints into the end IO function
of the read page, it reduced the percentage of the branch misses from
5.5% to 4.9%.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 09a7f7a2 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: remove unnecessary argument of bio_readpage_error()

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# b76bb701 05-Jul-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: do not offset physical if we're compressed

xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset. This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed. So we can return a physical value that isn't at all within
the area we have allocated on disk. Fix this by returning the start of the
extent if it is compressed no matter what the offset. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 7ee9e440 21-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: check if we can nocow if we don't have data space

We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write. This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue. With this patch we now pass xfstests
generic/274. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# a71754fc 17-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate

This has plagued us forever and I'm so over working around it. When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point. The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things. This also tends to bite the orphan
cleanup stuff too which keeps people from mounting. To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data. This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 8d599ae1 30-Apr-2013 David Sterba <dsterba@suse.cz>

btrfs: add debug check for extent_io range alignment

The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d47992f8 21-May-2013 Lukas Czerner <lczerner@redhat.com>

mm: change invalidatepage prototype to accept length

Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>


# 9be3395b 17-May-2013 Chris Mason <chris.mason@fusionio.com>

Btrfs: use a btrfs bioset instead of abusing bio internals

Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>


# 17a5adcc 15-May-2013 Alexandre Oliva <oliva@gnu.org>

btrfs: do away with non-whole_page extent I/O

end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error. This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.

It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case. Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!

I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# a52f4cd2 01-May-2013 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix off-by-one in fiemap

lock_extent/unlock_extent expect an exclusive end.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 41074888 29-Apr-2013 David Sterba <dsterba@suse.cz>

btrfs: use unsigned long type for extent state bits

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# f7a52a40 26-Apr-2013 David Sterba <dsterba@suse.cz>

btrfs: remove unused gfp mask parameter from release_extent_buffer callchain

It's unused since 0b32f4bbb423f02ac.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 48a3b636 25-Apr-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 6d49ba1b 22-Apr-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: move leak debug code to functions

Clean up the leak debugging in extent_io.c by moving
the debug code into functions. This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.

Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.

Thanks to Dave Sterba for the Kconfig bit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# fd8b2b61 24-Apr-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: cleanup destroy_marked_extents

We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d4c7ca86 19-Apr-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: use REQ_META for all metadata IO

We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cqroups. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# e4100d98 05-Apr-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: improve the performance of the csums lookup

It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.

According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).

# dd if=<mnt>/file of=/dev/null bs=1M count=1024

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 6b67a320 28-Mar-2013 Liu Bo <bo.li.liu@oracle.com>

Btrfs: pass NULL instead of 0

set_extent_bit()'s (u64 *failed_start) expects NULL not 0.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4adaa611 26-Mar-2013 Chris Mason <chris.mason@fusionio.com>

Btrfs: fix race between mmap writes and compression

Btrfs uses page_mkwrite to ensure stable pages during
crc calculations and mmap workloads. We call clear_page_dirty_for_io
before we do any crcs, and this forces any application with the file
mapped to wait for the crc to finish before it is allowed to change
the file.

With compression on, the clear_page_dirty_for_io step is happening after
we've compressed the pages. This means the applications might be
changing the pages while we are compressing them, and some of those
modifications might not hit the disk.

This commit adds the clear_page_dirty_for_io before compression starts
and makes sure to redirty the page if we have to fallback to
uncompressed IO as well.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Alexandre Oliva <oliva@gnu.org>
cc: stable@vger.kernel.org


# f73a1c7d 25-Sep-2012 Kent Overstreet <koverstreet@google.com>

block: Add bio_end_sector()

Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>


# 180e001c 14-Feb-2013 Paul Gortmaker <paul.gortmaker@windriver.com>

btrfs: fixup/remove module.h usage as required

We want to avoid module.h where posible, since it in turn includes
nearly all of header space. This means removing it where it is not
required, and using export.h where we are only exporting symbols via
EXPORT_SYMBOL and friends.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# b8dae313 28-Feb-2013 David Sterba <dsterba@suse.cz>

btrfs: use only inline_pages from extent buffer

The nodesize is capped at 64k and there are enough pages preallocated in
extent_buffer::inline_pages. The fallback to kmalloc never happened
because even on the smallest page size considered (4k) inline_pages
covered the needs.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# fda2832f 26-Feb-2013 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: cleanup for open-coded alignment

Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------

Sometimes these open-coded alignment is not so easy to understand for
newbie like me.

This commit changes the open-coded alignment to the ALIGN macro for a
better readability.

Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html

Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# c8f2f24b 11-Feb-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: remove unused extent io tree ops V2

Nobody uses these io tree ops anymore so just remove them and clean up the code
a bit. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# e2d84521 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use percpu counter for dirty metadata count

->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.

This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4eee4fa4 21-Dec-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use wrapper page_offset

Use wrapper page_offset to get byte-offset into filesystem object for page.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 242e18c7 29-Jan-2013 Chris Mason <chris.mason@fusionio.com>

Btrfs: reduce lock contention on extent buffer locks

The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.

These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 53b381b3 29-Jan-2013 David Woodhouse <David.Woodhouse@intel.com>

Btrfs: RAID5 and RAID6

This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 64a16701 15-Jul-2009 David Woodhouse <David.Woodhouse@intel.com>

Btrfs: add rw argument to merge_bio_hook()

We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 61891923 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: handle errors from btrfs_map_bio() everywhere

With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 3ec706c8 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree

This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 5d964051 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree

This is required for the device replace procedure in a later step.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 31b1a2bd 03-Nov-2012 Julia Lawall <Julia.Lawall@lip6.fr>

fs/btrfs: use WARN

Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 84167d19 11-Oct-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: Fix wrong error handling code

gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# f60b1b49 05-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix page leakage

Alloc_dummy_extent_buffer will not free the first page in the eb array if we
fail to allocate a page, fix this. Thanks,

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4804b382 05-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: do not warn_on when we cannot alloc a page for an extent buffer

It's just annoying and the user will have gotten a nice OOM killer message
so they are already fully aware they are screwed :). Thanks,

Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# edd33c99 05-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: don't bug on enomem in readpage

Get rid of the BUG_ON(ret == -ENOMEM) in __extent_read_full_page. Thanks,

Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 479ed9ab 29-Sep-2012 Robin Dong <sanbai@taobao.com>

btrfs: move inline function code to header file

When building btrfs from kernel code, it will report:

fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called
fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here
fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called
fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here

because of the wrong declaration of inline functions.

Signed-off-by: Robin Dong <sanbai@taobao.com>


# 7a2d6a64 01-Oct-2012 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: remove unnecessary IS_ERR in bio_readpage_error()

Because the value of extent_map is only a correct value or NULL,
so IS_ERR is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>


# e6138876 27-Sep-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: cache extent state when writing out dirty metadata pages

Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# de0022b9 25-Sep-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: do not async metadata csumming in certain situations

There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help. The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost. The other case is for our tree log. We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# b5bae261 14-Sep-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix race when getting the eb out of page->private

We can race when checking wether PagePrivate is set on a page and we
actually have an eb saved in the pages private pointer. We could have
easily written out this page and released it in the time that we did the
pagevec lookup and actually got around to looking at this page. So use
mapping->private_lock to ensure we get a consistent view of the
page->private pointer. This is inline with the alloc and releasepage paths
which use private_lock when manipulating page->private. Thanks,

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 8c0a8537 25-Sep-2012 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

fs: push rcu_barrier() from deactivate_locked_super() to filesystems

There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be3940c0 11-Sep-2012 Kent Overstreet <koverstreet@google.com>

btrfs: Kill some bi_idx references

For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.

scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.

The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>


# 837e1972 07-Sep-2012 David Sterba <dsterba@suse.cz>

btrfs: polish names of kmem caches

Usecase:

watch 'grep btrfs < /proc/slabinfo'

easy to watch all caches in one go.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 9e8a4a8b 05-Sep-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag

We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:

We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.

Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>


# 74dd17fb 07-Aug-2012 Chris Mason <chris.mason@fusionio.com>

Btrfs: fix btrfs send for inline items and compression

The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk. If we're compressed, this isn't
true, and so it was off into extents owned by other files.

It was also improperly handling inline extents. This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent. It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 5ee0844d 27-Aug-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: revert checksum error statistic which can cause a BUG()

Commit 442a4f6308e694e0fa6025708bd5e4e424bbf51c added btrfs device
statistic counters for detected IO and checksum errors to Linux 3.5.
The statistic part that counts checksum errors in
end_bio_extent_readpage() can cause a BUG() in a subfunction:
"kernel BUG at fs/btrfs/volumes.c:3762!"
That part is reverted with the current patch.
However, the counting of checksum errors in the scrub context remains
active, and the counting of detected IO errors (read, write or flush
errors) in all contexts remains active.

Cc: stable <stable@vger.kernel.org> # 3.5
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 67c9684f 20-Jul-2012 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: improve multi-thread buffer read

While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.

Here is a scenario in fio jobs:

We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.

And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:

t1 t2 t3 t4
add Page1
read Page1 add Page2
| read Page2 add Page3
| | read Page3 add Page4
| | | read Page4
-----|------------|-----------|-----------|--------
v v v v
bio bio bio bio

Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio. Thus, we can end up with far more bios
than we need.

Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.

With that said, we can make each bio hold more pages and reduce the number
of bios we need.

Here is some numbers taken from fio results:
w/o patch w patch
------------- -------- ---------------
READ: 745MB/s +25% 934MB/s

[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=2000M
invalidate=1

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 51561ffe 20-Jul-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: lock the transition from dirty to writeback for an eb

There is a small window where an eb can have no IO bits set on it, which
could potentially result in extent_buffer_under_io() returning false when we
want it to return true, which could result in not fun things happening. So
in order to protect this case we need to hold the refs_lock when we make
this transition to make sure we get reliable results out of
extent_buffer_udner_io(). Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 594831c4 20-Jul-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix potential race in extent buffer freeing

This sounds sort of impossible but it is the only thing I can think of and
at the very least it is theoretically possible so here it goes.

If we are in try_release_extent_buffer we will check that the ref count on
the extent buffer is 1 and not under IO, and then go down and clear the tree
ref. If between this check and clearing the tree ref somebody else comes in
and grabs a ref on the eb and the marks it dirty before
try_release_extent_buffer() does it's tree ref clear we can end up with a
dirty eb that will be freed while it is still dirty which will result in a
panic. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# e64860aa 20-Jul-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: don't return true in releasepage unless we actually freed the eb

I noticed while looking at an extent_buffer race that we will
unconditionally return 1 if we get down to release_extent_buffer after
clearing the tree ref. However we can easily race in here and get a ref on
the eb and not actually free the eb. So make release_extent_buffer return 1
if it free'd the eb and 0 if not so we can be a little kinder to the vm.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d5b025d5 02-Jul-2012 Anand Jain <anand.jain@oracle.com>

btrfs read error corrected message floods the console during recovery

Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 10983f2e 11-Jul-2012 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: fix typo in convert_extent_bit

It should be convert_extent_bit.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 7fd1a3f7 27-Jun-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: hold a ref on the inode during writepages

We can race with unlink and not actually be able to do our igrab in
btrfs_add_ordered_extent. This will result in all sorts of problems.
Instead of doing the complicated work to try and handle returning an error
properly from btrfs_add_ordered_extent, just hold a ref to the inode during
writepages. If we cannot grab a ref we know we're freeing this inode anyway
and can just drop the dirty pages on the floor, because screw them we're
going to invalidate them anyway. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 606686ee 04-Jun-2012 Josef Bacik <josef@redhat.com>

Btrfs: use rcu to protect device->name

Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>


# 442a4f63 25-May-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add device counters for detected IO and checksum errors

The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# d1ac6e41 10-May-2012 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: use fastpath in extent state ops as much as possible

Fully utilize our extent state's new helper functions to use
fastpath as much as possible.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>


# 5fd02043 02-May-2012 Josef Bacik <josef@redhat.com>

Btrfs: finish ordered extents in their own thread

We noticed that the ordered extent completion doesn't really rely on having
a page and that it could be done independantly of ending the writeback on a
page. This patch makes us not do the threaded endio stuff for normal
buffered writes and direct writes so we can end page writeback as soon as
possible (in irq context) and only start threads to do the ordered work when
it is actually done. Compression needs to be reworked some to take
advantage of this as well, but atm it has to do a find_get_page in its endio
handler so it must be done in its own thread. This makes direct writes
quite a bit faster. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# d7dbe9e7 23-Apr-2012 Josef Bacik <josef@redhat.com>

Btrfs: fix compile warnings in extent_io.c

These warnings are bogus since we will always have at least one page in an
eb, but to make the compiler happy just set ret = 0 in these two cases.
Thanks,
Btrfs: fix compile warnings in extent_io.c

These warnings are bogus since we will always have at least one page in an
eb, but to make the compiler happy just set ret = 0 in these two cases.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 815a51c7 16-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: dummy extent buffers for tree mod log

The tree modification log needs two ways to create dummy extent buffers,
once by allocating a fresh one (to rebuild an old root) and once by
cloning an existing one (to make private rewind modifications) to it.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# fd5e62a3 06-Apr-2012 Wang Sheng-Hui <shhuiw@gmail.com>

Btrfs: remove the useless assignment to *entry in function tree_insert of file extent_io.c

In tree_insert, var *entry is used in the loop only, and is useless
out of the loop. Remove the useless assignment after the loop.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>


# 477d7eaf 06-Apr-2012 Wang Sheng-Hui <shhuiw@gmail.com>

Btrfs: fix the comment for find_first_extent_bit

The return value of find_first_extent_bit is 1 or 0, no < 0.
And if found something, return 0; if nothing was found, return 1.
Fix the comment.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>


# 39bab87b 06-Apr-2012 Wang Sheng-Hui <shhuiw@gmail.com>

Btrfs: fix btrfs_release_extent_buffer_page with the right usage of num_extent_pages

num_extent_pages returns the number of pages in the specific range, not
the index of the last page in the eb range.

btrfs_release_extent_buffer_page is called with start_idx set 0 in current
codes, so it's not a problem yet. But the logic is indeed wrong.

Fix it here.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>


# 1b303fc0 06-Apr-2012 Wang Sheng-Hui <shhuiw@gmail.com>

Btrfs: cleanup the comment for clear_state_bit in extent_io.c

No 'delete' arg is used for clear_state_bit.
Cleanup the comment.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>


# 17de39ac 04-May-2012 Josef Bacik <josef@redhat.com>

Btrfs: fix page leak when allocing extent buffers

If we happen to alloc a extent buffer and then alloc a page and notice that
page is already attached to an extent buffer, we will only unlock it and
free our existing eb. Any pages currently attached to that eb will be
properly freed, but we don't do the page_cache_release() on the page where
we noticed the other extent buffer which can cause us to leak pages and I
hope cause the weird issues we've been seeing in this area. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5cf1ab56 16-Apr-2012 Josef Bacik <josef@redhat.com>

Btrfs: always store the mirror we read the eb from

A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# cdc6a395 12-Mar-2012 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: avoid possible use-after-free in clear_extent_bit()

clear_extent_bit()
{
next_node = rb_next(&state->rb_node);
...
clear_state_bit(state); <-- this may free next_node
if (next_node) {
state = rb_entry(next_node);
...
}
}

clear_state_bit() calls merge_state() which may free the next node
of the passing extent_state, so clear_extent_bit() may end up
referencing freed memory.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 8e52acf7 12-Mar-2012 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: retrurn void from clear_state_bit

Currently it returns a set of bits that were cleared, but this return
value is not used at all.

Moreover it doesn't seem to be useful, because we may clear the bits
of a few extent_states, but only the cleared bits of last one is
returned.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# e627ee7b 12-Apr-2012 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: check return value of bio_alloc() properly

bio_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d95603b2 12-Apr-2012 Chris Mason <chris.mason@oracle.com>

Btrfs: fix uninit variable in repair_eb_io_failure

We'd have to be passing bogus extent buffers for this uninit variable to
actually be used, but set it to zero just in case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ea466794 26-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: deal with read errors on extent buffers differently

Since we need to read and write extent buffers in their entirety we can't use
the normal bio_readpage_error stuff since it only works on a per page basis. So
instead make it so that if we see an io error in endio we just mark the eb as
having an IO error and then in btree_read_extent_buffer_pages we will manually
try other mirrors and then overwrite the bad mirror if we find a good copy.
This works with larger than page size blocks. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a098d8e8 20-Mar-2012 Chris Mason <chris.mason@oracle.com>

Btrfs: loop waiting on writeback

lock_extent_buffer_for_io needs to loop around and make sure the
writeback bits are not set.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0b32f4bb 13-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: ensure an entire eb is written at once

This patch simplifies how we track our extent buffers. Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5df4235e 15-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: introduce mark_extent_buffer_accessed

Because an eb can have multiple pages we need to make sure that all pages within
the eb are markes as accessed, since releasepage can be called against any page
in the eb. This will keep us from possibly evicting hot eb's when we're doing
larger than pagesize eb's. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 3083ee2e 09-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: introduce free_extent_buffer_stale

Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory. So instead of evicting these pages, we
could end up evicting things we actually care about. Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks. This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 115391d2 09-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: only use the existing eb if it's count isn't 0

We can run into a problem where we find an eb for our existing page already on
the radix tree but it has a ref count of 0. It hasn't yet been removed by RCU
yet so this can cause issues where we will use the EB after free. So do
atomic_inc_not_zero on the exists->refs and if it is zero just do
synchronize_rcu() and try again. We won't have to worry about new allocators
coming in since they will block on the page lock at this point. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 4f2de97a 07-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: set page->private to the eb

We spend a lot of time looking up extent buffers from pages when we could just
store the pointer to the eb the page is associated with in page->private. This
patch does just that, and it makes things a little simpler and reduces a bit of
CPU overhead involved with doing metadata IO. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 727011e0 06-Aug-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: allow metadata blocks larger than the page size

A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling. This fixes the code to properly support
large metadata blocks again.

Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.

This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 79787eaa 12-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: replace many BUG_ONs with proper error handling

btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.

This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 3fbe5c02 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: split extent_state ops

set_extent_bit can do exclusive locking but only when called by lock_extent*,

Drop the exclusive bits argument except when called by lock_extent.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# d0082371 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: drop gfp_t from lock_extent

lock_extent and unlock_extent are always called with GFP_NOFS, drop the
argument and use GFP_NOFS consistently.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 143bede5 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: return void in functions without error conditions

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 355808c2 03-Oct-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: ->submit_bio_hook error push-up

This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.

It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 3444a972 03-Oct-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: Factor out tree->ops->merge_bio_hook call

In submit_extent_page, there's a visually noisy if statement that, in
the midst of other conditions, does the tree dependency for tree->ops
and tree->ops->merge_bio_hook before calling it, and then another
condition afterwards. If an error is returned from merge_bio_hook,
there's no way to catch it. It's considered a routine "1" return
value instead of a failure.

This patch factors out the dependency check into a new local merge_bio
routine and BUG's on an error. The if statement is less noisy as a side-
effect.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 6763af84 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: Remove set bits return from clear_extent_bit

There is only one caller of clear_extent_bit that checks the return value
and it only checks if it's negative. Since there are no users of the
returned bits functionality of clear_extent_bit, stop returning it
and avoid complicating error handling.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# c2d904e0 03-Oct-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: Catch locking failures in {set,clear,convert}_extent_bit

The *_state functions can only return 0 or -EEXIST. This patch addresses
the cases where those functions returning -EEXIST represent a locking
failure. It handles them by panicking with an appropriate error message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 7ac687d9 25-Nov-2011 Cong Wang <amwang@redhat.com>

btrfs: remove the second argument of k[un]map_atomic()

Signed-off-by: Cong Wang <amwang@redhat.com>


# 50653190 21-Feb-2012 Chris Mason <chris.mason@oracle.com>

Btrfs: clear the extent uptodate bits during parent transid failures

If btrfs reads a block and finds a parent transid mismatch, it clears
the uptodate flags on the extent buffer, and the pages inside it. But
we only clear the uptodate bits in the state tree if the block straddles
more than one page.

This is from an old optimization from to reduce contention on the extent
state tree. But it is buggy because the code that retries a read from
a different copy of the block is going to find the uptodate state bits
set and skip the IO.

The end result of the bug is that we'll never actually read the good
copy (if there is one).

The fix here is to always clear the uptodate state bits, which is safe
because this code is only called when the parent transid fails.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 692e5759 16-Feb-2012 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: be less strict on finding next node in clear_extent_bit

In clear_extent_bit, it is enough that next node is adjacent in tree level.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>


# 9d47c767 16-Feb-2012 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: kick out redundant stuff in convert_extent_bit

clear_state_bit will do merge_state for us, so kick out the redundant one.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>


# 0449314a 16-Feb-2012 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: skip states when they does not contain bits to clear

Clearing a range's bits is different with setting them, since we don't
need to touch them when states do not contain bits we want.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>


# 285190d9 16-Feb-2012 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: check return value of lookup_extent_mapping() correctly

This patch corrects error checking of lookup_extent_mapping().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>


# 013bd4c3 15-Feb-2012 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: fix return value check of extent_io_ops

This patch adds the check on the return value of extent_io_ops.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>


# 87826df0 15-Feb-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: delalloc for page dirtied out-of-band in fixup worker

We encountered an issue that was easily observable on s/390 systems but
could really happen anywhere. The timing just seemed to hit reliably
on s/390 with limited memory.

The gist is that when an unexpected set_page_dirty() happened, we'd
run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
properly set up for delalloc.

This patch does the following:
- Performs the missing delalloc in the fixup worker
- Allow the start hook to return -EBUSY which informs __extent_writepage
that it should mark the page skipped and not to redirty it. This is
required since the fixup worker can fail with -ENOSPC and the page
will have already been redirtied. That causes an Oops in
drop_outstanding_extents later. Retrying the fixup worker could
lead to an infinite loop. Deferring the page redirty also saves us
some cycles since the page would be stuck in a resubmit-redirty loop
until the fixup worker completes. It's not harmful, just wasteful.
- If the fixup worker fails, we mark the page and mapping as errored,
and end the writeback, similar to what we would do had the page
actually been submitted to writeback.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 8bedd51b 26-Jan-2012 Mitch Harder <mitch.harder@sabayonlinux.org>

Btrfs: Check for NULL page in extent_range_uptodate

A user has encountered a NULL pointer kernel oops in btrfs when
encountering media errors. The problem has been identified
as an unhandled NULL pointer returned from find_get_page().
This modification simply checks for a NULL page, and returns
with an error if found (the extent_range_uptodate() function
returns 1 on errors).

After testing this patch, the user reported that the error with
the NULL pointer oops was solved. However, there is still a
remaining problem with a thread becoming stuck in
wait_on_page_locked(page) in the read_extent_buffer_pages(...)
function in extent_io.c

for (i = start_i; i < num_pages; i++) {
page = extent_buffer_page(eb, i);
wait_on_page_locked(page);
if (!PageUptodate(page))
ret = -EIO;
}

This patch leaves the issue with the locked page yet to be resolved.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5b25f70f 13-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: add nested locking mode for paths

This patch adds the possibilty to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 21adbd5c 09-Nov-2011 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: integrate integrity check module into btrfs

This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.

Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 1cf4ffdb 07-Dec-2011 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: drop spin lock when memory alloc fails

Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f4a8e656 01-Dec-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: fix meta data raid-repair merge problem

Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.

The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.

This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4d479cf0 17-Nov-2011 Josef Bacik <josef@redhat.com>

Btrfs: sectorsize align offsets in fiemap

We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 32240a91 20-Nov-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

btrfs: mirror_num should be int, not u64

My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 745c4d8e 20-Nov-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: Fix up 32/64-bit compatibility for new ioctls

This patch casts to unsigned long before casting to a pointer and fixes
the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bf0da8c1 03-Nov-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: ClearPageError during writepage and clean_tree_block

Failure testing was tripping up over stale PageError bits in
metadata pages. If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.

During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.

This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit. In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 01d658f2 01-Nov-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: make sure to flush queued bios if write_cache_pages waits

write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.

Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# fee187d9 29-Sep-2011 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>


# 462d6fac 26-Sep-2011 Josef Bacik <josef@redhat.com>

Btrfs: introduce convert_extent_bit

If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 9e487107 31-Jul-2011 Josef Bacik <josef@redhat.com>

Btrfs: skip looking for delalloc if we don't have ->fill_delalloc

We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 4bb31e92 10-Jun-2011 Arne Jansen <sensille@gmx.net>

btrfs: hooks for readahead

This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.

Changes for v2:
- eliminate race condition

Signed-off-by: Arne Jansen <sensille@gmx.net>


# bb82ab88 10-Jun-2011 Arne Jansen <sensille@gmx.net>

btrfs: add an extra wait mode to read_extent_buffer_pages

read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.

Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK

Change v6:
- fix bug introduced in v5

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 4a54c8c1 22-Jul-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

btrfs: Moved repair code from inode.c to extent_io.c

The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.

Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 8ddc7d9c 13-Jun-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

btrfs: add mirror_num to extent_read_full_page

Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 0d10ee2e 01-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: don't call writepages from within write_full_page

When doing a writepage we call writepages to try and write out any other dirty
pages in the area. This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 69261c4b 13-Jul-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

Btrfs: clean up for find_first_extent_bit()

find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ded91f08 13-Jul-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

Btrfs: clean up for wait_extent_bit()

We can just use cond_resched_lock().

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3150b699 13-Jul-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

Btrfs: clean up for insert_state()

Don't duplicate set_state_bits().

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1bf85046 21-Jul-2011 Jeff Mahoney <jeffm@suse.de>

btrfs: Make extent-io callbacks that never fail return void

The set/clear bit and the extent split/merge hooks only ever return 0.

Changing them to return void simplifies the error handling cases later.

This patch changes the hook prototypes, the single implementation of each,
and the functions that call them to return void instead.

Since all four of these hooks execute under a spinlock, they're necessarily
simple.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 19b6caf4 25-Jul-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: reduce extent_state lock contention for metadata

For metadata buffers that don't straddle pages (all of them), btrfs
can safely use the page uptodate bits and extent_buffer uptodate bit
instead of needing to use the extent_state tree.

This greatly reduces contention on the state tree lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bd681513 16-Jul-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: switch the btrfs tree locks to reader/writer

The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.

The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.

It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.

In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.

We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a6591715 18-Jul-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: stop using highmem for extent_buffers

The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.

The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.

This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f7aaa06b 15-Jul-2011 Josef Bacik <josef@redhat.com>

Btrfs: tag pages for writeback in sync

Everybody else does this, we need to do it too. If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# df98b6e2 20-Jun-2011 Josef Bacik <josef@redhat.com>

Btrfs: fix how we merge extent states and deal with cached states

First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed. This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree. So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state. This way we are sure to cache the correct state. Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# d46db3d5 04-May-2011 Wu Fengguang <fengguang.wu@intel.com>

writeback: make writeback_control.nr_to_write straight

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>


# c309df07 26-May-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: return -ENOMEM in clear_extent_bit

The btrfs releasepage function depends on ENOMEM coming
back when it is called atomic.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 90a887c9 26-May-2011 Dan Magenheimer <dan.magenheimer@oracle.com>

btrfs: add cleancache support

This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>


# c7f895a2 20-Apr-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

Btrfs: fix unsafe usage of merge_state

merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8233767a 20-Apr-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

Btrfs: allocate extent state and check the result properly

It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
allowed wait

- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
we trigger BUG_ON() if it's fail

- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
the return value of clear_extent_bit() is ignored by many callers

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# af60bed2 04-May-2011 Josef Bacik <josef@redhat.com>

Btrfs: set range_start to the right start in count_range_bits

In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results. For example, if the range

[0-8192]

is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096]. So instead set range_start = max(cur_start, state->start).
This makes everything come out right. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 268bb0ce 20-May-2011 Linus Torvalds <torvalds@linux-foundation.org>

sanitize <linux/prefetch.h> usage

Commit e66eed651fd1 ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f2a97a9d 04-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove all unused functions

Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>


# ba144192 20-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: drop gfp parameter from alloc_extent_buffer

pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>


# f09d1f60 20-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: drop gfp parameter from find_extent_buffer

pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>


# f993c883 20-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: drop unused argument from extent_io_tree_init

all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.

Signed-off-by: David Sterba <dsterba@suse.cz>


# c704005d 19-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: unify checking of IS_ERR and null

use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)

Signed-off-by: David Sterba <dsterba@suse.cz>


# 306e16ce 19-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: rename variables clashing with global function names

reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>


# 43e817a1 25-Apr-2011 Itaru Kitayama <kitayama@cl.bb4u.ne.jp>

btrfs: fix wrong allocating flag when reading page

the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.

Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 33345d01 19-Apr-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Always use 64bit inode number

There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 0d399205 16-Apr-2011 Chris Mason <chris.mason@oracle.com>

Btrfs end_bio_extent_readpage should look for locked bits

A recent commit caches the extent state in end_bio_extent_readpage,
but the search it does should look for locked extents. This
fixes things to make it more effective.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 109b36a2 12-Apr-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: make uncache_state unconditional

The extent_io code can take cached pointers into the extent state trees,
and these can make lookups much faster in common operations. The
caching only happens when specific bits are set that prevent merging
and splitting of the extent state.

A help function was added to uncache the state, and it was testing
the same set of conditionals. This can leak in very strange corner
cases where the lock bit goes away unexpectedly.

The uncaching should be unconditional. Once we have a ref on the
extent we should always give it up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 507903b8 06-Apr-2011 Arne Jansen <sensille@gmx.net>

btrfs: using cached extent_state in set/unlock combinations

In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3387206f 11-Apr-2011 Sergei Trofimovich <slyich@gmail.com>

btrfs: properly handle overlapping areas in memmove_extent_buffer

Fix data corruption caused by memcpy() usage on overlapping data.
I've observed it first when found out usermode linux crash on btrfs.

?all chain is the following:
------------[ cut here ]------------
WARNING: at /home/slyfox/linux-2.6/fs/btrfs/extent_io.c:3900 memcpy_extent_buffer+0x1a5/0x219()
Call Trace:
6fa39a58: [<601b495e>] _raw_spin_unlock_irqrestore+0x18/0x1c
6fa39a68: [<60029ad9>] warn_slowpath_common+0x59/0x70
6fa39aa8: [<60029b05>] warn_slowpath_null+0x15/0x17
6fa39ab8: [<600efc97>] memcpy_extent_buffer+0x1a5/0x219
6fa39b48: [<600efd9f>] memmove_extent_buffer+0x94/0x208
6fa39bc8: [<600becbf>] btrfs_del_items+0x214/0x473
6fa39c78: [<600ce1b0>] btrfs_delete_one_dir_name+0x7c/0xda
6fa39cc8: [<600dad6b>] __btrfs_unlink_inode+0xad/0x25d
6fa39d08: [<600d7864>] btrfs_start_transaction+0xe/0x10
6fa39d48: [<600dc9ff>] btrfs_unlink_inode+0x1b/0x3b
6fa39d78: [<600e04bc>] btrfs_unlink+0x70/0xef
6fa39dc8: [<6007f0d0>] vfs_unlink+0x58/0xa3
6fa39df8: [<60080278>] do_unlinkat+0xd4/0x162
6fa39e48: [<600517db>] call_rcu_sched+0xe/0x10
6fa39e58: [<600452a8>] __put_cred+0x58/0x5a
6fa39e78: [<6007446c>] sys_faccessat+0x154/0x166
6fa39ed8: [<60080317>] sys_unlink+0x11/0x13
6fa39ee8: [<60016b80>] handle_syscall+0x58/0x70
6fa39f08: [<60021377>] userspace+0x2d4/0x381
6fa39fc8: [<60014507>] fork_handler+0x62/0x69
---[ end trace 70b0ca2ef0266b93 ]---

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg09302.html

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1abe9b8a 24-Mar-2011 liubo <liubo2009@cn.fujitsu.com>

Btrfs: add initial tracepoint support for btrfs

Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
btrfs_sync_file
btrfs_sync_fs

These show sync arguments.

6) transaction:
btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 85026533 15-Mar-2011 Josef Bacik <josef@redhat.com>

Btrfs: return error if the range we want to map is bogus

Currently if we have corrupt metadata map_extent_buffer will complain about it,
but not return an error so the caller has no idea a problem was hit. Fix this.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 721a9602 09-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: kill off REQ_UNPLUG

With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# ea8efc74 08-Mar-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: make sure not to return overlapping extents to fiemap

The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases. cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.

The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ec29ed5b 23-Feb-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: fix fiemap bugs with delalloc

The Btrfs fiemap code wasn't properly returning delalloc extents,
so applications that trust fiemap to decide if there are holes in the
file see holes instead of delalloc.

This reworks the btrfs fiemap code, adding a get_extent helper that
searches for delalloc ranges and also adding a helper for extent_fiemap
that skips past holes in the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e3f24cc5 13-Feb-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: don't release pages when we can't clear the uptodate bits

Btrfs tracks uptodate state in an rbtree as well as in the
page bits. This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.

But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.

The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits. This
can be difficult when our clearing function is called GFP_ATOMIC, which
can happen in the releasepage path.

So, what happens today looks like this:

releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success

The end result is the page being gone, but btrfs thinking the range is
up to date. Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.

This commit fixes things to fail the releasepage when we can't clear the
extent state bits. It covers both data pages and metadata tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# eb14ab8e 09-Feb-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: fix page->private races

There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata. Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.

This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5df67083 01-Feb-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

btrfs: checking NULL or not in some functions

Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6b82ce8d 25-Jan-2011 liubo <liubo2009@cn.fujitsu.com>

btrfs: fix uncheck memory allocation in btrfs_submit_compressed_read

btrfs_submit_compressed_read() is lack of memory allocation checks and
corresponding error route.

After this fix, if it comes to "no memory" case, errno will be returned
to userland step by step, and tell users this operation cannot go on.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 91ca338d 04-Jan-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

btrfs: check NULL or not

Should check if functions returns NULL or not.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 261507a0 16-Dec-2010 Li Zefan <lizf@cn.fujitsu.com>

btrfs: Allow to add new compression algorithm

Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 975f84fe 23-Nov-2010 Josef Bacik <josef@redhat.com>

Btrfs: fix fiemap

There are two big problems currently with FIEMAP

1) We return extents for holes. This isn't supposed to happen, we just don't
return extents for holes and then userspace interprets the lack of an extent as
a hole.

2) We sometimes don't set FIEMAP_EXTENT_LAST properly. This is because we wait
to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
fiemap to map up to the last extent in a file, and there is nothing but holes up
to the i_size. To fix this we need to lookup the last extent in this file and
save the logical offset, so if we happen to try and map that extent we can be
sure to set FIEMAP_EXTENT_LAST.

With this patch we now pass xfstest 225, which we never have before.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 45f49bce 21-Nov-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: avoid NULL pointer deref in try_release_extent_buffer

If we fail to find a pointer in the radix tree, don't try
to deref the NULL one we do have.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 88f794ed 21-Nov-2010 Miao Xie <miaox@cn.fujitsu.com>

btrfs: cleanup duplicate bio allocating functions

extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 559af821 29-Oct-2010 Andi Kleen <andi@firstfloor.org>

Btrfs: cleanup warnings from gcc 4.6 (nonbugs)

These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 411fc6bc 29-Oct-2010 Andi Kleen <andi@firstfloor.org>

Btrfs: Fix variables set but not read (bugs found by gcc 4.6)

These are all the cases where a variable is set, but not
read which are really bugs.

- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things

Still needs more review.

Found by gcc 4.6's new warnings.

[akpm@linux-foundation.org: fix build. Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 19fe0a8b 26-Oct-2010 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: Switch the extent buffer rbtree into a radix tree

This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.

I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].

Before applying this patch:
Create files:
Total files: 50000
Total time: 0.971531
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.366761
Average time: 0.000027

After applying this patch:
Create files:
Total files: 50000
Total time: 0.927455
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.292280
Average time: 0.000026

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 897ca6e9 26-Oct-2010 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: restructure try_release_extent_buffer()

restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9c3a8ee8 09-Jun-2010 Christoph Hellwig <hch@lst.de>

writeback: remove writeback_inodes_wbc

This was just an odd wrapper around writeback_inodes_wb. Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 4845e44f 25-May-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: rework O_DIRECT enospc handling

This changes O_DIRECT write code to mark extents as delalloc
while it is processing them. Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.

There are a few space cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong. This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage. We have
to clear them ourselves as the DIO ends.

With this commit, we reserve space at in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.

btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# eaf25d93 25-May-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: use async helpers for DIO write checksumming

The async helper threads offload crc work onto all the
CPUs, and make streaming writes much faster. This
changes the O_DIRECT write code to use them. The only
small complication was that we need to pass in the
logical offset in the file for each bio, because we can't
find it in the bio's pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 11c65dcc 23-May-2010 Josef Bacik <josef@redhat.com>

Btrfs: do aio_write instead of write

In order for AIO to work, we need to implement aio_write. This patch converts
our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and
nothing broke, and the AIO stuff magically started working. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0ca1f7ce 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Update metadata reservation for delayed allocation

Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 28ecb609 17-Mar-2010 Nick Piggin <npiggin@suse.de>

Btrfs: use add_to_page_cache_lru, use __page_cache_alloc

Pagecache pages should be allocated with __page_cache_alloc, so they
obey pagecache memory policies.

add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes fewer stack used, percpu
lru ordering is preserved, and finally don't need to flush pagevec
before returning so batching may be shared with other LRU insertions.

Signed-off-by: Nick Piggin <npiggin@suse.de>:
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 2ac55d41 03-Feb-2010 Josef Bacik <josef@redhat.com>

Btrfs: cache the extent state everywhere we possibly can V2

This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function. This gives me about a 3 mb/s boots on my random write test. I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written. I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state. I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c2a128d2 02-Feb-2010 Josef Bacik <josef@redhat.com>

Btrfs: cache extent state in find_delalloc_range

This patch makes us cache the extent state we find in find_delalloc_range since
we'll have to lock the extent later on in the function. This will keep us from
re-searching for the rang when we try to lock the extent.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4125bf76 03-Feb-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: finish read pages in the order they are submitted

The endio is done at reverse order of bio vectors.

That means for a sequential read, the page first submitted will finish
last in a bio. Considering we will do checksum (making cache hot) for
every page, this does introduce delay (and chance to squeeze cache used
soon) for pages submitted at the begining.

I don't observe obvious performance difference with below patch at my
simple test, but seems more natural to finish read in the order they are
submitted.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6bef4d31 23-Feb-2010 Eric Paris <eparis@redhat.com>

Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULL

btrfs inialize rb trees in quite a number of places by settin rb_node =
NULL; The problem with this is that 17d9ddc72fb8bba0d4f678 in the
linux-next tree adds a new field to that struct which needs to be NULL for
the new rbtree library code to work properly. This patch uses RB_ROOT as
the intializer so all of the relevant fields will be NULL'd. Without the
patch I get a panic.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f044ba78 04-Feb-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: fix race between allocate and release extent buffer.

Increase extent buffer's reference count while holding the lock.
Otherwise it can race with try_release_extent_buffer.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 32c00aff 08-Oct-2009 Josef Bacik <josef@redhat.com>

Btrfs: release delalloc reservations on extent item insertion

This patch fixes an issue with the delalloc metadata space reservation
code. The problem is we used to free the reservation as soon as we
allocated the delalloc region. The problem with this is if we are not
inserting an inline extent, we don't actually insert the extent item until
after the ordered extent is written out. This patch does 3 things,

1) It moves the reservation clearing stuff into the ordered code, so when
we remove the ordered extent we remove the reservation.
2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
delalloc bits in the cases where we want to clear the metadata reservation
when we clear the delalloc extent, in the case that we do an inline extent
or we invalidate the page.
3) It adds another waitqueue to the space info so that when we start a fs
wide delalloc flush, anybody else who also hits that area will simply wait
for the flush to finish and then try to make their allocation.

This has been tested thoroughly to make sure we did not regress on
performance.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a791e35e 08-Oct-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: cleanup extent_clear_unlock_delalloc flags

extent_clear_unlock_delalloc has a growing set of ugly parameters
that is very difficult to read and maintain.

This switches to a flag field and well named flag defines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9ed74f2d 11-Sep-2009 Josef Bacik <josef@redhat.com>

Btrfs: proper -ENOSPC handling

At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying. Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.

For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents. This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.

The only place where the metadata space accounting is not done is in the
relocation code. This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic. This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.

This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation. It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook. These help
us keep track of when we merge delalloc extents together and split them
up. Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.

btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have. Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs. This has survived dbench, fs_mark and an fsx torture
test.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 11ef160f 23-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: fix releasepage to avoid unlocking extents we haven't locked

During releasepage, we try to drop any extent_state structs for the
bye offsets of the page we're releaseing. But the code was incorrectly
telling clear_extent_bit to delete the state struct unconditionallly.

Normally this would be fine because we have the page locked, but other
parts of btrfs will lock down an entire extent, the most common place
being IO completion.

releasepage was deleting the extent state without first locking the extent,
which may result in removing a state struct that another process had
locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED
bits alone in releasepage.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 46562cec 23-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix test_range_bit for whole file extents

If test_range_bit finds an extent that goes all the way to (u64)-1, it
can incorrectly wrap the u64 instead of treaing it like the end of
the address space.

This just adds a check for the highest possible offset so we don't wrap.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 42daec29 23-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: fix errors handling cached state in set/clear_extent_bit

Both set and clear_extent_bit allow passing a cached
state struct to reduce rbtree search times. clear_extent_bit
was improperly bypassing some of the checks around making sure
the extent state fields were correct for a given operation.

The fix used here (from Yan Zheng) is to use the hit_next
goto target instead of jumping all the way down to start clearing
bits without making sure the cached state was exactly correct
for the operation we were doing.

This also fixes up the setting of the start variable for both
ops in the case where we find an overlapping extent that
begins before the range we want to change. In both cases
we were incorrectly going backwards from the original
requested change.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f85d7d6c 18-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: properly honor wbc->nr_to_write changes

When btrfs fills a delayed allocation, it tries to increase
the wbc nr_to_write to cover a big part of allocation. The
theory is that we're doing contiguous IO and writing a few
more blocks will save seeks overall at a very low cost.

The problem is that extent_write_cache_pages could ignore
the new higher nr_to_write if nr_to_write had already gone
down to zero. We fix that by rechecking the nr_to_write
for every page that is processed in the pagevec.

This updates the math around bumping the nr_to_write value
to make sure we don't leave a tiny amount of IO hanging
around for the very end of a new extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8b62b72b 02-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Use PagePrivate2 to track pages in the data=ordered code.

Btrfs writes go through delalloc to the data=ordered code. This
makes sure that all of the data is on disk before the metadata
that references it. The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.

This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.

One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called. The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page become
dirty without going through the proper path.

These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.

This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.

As IO finishes, the PagePrivate2 bit is cleared and the ordered
accoutning is updated for each page.

At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9655d298 02-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: use a cached state for extent state operations during delalloc

This changes the btrfs code to find delalloc ranges in the extent state
tree to use the new state caching code from set/test bit. It reduces
one of the biggest causes of rbtree searches in the writeback path.

test_range_bit is also modified to take the cached state as a starting
point while searching.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d5550c63 02-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: don't lock bits in the extent tree during writepage

At writepage time, we have the page locked and we have the
extent_map entry for this extent pinned in the extent_map tree.
So, the page can't go away and its mapping can't change.

There is no need for the extra extent_state lock bits during writepage.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2c64c53d 02-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: cache values for locking extents

Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.

This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree. A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.

This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1edbb734 02-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: reduce CPU usage in the extent_state tree

Btrfs is currently mirroring some of the page state bits into
its extent state tree. The goal behind this was to use it in supporting
blocksizes other than the page size.

But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock. This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.

This commit:

* Adds the ability to lock an extent in the state tree and also set
other bits. The idea is to do locking and delalloc in one call

* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using
a combination of the page bits and the ordered write code for this
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e48c465b 11-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix new state initialization order

As the extent state tree is manipulated, there are call backs
that are used to take extra actions when different state bits are set
or cleared. One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.

When new states are inserted, this callback is being done before we
properly setup the new state. This hasn't caused problems before
because the lock bit was always done first, and the existing call backs
don't care about the lock bit.

This patch makes sure the state is properly setup before using the
callback, which is important for later optimizations that do more work
without using the lock bit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 890871be 02-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: switch extent_map to a rw lock

There are two main users of the extent_map tree. The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a97adc9f 07-Aug-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: use larger nr_to_write for larger extents

When btrfs fills a large delayed allocation extent, it is a good idea
to try and convince the write_cache_pages caller to go ahead and
write a good chunk of that extent. The extra IO is basically free
because we know it is contiguous.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 40431d6c 04-Aug-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: optimize set extent bit

The Btrfs set_extent_bit call currently searches the rbtree
every time it needs to find more extent_state objects to fill
the requested operation.

This adds a simple test with rb_next to see if the next object
in the tree was adjacent to the one we just found. If so,
we skip the search and just use the next object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5c939df5 27-May-2009 Yan Zheng <zheng.yan@oracle.com>

btrfs: Fix set/clear_extent_bit for 'end == (u64)-1'

There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b7967db7 27-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: remove #if 0 code

Btrfs had some old code sitting around under #if 0, this drops it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9601e3f6 13-Apr-2009 Christoph Hellwig <hch@lst.de>

Btrfs: kill btrfs_cache_create

Just use kmem_cache_create directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 11c8349b 20-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: fix oops on page->mapping->host during writepage

The extent_io writepage call updates the writepage index in the inode
as it makes progress. But, it was doing the update after unlocking the page,
which isn't legal because page->mapping can't be trusted once the page
is unlocked.

This lead to an oops, especially common with compression turned on. The
fix here is to update the writeback index before unlocking the page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d313d7a3 20-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: add a priority queue to the async thread helpers

Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
higher priority. But, the checksumming helper threads prevent it
from being fully effective.

There are two problems. First, a big queue of pending checksumming
will delay the synchronous IO behind other lower priority writes. Second,
the checksumming uses an ordered async work queue. The ordering makes sure
that IOs are sent to the block layer in the same order they are sent
to the checksumming threads. Usually this gives us less seeky IO.

But, when we start mixing IO priorities, the lower priority IO can delay
the higher priority IO.

This patch solves both problems by adding a high priority list to the async
helper threads, and a new btrfs_set_work_high_prio(), which is used
to make put a new async work item onto the higher priority list.

The ordering is still done on high priority IO, but all of the high
priority bios are ordered separately from the low priority bios. This
ordering is purely an IO optimization, it is not involved in data
or metadata integrity.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ffbd517d 20-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: use WRITE_SYNC for synchronous writes

Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future. This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.

Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios. The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.

This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 93dbfad7 03-Apr-2009 Heiko Carstens <hca@linux.ibm.com>

Btrfs: fix __ucmpdi2 compile bug on 32 bit builds

We get this on 32 builds:

fs/built-in.o: In function `extent_fiemap':
(.text+0x1019f2): undefined reference to `__ucmpdi2'

Happens because of a switch statement with a 64 bit argument.
Convert this to an if statement to fix this.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b9473439 13-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: leave btree locks spinning more often

btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a48ddf08 12-Feb-2009 Qinghuang Feng <qhfeng.kernel@gmail.com>

Btrfs: remove unused code in split_state()

These two lines are not used, remove them.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9b0d3ace 04-Feb-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: don't return congestion in write_cache_pages as often

On fast devices that go from congested to uncongested very quickly, pdflush
is waiting too often in congestion_wait, and the FS is backing off to
easily in write_cache_pages.

For now, fix this on the btrfs side by only checking congestion after
some bios have already gone down. Longer term a real fix is needed
for pdflush, but that is a larger project.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b4ce94de 04-Feb-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Change btree locking to use explicit blocking points

Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3935127c 04-Feb-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: disable leak debugging checks in extent_io.c

extent_io.c has debugging code to report and free leaked extent_state
and extent_buffer objects at rmmod time. This helps track down
leaks and it saves you from rebooting just to properly remove the
kmem_cache object.

But, the code runs under a fairly expensive spinlock and the checks to
see if it is currently enabled are not entirely consistent. Some use
#ifdef and some #if.

This changes everything to #if and disables the leak checking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1506fcc8 21-Jan-2009 Yehuda Sadeh <yehuda@hq.newdream.net>

Btrfs: fiemap support

Now that bmap support is gone, this is the only way to get extent
mappings for userland. These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


# 7eaebe7d 21-Jan-2009 Huang Weiyi <weiyi.huang@gmail.com>

Btrfs: removed unused #include <version.h>'s

Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 43b774ba 05-Jan-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: drop EXPORT symbols from extent_io.c

They should stay out until this is turned into generic code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d397712b 05-Jan-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix checkpatch.pl warnings

There were many, most are fixed now. struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c584482b 05-Jan-2009 Liu Hui <onlyflyer@gmail.com>

Btrfs: Fix typo in clear_state_cb

In clear_state_cb, we should check 'tree->ops->clear_bit_hook' instead
of 'tree->ops->set_bit_hook'.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cad321ad 17-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: shift all end_io work to thread pools

bio_end_io for reads without checksumming on and btree writes were
happening without using async thread pools. This means the extent_io.c
code had to use spin_lock_irq and friends on the rb tree locks for
extent state.

There were some irq safe vs unsafe lock inversions between the delallock
lock and the extent state locks. This patch gets rid of them by moving
all end_io code into the thread pools.

To avoid contention and deadlocks between the data end_io processing and the
metadata end_io processing yet another thread pool is added to finish
off metadata writes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 934d375b 08-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use map_private_extent_buffer during generic_bin_search

It is possible that generic_bin_search will be called on a tree block
that has not been locked. This happens because cache_block_block skips
locking on the tree blocks.

Since the tree block isn't locked, we aren't allowed to change
the extent_buffer->map_token field. Using map_private_extent_buffer
avoids any changes to the internal extent buffer fields.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d20f7043 08-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: move data checksumming into a dedicated tree

Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.

* There is potentitally one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b2950863 02-Dec-2008 Christoph Hellwig <hch@lst.de>

Btrfs: make things static and include the right headers

Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 641f5219 02-Dec-2008 Christoph Hellwig <hch@lst.de>

Btrfs: sparse lock verification annotations for wait_on_state

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0e6bd956 20-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: only flush down bios for writeback pages

The btrfs write_cache_pages call has a flush function so that it submits
the bio it has been building before it waits on any writeback pages.

This adds a check so that flush only happens on writeback pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 15916de8 19-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fixes for 2.6.28-rc API changes

* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d2c3f4f6 18-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Avoid writeback stalls

While building large bios in writepages, btrfs may end up waiting
for other page writeback to finish if WB_SYNC_ALL is used.

While it is waiting, the bio it is building has a number of pages with the
writeback bit set and they aren't getting to the disk any time soon. This
lowers the latencies of writeback in general by sending down the bio being
built before waiting for other pages.

The bio submission code tries to limit the total number of async bios in
flight by waiting when we're over a certain number of async bios. But,
the waits are happening while writepages is building bios, and this can easily
lead to stalls and other problems for people calling wait_on_page_writeback.

The current fix is to let the congestion tests take care of waiting.

sync() and others make sure to drain the current async requests to make
sure that everything that was pending when the sync was started really get
to disk. The code would drain pending requests both before and after
submitting a new request.

But, if one of the requests is waiting for page writeback to finish,
the draining waits might block that page writeback. This changes the
draining code to only wait after submitting the bio being processed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5b050f04 11-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix compile warnings on 32 bit machines

Simple casting here and there to fix things up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b47eda86 09-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Turn off extent state leak debugging

The extent_io.c code has a #define to find and cleanup extent state leaks
on module unmount. This adds a very highly contended spinlock to a
hot path for most FS operations.

Turn it off by default. A later changeset will add a .config option
for it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 39be25cd 10-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use invalidatepage when writepage finds a page outside of i_size

With all the recent fixes to the delalloc locking, it is now safe
again to use invalidatepage inside the writepage code for
pages outside of i_size. This used to deadlock against some of the
code to write locked ranges of pages, but all of that has been fixed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f2b1c41c 10-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Make sure pages are dirty before doing delalloc for them

This adds a PageDirty check to the writeback path that locks pages
for delalloc. If a page wasn't dirty at this point, it is in the
process of being truncated away.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3b7885bf 06-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: enforce metadata allocation clustering

The allocator uses the last allocation as a starting point for metadata
allocations, and tries to allocate in clusters of at least 256k.

If the search for a free block fails to find the expected block, this patch
forces a new cluster to be found in the free list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 771ed689 06-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Optimize compressed writeback and reads

When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.

Add an async work queue to handle transformations at delayed allocation processing
time. Right now this is just compression. The workflow is:

1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code

The async delalloc code clears the state lock bits and delalloc bits. It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.

The file pages are compressed, and if the compression doesn't work the
pages are written back directly.

An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.

This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 70b99e69 30-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Compression corner fixes

Make sure we keep page->mapping NULL on the pages we're getting
via alloc_page. It gets set so a few of the callbacks can do the right
thing, but in general these pages don't have a mapping.

Don't try to truncate compressed inline items in btrfs_drop_extents.
The whole compressed item must be preserved.

Don't try to create multipage inline compressed items. When we try to
overwrite just the first page of the file, we would have to read in and recow
all the pages after it in the same compressed inline items. For now, only
create single page inline items.

Make sure we lock pages in the correct order during delalloc. The
search into the state tree for delalloc bytes can return bytes before
the page we already have locked.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d899e052 30-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Add fallocate support v2
This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it. This leverages the
-o nodatacow checks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 6643558d 30-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Fix bookend extent race v2

When dropping middle part of an extent, btrfs_drop_extents truncates
the extent at first, then inserts a bookend extent.

Since truncation and insertion can't be done atomically, there is a small
period that the bookend extent isn't in the tree. This causes problem for
functions that search the tree for file extent item. The way to fix this is
lock the range of the bookend extent before truncation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 25179201 29-Oct-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: nuke fs wide allocation mutex V2

This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# c8b97818 29-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add zlib compression support

This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d352ac68 29-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: add and improve comments

This improves the comments at the top of many functions. It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5b21f2ed 26-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: extent_map and data=ordered fixes for space balancing

* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated. This allows us to do accounting for them properly.

* The balancing code relocates data extents indepdent of the underlying
inode. The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).

* Don't take the drop_mutex in the create_subvol ioctl. It isn't
required.

* Fix walking of the ordered extent list to avoid races with sys_unlink

* Change the lock ordering rules. Transaction start goes outside
the drop_mutex. This allows btrfs_commit_transaction to directly
drop the relocation trees.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2b1f55b0 24-Sep-2008 Chris Mason <chris.mason@oracle.com>

Remove Btrfs compat code for older kernels

Btrfs had compatibility code for kernels back to 2.6.18. These have
been removed, and will be maintained in a separate backport
git tree from now on.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 31840ae1 23-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: Full back reference support

This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0f9dd46c 23-Sep-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: free space accounting redo

1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.

4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4bef0848 08-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Tree logging fixes

* Pin down data blocks to prevent them from being reallocated like so:

trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file

Before the tree logging code, this was legal because the fsync
would commit the transation that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.

With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent. If we crash,
we're left with two different files using the extent.

* Don't wait in start_transaction if log replay is going on. This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.

* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory

* Hold the buffer lock while we mark a buffer as written. This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b214107e 05-Sep-2008 Christoph Hellwig <hch@lst.de>

Btrfs: trivial sparse fixes

Fix a bunch of trivial sparse complaints.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a1b32a59 05-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add debugging checks to track down corrupted metadata

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 902b22f3 20-Aug-2008 David Woodhouse <David.Woodhouse@intel.com>

Btrfs: Remove broken optimisations in end_bio functions.

These ended up freeing objects while they were still using them. Under
guidance from Chris, just rip out the 'clever' bits and do things the
simple way.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2db04966 07-Aug-2008 David Woodhouse <dwmw2@infradead.org>

Btrfs: Change TestSetPageLocked() to trylock_page()

Add backwards compatibility in compat.h

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
compat.h | 3 +++
extent_io.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletions(-)

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0ee0fda0 30-Jul-2008 Sven Wegener <sven.wegener@stealer.net>

Btrfs: Add compatibility for kernels >= 2.6.27-rc1

Add a couple of #if's to follow API changes.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bcc63abb 30-Jul-2008 Yan <zheng.yan@oracle.com>

Btrfs: implement memory reclaim for leaf reference cache

The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 33958dc6 30-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix verify_parent_transid

It was incorrectly clearing the up to date flag on the buffer even
when the buffer properly verified.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 89642229 24-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Search data ordered extents first for checksums on read

Checksum items are not inserted into the tree until all of the io from a
given extent is complete. This means one dirty page from an extent may
be written, freed, and then read again before the entire extent is on disk
and the checksum item is inserted.

The checksums themselves are stored in the ordered extent so they can
be inserted in bulk when IO is complete. On read, if a checksum item isn't
found, the ordered extents were being searched for a checksum record.

This all worked most of the time, but the checksum insertion code tries
to reduce the number of tree operations by pre-inserting checksum items
based on i_size and a few other factors. This means the read code might
find a checksum item that hasn't yet really been filled in.

This commit changes things to check the ordered extents first and only
dive into the btree if nothing was found. This removes the need for
extra locking and is more reliable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f421950f 22-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix some data=ordered related data corruptions

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree. The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed. It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a61e6f29 22-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use a mutex in the extent buffer for tree block locking

This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.

The mutexes alone don't fix either problem, but they are the first step.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6af118ce 22-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Index extent buffers in an rbtree

Before, extent buffers were a temporary object, meant to map a number of pages
at once and collect operations on them.

But, a few extra fields have crept in, and they are also the best place to
store a per-tree block lock field as well. This commit puts the extent
buffers into an rbtree, and ensures a single extent buffer for each
tree block.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7f3c74fb 17-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Keep extent mappings in ram until pending ordered extents are done

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 211f90e6 18-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is set

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 247e743c 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use async helpers to deal with pages that have been improperly dirtied

Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly setup for IO and
kicks off an async helper to deal with them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e6dcd2dc 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: New data=ordered implementation

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 079899c2 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Change find_extent_buffer to use TestSetPageLocked

This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 925baedd 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Start btree concurrency work.

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes. This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree. Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked. Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 211c17f5 15-May-2008 Chris Mason <chris.mason@oracle.com>

Fix corners in writepage and btrfs_truncate_page

The extent_io writepage calls needed an extra check for discarding
pages that started on th last byte in the file.

btrfs_truncate_page needed checks to make sure the page was still part
of the file after reading it, and most importantly, needed to wait for
all IO to the page to finish before freeing the corresponding extents on
disk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1259ab75 12-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Handle write errors on raid1 and raid10

When duplicate copies exist, writes are allowed to fail to one of those
copies. This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4235298e 28-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Drop some verbose printks

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5e478dc9 25-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: write_cache_pages came in 2.6.22

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 004fb575 25-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: write_extent_pages came in 2.6.23

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e1c4b745 22-Apr-2008 Chris Mason <chris.mason@oracle.com>

Fix btrfs_get_extent and get_block corner cases, and disable O_DIRECT reads

The generic O_DIRECT code assumes all the bios have the same bdev,
which isn't true for multi-device btrfs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7b13b7b1 18-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Don't drop extent_map cache during releasepage on the btree inode

The btree inode should only have a single extent_map in the cache,
it doesn't make sense to ever drop it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 41471e83 17-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Remove bogus max_sector warnings from the extent_io code

It was testing the bio before doing logical->physical mapping, so the
test was always wrong.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3b951516 17-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use the extent map cache to find the logical disk block during data retries

The data read retry code needs to find the logical disk block before it
can resubmit new bios. But, finding this block isn't allowed to take
the fs_mutex because that will deadlock with a number of different callers.

This changes the retry code to use the extent map cache instead, but
that requires the extent map cache to have the extent we're looking for.
This is a problem because btrfs_drop_extent_cache just drops the entire
extent instead of the little tiny part it is invalidating.

The bulk of the code in this patch changes btrfs_drop_extent_cache to
invalidate only a portion of the extent cache, and changes btrfs_get_extent
to deal with the results.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 594994aa 11-Apr-2008 Miguel <miguel.filipe@gmail.com>

Btrfs: define write_cache_pages for linux kernel <= 2.6.20 instead

write_cache_pages doesn't exist in linux 2.6.20, change the #if
condition to match that.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7e38326f 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Handle checksumming errors while reading data blocks

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f188591e 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Retry metadata reads in the face of checksum failures

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ce9adaa5 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Do metadata checksums for reads via a workqueue

Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.

But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.

The new code validates checksums when the page is read. It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 728131d8 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add additional debugging for metadata checksum failures

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2b114d1d 01-Apr-2008 Peter <htmldeveloper@gmail.com>

Btrfs: Correct usage of IS_ERR() in extent_io.c

Signed-off-by: Peter Teoh <htmldeveloper@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2d2ae547 26-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add leak debugging for extent_buffer and extent_state

This also fixes one leak around the super block when failing to mount the
FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 239b14b3 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Bring back mount -o ssd optimizations

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0b86a832 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for multiple devices per filesystem

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 065631f6 19-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: checksum file data at bio submission time instead of during writepage

When we checkum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.

This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.

Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d7fc640e 17-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Allocator improvements

Reduce CPU time searching for free blocks by optimizing find_first_extent_bit

Fix find_free_extent to make better use of the last_alloc hint. Before it
was often finding blocks just before the hint.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 39b5637f 15-Feb-2008 Yan <yanzheng@21cn.com>

Btrfs: Fix "no csum found for inode" issue.

A few codes were not properly updated for changes of extent map. This
may be the causes of "no csum found for inode" issue.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a86c12c7 07-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Create larger bios for btree blocks

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 961d0232 06-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Don't case unsigned long to int in bio submission

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c2e639f0 04-Feb-2008 Yan <yanzheng@21cn.com>

Btrfs: Fix typo in extent_io.c
---

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ae9d1285 01-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix delalloc account on state deletion

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 80ea96b1 01-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a lookup cache to the extent state tree

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b0c68f8b 31-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Enable delalloc accounting

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6f568d35 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: mount -o max_inline=size to control the maximum inline extent size

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 291d673e 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Do delalloc accounting via hooks in the extent_state code

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 85e21bac 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: During deletes and truncate, remove many items at once from the tree

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 70dec807 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: extent_io and extent_state optimizations

The end_bio routines are changed to take a pointer to the extent state
struct, and the state tree is walked in order to set/clear appropriate
bits as IO completes. This greatly reduces the number of rbtree searches
done by the end_bio handlers, and reduces lock contention.

The extent_io releasepage function is changed to avoid expensive searches
for locked state.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d1310b2e 24-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Split the extent_map code into two parts

There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>